Integrated various German tasks #97

jjbuschhoff · 2023-10-20T15:45:53Z

Cherry-picked some German task implementations from https://github.com/bjoernpl/lm-evaluation-harness-de/tree/mmlu_de.
Contains Hellaswag_de, TruthfulQA_de, ARC-Challenge_de and HendrycksTest_de.

lm_eval/tasks/opengptx/hellaswag_de.py

KlaudiaTH · 2023-11-05T18:17:53Z

lm_eval/tasks/opengptx/hendrycks_test_de.py

+    DATASET_NAME = None
+
+    def __init__(self, subject):
+        self.DATASET_NAME = subject


There seems to be a problem with the dataset: Even when a dataset config is specified, documents from many different subjects are returned, e.g. for ogx_hendrycksTest_de-abstract_algebra:

Es folgen multiple-choice Fragen (mit Antworten) über das Thema Abstrakte Algebra.

Frage: Welche der folgenden listet die elektromagnetischen Spektralbereiche in absteigender Reihenfolge der Wellenlänge auf?
Optionen:
A. Ultraviolett, sichtbar, Infrarot, Röntgen
B. Röntgen, sichtbar, Ultraviolett, Infrarot
C. Röntgen, Ultraviolett, sichtbar, Infrarot
D. Infrarot, sichtbar, Ultraviolett, Röntgen
Antwort:

This question is actually a physics question, and not from abstract algebra.

If this cannot be worked around, it probably doesn't make sense to include this benchmark before the bug is fixed in the HF dataset.

It does indeed appear that the HF dataset builder script is faulty and supplies the entire test set across all subjects, regardless of which subject is passed to datasets.load_dataset().
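One way to confirm the mix-up, and possibly to work around it, is to count the subjects in a loaded split and filter on the task side. The sketch below is only an illustration: it assumes each document carries a per-document `subject` field, which this thread does not confirm for the HF dataset, and the helper names `subject_counts` and `filter_by_subject` as well as the toy documents are made up for the example.

```python
from collections import Counter


def subject_counts(docs, subject_key="subject"):
    """Count documents per subject so a mixed-subject split is easy to spot."""
    # ASSUMPTION: every document exposes its subject under `subject_key`.
    return Counter(doc.get(subject_key, "<missing>") for doc in docs)


def filter_by_subject(docs, subject, subject_key="subject"):
    """Keep only the documents whose subject field matches the requested one."""
    return [doc for doc in docs if doc.get(subject_key) == subject]


# Toy stand-in for a split that mixes subjects (hypothetical data, not taken
# from the real dataset).
docs = [
    {"subject": "abstract_algebra", "question": "q1"},
    {"subject": "conceptual_physics", "question": "q2"},
    {"subject": "abstract_algebra", "question": "q3"},
]

counts = subject_counts(docs)
filtered = filter_by_subject(docs, "abstract_algebra")
print(counts)
print(len(filtered))
```

If the dataset has no such field, this filter cannot work and the fix would have to happen in the builder script itself.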

lm_eval/tasks/opengptx/truthfulqa_de.py

jjbuschhoff · 2023-11-16T12:01:56Z

I'd advocate in favour of closing this since #99 is a superset of the tasks implemented here.

malteos · 2023-11-21T15:03:03Z

> I'd advocate in favour of closing this since #99 is a superset of the tasks implemented here.

Are you sure about this? Although they are technically the same tasks, I think the translations were created differently. I'd recommend implementing both variants so that results remain comparable to the literature.

jjbuschhoff added 2 commits October 20, 2023 17:40

implemented german tasks

477b3a4

linting

4702f70

KlaudiaTH marked this pull request as ready for review November 5, 2023 17:59

KlaudiaTH reviewed Nov 7, 2023

jjbuschhoff added 2 commits November 7, 2023 17:17

remove non-functional parts of hendrycks_test_de

52e82af

provided fewshot error message

7ead5c1

jjbuschhoff closed this Nov 16, 2023
