New MLMM tasks #99

malteos · 2023-11-08T07:31:39Z

This PR adds the multilingual tasks from https://github.com/nlp-uoregon/mlmm-evaluation/ (arc, hellaswag, mmlu, truthfulqa).

The datasets support 26 languages: Russian, German, Chinese, French, Spanish, Italian, Dutch, Vietnamese, Indonesian, Arabic, Hungarian, Romanian, Danish, Slovak, Ukrainian, Catalan, Serbian, Croatian, Hindi, Bengali, Tamil, Nepali, Malayalam, Marathi, Telugu, and Kannada.

All task data is mirrored to the HF hub (e.g., https://huggingface.co/datasets/malteos/m_truthfulqa) and is downloaded automatically.

All tasks:

'mlmm_arc_ar', 'mlmm_arc_bn', 'mlmm_arc_ca', 'mlmm_arc_da', 'mlmm_arc_de', 'mlmm_arc_es', 'mlmm_arc_eu', 'mlmm_arc_fr', 'mlmm_arc_gu', 'mlmm_arc_hi', 'mlmm_arc_hr', 'mlmm_arc_hu', 'mlmm_arc_hy', 'mlmm_arc_id', 'mlmm_arc_it', 'mlmm_arc_kn', 'mlmm_arc_ml', 'mlmm_arc_mr', 'mlmm_arc_ne', 'mlmm_arc_nl', 'mlmm_arc_pt', 'mlmm_arc_ro', 'mlmm_arc_ru', 'mlmm_arc_sk', 'mlmm_arc_sr', 'mlmm_arc_sv', 'mlmm_arc_ta', 'mlmm_arc_te', 'mlmm_arc_uk', 'mlmm_arc_vi', 'mlmm_arc_zh', 'mlmm_hellaswag_ar', 'mlmm_hellaswag_bn', 'mlmm_hellaswag_ca', 'mlmm_hellaswag_da', 'mlmm_hellaswag_de', 'mlmm_hellaswag_es', 'mlmm_hellaswag_eu', 'mlmm_hellaswag_fr', 'mlmm_hellaswag_gu', 'mlmm_hellaswag_hi', 'mlmm_hellaswag_hr', 'mlmm_hellaswag_hu', 'mlmm_hellaswag_hy', 'mlmm_hellaswag_id', 'mlmm_hellaswag_it', 'mlmm_hellaswag_kn', 'mlmm_hellaswag_ml', 'mlmm_hellaswag_mr', 'mlmm_hellaswag_ne', 'mlmm_hellaswag_nl', 'mlmm_hellaswag_pt', 'mlmm_hellaswag_ro', 'mlmm_hellaswag_ru', 'mlmm_hellaswag_sk', 'mlmm_hellaswag_sr', 'mlmm_hellaswag_sv', 'mlmm_hellaswag_ta', 'mlmm_hellaswag_te', 'mlmm_hellaswag_uk', 'mlmm_hellaswag_vi', 'mlmm_hellaswag_zh', 'mlmm_mmlu_ar', 'mlmm_mmlu_bn', 'mlmm_mmlu_ca', 'mlmm_mmlu_da', 'mlmm_mmlu_de', 'mlmm_mmlu_es', 'mlmm_mmlu_eu', 'mlmm_mmlu_fr', 'mlmm_mmlu_gu', 'mlmm_mmlu_hi', 'mlmm_mmlu_hr', 'mlmm_mmlu_hu', 'mlmm_mmlu_hy', 'mlmm_mmlu_id', 'mlmm_mmlu_it', 'mlmm_mmlu_kn', 'mlmm_mmlu_ml', 'mlmm_mmlu_mr', 'mlmm_mmlu_ne', 'mlmm_mmlu_nl', 'mlmm_mmlu_pt', 'mlmm_mmlu_ro', 'mlmm_mmlu_ru', 'mlmm_mmlu_sk', 'mlmm_mmlu_sr', 'mlmm_mmlu_sv', 'mlmm_mmlu_ta', 'mlmm_mmlu_te', 'mlmm_mmlu_uk', 'mlmm_mmlu_vi', 'mlmm_mmlu_zh', 'mlmm_truthfulqa_ar', 'mlmm_truthfulqa_bn', 'mlmm_truthfulqa_ca', 'mlmm_truthfulqa_da', 'mlmm_truthfulqa_de', 'mlmm_truthfulqa_es', 'mlmm_truthfulqa_eu', 'mlmm_truthfulqa_fr', 'mlmm_truthfulqa_gu', 'mlmm_truthfulqa_hi', 'mlmm_truthfulqa_hr', 'mlmm_truthfulqa_hu', 'mlmm_truthfulqa_hy', 'mlmm_truthfulqa_id', 'mlmm_truthfulqa_it', 'mlmm_truthfulqa_kn', 'mlmm_truthfulqa_ml', 
'mlmm_truthfulqa_mr', 'mlmm_truthfulqa_ne', 'mlmm_truthfulqa_nl', 'mlmm_truthfulqa_pt', 'mlmm_truthfulqa_ro', 'mlmm_truthfulqa_ru', 'mlmm_truthfulqa_sk', 'mlmm_truthfulqa_sr', 'mlmm_truthfulqa_sv', 'mlmm_truthfulqa_ta', 'mlmm_truthfulqa_te', 'mlmm_truthfulqa_uk', 'mlmm_truthfulqa_vi', 'mlmm_truthfulqa_zh'
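
The task names above follow a regular `mlmm_{benchmark}_{lang}` pattern, so the list can be regenerated rather than typed out. The following is a minimal sketch, not the code from this PR: the helper `mlmm_task_names` and the two constants are hypothetical names, and the language codes are simply copied from the list above.

```python
from itertools import product

# Benchmarks and language codes as they appear in the task list above.
BENCHMARKS = ["arc", "hellaswag", "mmlu", "truthfulqa"]
LANG_CODES = [
    "ar", "bn", "ca", "da", "de", "es", "eu", "fr", "gu", "hi", "hr",
    "hu", "hy", "id", "it", "kn", "ml", "mr", "ne", "nl", "pt", "ro",
    "ru", "sk", "sr", "sv", "ta", "te", "uk", "vi", "zh",
]


def mlmm_task_names(benchmarks=BENCHMARKS, langs=LANG_CODES):
    """Hypothetical helper: build one task name per (benchmark, language) pair."""
    return [f"mlmm_{bench}_{lang}" for bench, lang in product(benchmarks, langs)]


task_names = mlmm_task_names()
print(len(task_names), task_names[0], task_names[-1])
```

With 4 benchmarks and the 31 language codes in the list, this yields 124 names in the same order as above.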

jjbuschhoff · 2023-11-09T12:57:18Z

lm_eval/tasks/mlmm/multilingual_truthfulqa.py

+"""
+
+# The default QA preset prompt for all models.
+QA_PROMPT = (


The default QA prompt is in English, which may bias the results in favour of English when compared to the base truthful_qa benchmark.

Fair point. But I would keep it as it is to be comparable with the literature.

Alternatively, we can make language-specific prompts but then the task should be named differently.

jjbuschhoff · 2023-11-09T13:48:08Z

lm_eval/tasks/mlmm/multilingual_mmlu.py

+    NUM_FEW_SHOT = 25
+    DATASET_NAME = None
+
+    def __init__(self, lang):


The MMLU dataset is subdivided into various categories (cf. https://github.com/OpenGPTX/lm-evaluation-harness/blob/master/lm_eval/tasks/hendrycks_test.py). Having this as an option would be useful for comparison with the English task. However, the dataset as loaded here is not split by subject, so the dataset builder script would have to be modified.

Subject-level tasks are added in the latest commit. To achieve that, I uploaded the datasets to HF and updated the builder script to include the subject.
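
For illustration, subject-level selection could look roughly like the sketch below. It assumes each record exposes a `subject` field (the field name is an assumption, as are the helper `filter_by_subject` and the toy records); the actual builder script and task classes in this PR may be organised differently.

```python
from typing import Iterable, Iterator, Optional


def filter_by_subject(docs: Iterable[dict], subject: Optional[str] = None) -> Iterator[dict]:
    """Yield all docs when subject is None, otherwise only docs of that subject.

    The "subject" key is an assumed field name, not confirmed by the PR.
    """
    for doc in docs:
        if subject is None or doc.get("subject") == subject:
            yield doc


# Toy records standing in for a loaded validation split (placeholders only).
docs = [
    {"subject": "astronomy", "question": "Q1"},
    {"subject": "anatomy", "question": "Q2"},
    {"subject": "astronomy", "question": "Q3"},
]

astronomy_docs = list(filter_by_subject(docs, "astronomy"))
all_docs = list(filter_by_subject(docs))
print(len(astronomy_docs), len(all_docs))
```

Keeping `subject=None` as the default mirrors the `__init__(self, lang, subject=None)` signature in the diff, so the aggregate task keeps working unchanged.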

jjbuschhoff · 2023-11-09T13:58:08Z

lm_eval/tasks/mlmm/multilingual_mmlu.py

+        return True
+
+    def validation_docs(self):
+        return map(self._process_doc, self.dataset["validation"])


Strangely, the dataset builder loads {lang}_dev.json as the validation split instead of {lang}_val.json.

This could easily be changed, but I could also keep it as it is for consistency.
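
To make the choice concrete, here is a small sketch of how the validation file name could be selected. The function `validation_file` and its `use_dev` flag are hypothetical; only the two file name patterns come from the comment above.

```python
def validation_file(lang: str, use_dev: bool = True) -> str:
    """Return the JSON file used for the validation split of one language.

    use_dev=True reproduces the behaviour described above ({lang}_dev.json);
    use_dev=False would switch to {lang}_val.json instead.
    """
    suffix = "dev" if use_dev else "val"
    return f"{lang}_{suffix}.json"


current_choice = validation_file("de")
alternative = validation_file("de", use_dev=False)
print(current_choice, alternative)
```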

malteos · 2023-11-21T16:23:24Z

See my comments + updates. @jjbuschhoff

jjbuschhoff

Works as expected when using the dataset at malteos/m_mmlu.

jjbuschhoff · 2023-11-29T08:42:16Z

lm_eval/tasks/mlmm/multilingual_mmlu.py

@@ -49,9 +123,10 @@ class GeneralHendrycksTest(MultipleChoiceTask):
    NUM_FEW_SHOT = 25
    DATASET_NAME = None

-    def __init__(self, lang):
+    def __init__(self, lang, subject=None):
        self.DATASET_NAME = f"mmlu_{lang}"
        self.DATASET_PATH = get_mlmm_dataset_path("datasets/m_mmlu")


Shouldn't this be self.DATASET_PATH = "malteos/m_mmlu" instead?

malteos · 2023-12-04T11:34:35Z

@jjbuschhoff HF data is now fully integrated and automatically downloaded. Can this be merged then?

malteos requested review from KlaudiaTH and jjbuschhoff November 8, 2023 07:31

Added MLMM tasks

2233a77

malteos force-pushed the mlmm branch from f27e912 to 2233a77 Compare November 8, 2023 09:43

fixed truthfulqa instantiation

804ebdc

jjbuschhoff reviewed Nov 9, 2023

jjbuschhoff mentioned this pull request Nov 16, 2023

Integrated various German tasks #97

Closed

malteos added 2 commits November 21, 2023 17:18

m_mmlu with subject tasks

066235c

Merge branch 'mlmm' of github.com:OpenGPTX/lm-evaluation-harness into…

e87f754

… mlmm

malteos requested a review from jjbuschhoff November 21, 2023 16:23

jjbuschhoff requested changes Nov 29, 2023

use HF hub data loaders

c8b5316

jjbuschhoff approved these changes Dec 4, 2023

jjbuschhoff merged commit 0822bb2 into master Dec 4, 2023
