mmlu translated professionally by OpenAI #2312

Open · wants to merge 25 commits into main

Commits (changes shown from 24 of 25 commits)
ca2dd8b
First commit.
giuliolovisotto Sep 17, 2024
bad44e7
Update README.md (#2297)
SYusupov Sep 17, 2024
b1aedcd
repr bug (#2315)
baberabb Sep 17, 2024
a0d66fb
Update neuron backend (#2314)
dacorvo Sep 18, 2024
5be4d01
Fixed dummy model (#2339)
Am1n3e Sep 24, 2024
b59b9b9
add a note for missing dependencies (#2336)
eldarkurtic Sep 24, 2024
81b13e4
load metric with `evaluate` (#2351)
baberabb Sep 26, 2024
2476ad3
fix writeout script (#2350)
baberabb Sep 26, 2024
bebc227
Treat tags in python tasks the same as yaml tasks (#2288)
giuliolovisotto Sep 26, 2024
e6726a7
change group to tags in task `eus_exams` task configs (#2320)
baberabb Sep 26, 2024
969ea42
change glianorex to test split (#2332)
baberabb Sep 26, 2024
4ca6f35
mmlu-pro: add newlines to task descriptions (not leaderboard) (#2334)
baberabb Sep 26, 2024
641411f
Added TurkishMMLU to LM Evaluation Harness (#2283)
ArdaYueksel Sep 26, 2024
0bb44af
add mmlu readme (#2282)
baberabb Sep 26, 2024
c51deda
openai: better error messages; fix greedy matching (#2327)
baberabb Sep 26, 2024
0e5ef9d
Rename this file.
giuliolovisotto Sep 23, 2024
ad1ce4e
Add groups/tasks descriptor.
giuliolovisotto Sep 23, 2024
6f58636
Refactored structure.
giuliolovisotto Sep 27, 2024
d685ed5
Merge branch 'EleutherAI:main' into task/2305-openai-multilingual-mmlu
giuliolovisotto Sep 27, 2024
b6f2fe3
Updated readme.md
giuliolovisotto Sep 27, 2024
a2a9e8e
Go back to baber ds which has subject split.
giuliolovisotto Sep 27, 2024
4fff660
Re-add english.
giuliolovisotto Sep 27, 2024
41e8a92
Point to dataset fixed paths.
giuliolovisotto Oct 1, 2024
6807c19
Merge branch 'EleutherAI:main' into task/2305-openai-multilingual-mmlu
giuliolovisotto Oct 1, 2024
c8ca981
more explicit group info.
giuliolovisotto Oct 8, 2024
75 changes: 75 additions & 0 deletions lm_eval/tasks/openai_mmmlu/README.md
@@ -0,0 +1,75 @@
# OpenAI MMMLU

### Technical Report

The task/dataset contains professional human translations of the MMLU benchmark (originally in English) into 14 languages.

Title: OpenAI o1 System Card

Homepage: https://openai.com/index/openai-o1-system-card/

Technical Report: https://assets.ctfassets.net/kftzwdyauwt9/67qJD51Aur3eIc96iOfeOP/71551c3d223cd97e591aa89567306912/o1_system_card.pdf

[Measuring Massive Multitask Language Understanding](https://arxiv.org/pdf/2009.03300) by Dan Hendrycks, Collin Burns, Steven Basart, Andy Zou, Mantas Mazeika, Dawn Song, and Jacob Steinhardt (ICLR 2021).

### Groups and Tasks

The `default` variant follows the common MMLU-style prompting. Example output from `--write-out`:

```bash
[...]

document 0; context prompt (starting on next line):
The following are multiple choice questions (with answers) about anatomy.

Ermitteln Sie den Grad für die gegebene Felderweiterung Q(sqrt(2), sqrt(3), sqrt(18)) über Q.
A. 0
B. 4
C. 2
D. 6
Antwort:
(end of prompt on previous line)
target string or answer choice index (starting on next line):
B
(end of target on previous line)
```

Note that
* the `description` is in English, while the question itself and the "Answer:" prefix are in the target language [the translated prefix was my choice]; see the sketch after this list for how the prompt is assembled.
* in the paper, the prompt is [significantly different](https://github.com/openai/simple-evals/blob/2df1a92bbddb8c89fbeb3670e2dd125b10632bca/common.py#L12) and includes CoT plus [generous regexps](https://github.com/openai/simple-evals/blob/2df1a92bbddb8c89fbeb3670e2dd125b10632bca/common.py#L29) (filters) to extract the answer. Reproducing those results would require a separate variant.
* split information is not present in the [dataset on hf](https://huggingface.co/datasets/openai/MMMLU), so this dataset currently supports neither few-shot prompting nor decontamination.
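
For reference, here is a minimal Python sketch of how this prompt string is assembled. It mirrors the Jinja templates emitted by `_generate_configs.py` below; the `build_prompt` helper and the row field names are illustrative, not part of the harness:

```python
# Minimal sketch of the default prompt assembly (not part of the harness).
DESCRIPTION = "The following are multiple choice questions (with answers) about {}.\n\n"
ANSWER_PREFIX = {"de_de": "Antwort:"}  # translated "Answer:" prefix per language


def build_prompt(row: dict, subject: str, lang: str) -> str:
    body = "{}\nA. {}\nB. {}\nC. {}\nD. {}\n".format(
        row["Question"].strip(),
        row["A"].strip(),
        row["B"].strip(),
        row["C"].strip(),
        row["D"].strip(),
    )
    return DESCRIPTION.format(subject) + body + ANSWER_PREFIX[lang]


example_row = {
    "Question": "Ermitteln Sie den Grad für die gegebene Felderweiterung Q(sqrt(2), sqrt(3), sqrt(18)) über Q.",
    "A": "0",
    "B": "4",
    "C": "2",
    "D": "6",
}
print(build_prompt(example_row, "anatomy", "de_de"))
```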

#### Groups

* `openai_mmmlu_default` # supergroup of the following groups
* `openai_mmmlu_default_ar_xy`
* `openai_mmmlu_default_bn_bd`
* `openai_mmmlu_default_de_de`
* `openai_mmmlu_default_es_la`
* `openai_mmmlu_default_fr_fr`
* `openai_mmmlu_default_hi_in`
* `openai_mmmlu_default_id_id`
* `openai_mmmlu_default_it_it`
* `openai_mmmlu_default_ja_jp`
* `openai_mmmlu_default_ko_kr`
* `openai_mmmlu_default_pt_br`
* `openai_mmmlu_default_sw_ke`
* `openai_mmmlu_default_yo_ng`
  * `openai_mmmlu_default_zh_cn`

#### Tasks

* `openai_mmmlu_default_<language>_<subject>`: the MMLU translation for the given language and subject, combined with the `default` prompt.
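
As a quick usage example, one of these tasks can be run through the harness' Python API. This is a hedged sketch: the model is a placeholder and exact keyword arguments may differ between lm-eval versions.

```python
# Sketch: run a single language/subject task via lm-eval's Python entry point.
import lm_eval

results = lm_eval.simple_evaluate(
    model="hf",
    model_args="pretrained=EleutherAI/pythia-160m",  # placeholder HF model id
    tasks=["openai_mmmlu_default_de_de_anatomy"],
    batch_size=8,
)
print(results["results"])
```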

### Checklist

For adding novel benchmarks/datasets to the library:
* [x] Is the task an existing benchmark in the literature?
* [x] Have you referenced the original paper that introduced the task?
* [x] If yes, does the original paper provide a reference implementation? If so, have you checked against the reference implementation and documented how to run such a test?


If other tasks on this dataset are already supported:
* [x] Is the "Main" variant of this task clearly denoted? Yes, it would be the `default` folder.
* [x] Have you provided a short sentence in a README on what each new variant adds / evaluates?
* [ ] Have you noted which, if any, published evaluation setups are matched by this variant?
12 changes: 12 additions & 0 deletions lm_eval/tasks/openai_mmmlu/_default_template_yaml
@@ -0,0 +1,12 @@
dataset_path: giuliolovisotto/openai_multilingual_mmlu # a copy of `cais/mmlu` with no auxiliary_train split
fewshot_split: null
fewshot_config: null
output_type: multiple_choice
metric_list:
  - metric: acc
    aggregation: mean
    higher_is_better: true
metadata:
  version: 1.0.0
dataset_kwargs:
  trust_remote_code: true
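
For orientation, the `dataset_path`, `dataset_name`, and `test_split` wiring produced by the generation script below corresponds roughly to the following direct load. This is a sketch: the subject, language code, and column names are taken from the script's templates, not verified against the Hub dataset.

```python
# Hedged sketch: load one subject/language split of the dataset directly.
# The config name is the MMLU subject and the split is the language code,
# matching the dataset_name/test_split values written by _generate_configs.py.
from datasets import load_dataset

ds = load_dataset(
    "giuliolovisotto/openai_multilingual_mmlu",
    "anatomy",
    split="DE_DE",
    trust_remote_code=True,
)
print(ds[0]["Question"], ds[0]["Answer"])  # column names assumed from the prompt templates
```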
176 changes: 176 additions & 0 deletions lm_eval/tasks/openai_mmmlu/_generate_configs.py
@@ -0,0 +1,176 @@
"""
Caveats:
* I asked an LLM to translate "Answer:" into the 14 languages following, output of its conversation below.
* OpenAI uses slightly different formatting for Q/A -> https://github.com/openai/simple-evals/blob/main/common.py
a. They use a list of regexps in various langauages to extract a line that contains "Answer: .." to get the response,
b. They have an actual prompt `QUERY_TEMPLATE_MULTICHOICE` with cot priming - I added this.

======
Can you make me a dictionary that maps these languages to the way one would say "Answer:" in those languages? Let me know when you have doubts and why.

LANGUAGES = {
"AR_XY": "Arabic (Generic)",
"BN_BD": "Bengali (Bangladesh)",
"DE_DE": "German (Germany)",
"ES_LA": "Spanish (Latin America)",
"FR_FR": "French (France)",
"HI_IN": "Hindi (India)",
"ID_ID": "Indonesian (Indonesia)",
"IT_IT": "Italian (Italy)",
"JA_JP": "Japanese (Japan)",
"KO_KR": "Korean (South Korea)",
"PT_BR": "Portuguese (Brazil)",
"ZH_CN": "Chinese (China)",
"SW_KE": "Swahili (Kenya)",
"YO_NG": "Yoruba (Nigeria)"
}

I'll create a dictionary that maps these languages to their respective translations of "Answer:". I'll think through this step-by-step and highlight any uncertainties.
First, I'll create a new dictionary with the same keys as the LANGUAGES dictionary.
For each language, I'll provide the translation of "Answer:".
I'll note any doubts or uncertainties as I go.

Here's the process:
AR_XY (Arabic): "الإجابة:" No doubts here.
BN_BD (Bengali): "উত্তর:" No doubts.
DE_DE (German): "Antwort:" Confident about this one.
ES_LA (Spanish): "Respuesta:" Note: This is standard Spanish, but there might be slight variations in Latin American countries.
FR_FR (French): "Réponse :" Note: In French, there's usually a space before the colon.
HI_IN (Hindi): "उत्तर:" No doubts.
ID_ID (Indonesian): "Jawaban:" Fairly confident, but there might be variations.
IT_IT (Italian): "Risposta:" No doubts.
JA_JP (Japanese): "答え：" Note: Japanese uses a full-width colon (：) instead of a half-width one (:).
KO_KR (Korean): "답변:" Fairly confident, but there might be other ways to express this.
PT_BR (Portuguese): "Resposta:" No doubts for Brazilian Portuguese.
ZH_CN (Chinese): "答案:" Note: Using Simplified Chinese characters. Also uses full-width colon.
SW_KE (Swahili): "Jibu:" Some uncertainty here, as my knowledge of Swahili is not as extensive.
YO_NG (Yoruba): "Ìdáhùn:" Significant uncertainty. My knowledge of Yoruba is limited, and this might not be the most common or accurate translation.

Main points of uncertainty:
For ES_LA (Spanish), there might be slight regional variations in Latin America.
For SW_KE (Swahili) and YO_NG (Yoruba), my confidence is lower due to limited knowledge of these languages.
For some languages, there might be multiple ways to express "Answer:" depending on context or formality level.

It would be advisable to have these translations verified by native speakers or language experts, especially for Swahili and Yoruba.
======


"""

import argparse
import logging
from pathlib import Path

import yaml
from tqdm import tqdm
from utils import ANSWERS, LANGUAGES, SUBJECTS
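# NOTE: LANGUAGES, SUBJECTS, and ANSWERS live in utils.py (not shown in this
# diff); ANSWERS presumably maps each language code to its translated
# "Answer:" prefix from the conversation above, e.g. "DE_DE" -> "Antwort:".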


eval_logger = logging.getLogger("lm-eval")


def parse_args():
    parser = argparse.ArgumentParser()
    return parser.parse_args()


if __name__ == "__main__":
    args = parse_args()

    PROMPT_FLAVOURS = {
        # default in the version of standard MMLU
        # honestly I think we should translate these into the target language.
        "default": {
            "description": "The following are multiple choice questions (with answers) about {}.\n\n",
            "prompt": "{{Question.strip()}}\nA. {{A.strip()}}\nB. {{B.strip()}}\nC. {{C.strip()}}\nD. {{D.strip()}}\n",
            "add_answer": True,
        },
        # this one in the version found on simple-evals from openai
        # "cot": {
        #     "description": "Answer the following multiple choice question. The last line of your response should be of the following format: 'Answer: $LETTER' (without quotes) where LETTER is one of ABCD. Think step by step before answering.\n\n",
        #     "prompt": "{{Question.strip()}}\n\nA) {{A.strip()}}\nB) {{B.strip()}}\nC) {{C.strip()}}\nD) {{D.strip()}}\n",
        #     "add_answer": False
        # }
    }

    ALL_CATEGORIES = []
    ALL_TASKS = []
    for prompt_key, prompt_info in PROMPT_FLAVOURS.items():
        for langgode, language_full_name in tqdm(LANGUAGES.items()):
            _langgode = langgode.lower()
            out_folder = Path(prompt_key) / _langgode
            out_folder.mkdir(exist_ok=True, parents=True)
            for subject, category in SUBJECTS.items():
                if category not in ALL_CATEGORIES:
                    ALL_CATEGORIES.append(category)

                yaml_dict = {
                    "include": "../../_default_template_yaml",
                    "tag": f"openai_mmmlu_{prompt_key}_{_langgode}_{category}",
                    "task": f"openai_mmmlu_{prompt_key}_{_langgode}_{subject}",
                    "task_alias": f'{_langgode} {subject.replace("_", " ")}',
                    "dataset_name": subject,
                    "test_split": langgode,
                    "description": prompt_info["description"].format(subject),
                    "doc_to_text": prompt_info["prompt"]
                    + (ANSWERS[langgode] if prompt_info["add_answer"] else ""),
                    "doc_to_choice": ["A", "B", "C", "D"],
                    "doc_to_target": "{{Answer.strip()}}",
                }

                file_save_path = (
                    out_folder / f"openai_mmmlu_{prompt_key}_{subject}.yaml"
                )
                eval_logger.info(
                    f"Saving yaml for subset {_langgode},{subject} to {file_save_path}"
                )
                with open(file_save_path, "w", encoding="utf-8") as yaml_file:
                    yaml.dump(
                        yaml_dict,
                        yaml_file,
                        allow_unicode=True,
                        default_style='"',
                    )

            # (sub)group for prompt/language pair
            subgroup_info_path = (
                out_folder / f"_{prompt_key}_{_langgode}_group_info.yaml"
            )
            with open(subgroup_info_path, "w", encoding="utf-8") as yaml_file:
                # list of task for this pair of prompt/language
                _tasks = [
                    f"openai_mmmlu_{prompt_key}_{_langgode}_{_subject}"
                    for _subject in SUBJECTS.keys()
                ]
                dct = {
                    "group": f"openai_mmmlu_{prompt_key}_{_langgode}",
                    "task": _tasks,
                    "aggregate_metric_list": [
                        {"metric": "acc", "weight_by_size": True}
                    ],
                    "metadata": {"version": "1.0.0"},
                }
                ALL_TASKS.extend(_tasks)
                yaml.dump(
                    dct,
                    yaml_file,
                    indent=4,
                    default_flow_style=False,
                )
        # (super)group for promptkey
        out_folder = Path(prompt_key)
        supergroup_info_path = out_folder / f"_openai_mmmlu_{prompt_key}.yaml"
        with open(supergroup_info_path, "w", encoding="utf-8") as yaml_file:
            dct = {
                "group": f"openai_mmmlu_{prompt_key}",
                "task": ALL_TASKS,
                "aggregate_metric_list": [{"metric": "acc", "weight_by_size": True}],
                "metadata": {"version": "1.0.0"},
            }

            yaml.dump(
                dct,
                yaml_file,
                indent=4,
                default_flow_style=False,
            )