MMLU-ProX: A Multilingual Benchmark for Advanced Large Language Model Evaluation


Affiliations: The University of Tokyo, Japan; Duke-NUS Medical School, Singapore; Waseda University, Japan; Northwestern University, United States; Carnegie Mellon University, United States; Yale University, United States; University College Dublin, Ireland; Nanyang Technological University, Singapore; Smartor Inc, Japan; University of California, Berkeley, United States; University of New South Wales, Australia; Singapore Management University, Singapore; New York University, United States; Polytechnique Montreal, Canada; University of Geneva, Switzerland; University of Alberta, Canada

Overview

MMLU-ProX is a multilingual benchmark that extends MMLU-Pro to 29 typologically diverse languages, designed to evaluate large language models' reasoning capabilities across linguistic and cultural boundaries.

MMLU-ProX addresses critical limitations in existing multilingual benchmarks by:

  • Extending coverage to 29 typologically diverse languages
  • Building upon the challenging, reasoning-focused design of MMLU-Pro
  • Employing a rigorous semi-automatic translation process with expert validation
  • Ensuring conceptual accuracy, terminological consistency, and cultural relevance

News

  • [May 2025] 🎉 MMLU-ProX now contains 29 languages, all available on Hugging Face! We provide both a lite version and a full version (see the loading sketch after this list)! We will update the paper soon!
  • [March 2025] 🎉 MMLU-ProX's evaluation is now available on lm-evaluation-harness!
  • [March 2025] 🎉 MMLU-ProX is now available on Hugging Face!
  • [March 2025] We are still expanding this dataset to more languages! Stay tuned!
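
Both versions can be loaded directly with the Hugging Face datasets library. The sketch below is a minimal example, not an official loading recipe: the dataset ID, configuration name, split, and field names are placeholders or assumptions based on the MMLU-Pro format, so please check the dataset card on Hugging Face for the exact identifiers.

# Minimal sketch: load one language of MMLU-ProX via the Hugging Face
# datasets library. The dataset ID and config name are placeholders;
# see the dataset card for the exact values.
from datasets import load_dataset

dataset_id = "<mmlu-prox-dataset-id>"   # lite or full version on Hugging Face
lang = "<your-target-language>"         # e.g. a language config such as "ja"

ds = load_dataset(dataset_id, lang, split="test")

# Records follow the MMLU-Pro layout: a question, a list of up to ten
# answer options, and the gold answer. These field names are assumptions
# inherited from MMLU-Pro; verify them against the dataset card.
sample = ds[0]
print(sample["question"])
print(sample["options"])
print(sample["answer"])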

Usage

To reproduce the results reported in our paper, we support vLLM evaluation through lm-evaluation-harness with the following command:

model_id=<your-target-model>
tensor_parallel_size=<number-of-gpu-you-want-to-use>
lang=<your-target-language>

python -m lm_eval \
  --model vllm \
  --model_args pretrained=${model_id},tensor_parallel_size=${tensor_parallel_size},dtype=auto,gpu_memory_utilization=0.9 \
  --batch_size auto \
  --tasks mmlu_prox_${lang}

Please refer to lm-evaluation-harness for more details about how to set it up.

Note: Please install vllm==0.7.3 to reproduce our results, except for Llama-3.1-405B, which was evaluated with vllm==0.6.6.
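
If you prefer to drive the evaluation from Python, lm-evaluation-harness also exposes a simple_evaluate entry point whose arguments mirror the CLI flags above. The sketch below is a minimal, unofficial example; argument handling can differ between harness releases, so verify it against your installed version.

# Minimal sketch: the same vLLM evaluation through the
# lm-evaluation-harness Python API instead of the CLI.
import lm_eval

model_id = "<your-target-model>"
tensor_parallel_size = "<number-of-gpu-you-want-to-use>"
lang = "<your-target-language>"  # must match a registered mmlu_prox_* task

results = lm_eval.simple_evaluate(
    model="vllm",
    model_args=(
        f"pretrained={model_id},"
        f"tensor_parallel_size={tensor_parallel_size},"
        "dtype=auto,gpu_memory_utilization=0.9"
    ),
    tasks=[f"mmlu_prox_{lang}"],
    batch_size="auto",
)

# Per-task metrics are keyed by task name under results["results"].
print(results["results"][f"mmlu_prox_{lang}"])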

Citation

@misc{xuan2025mmluproxmultilingualbenchmarkadvanced,
      title={MMLU-ProX: A Multilingual Benchmark for Advanced Large Language Model Evaluation}, 
      author={Weihao Xuan and Rui Yang and Heli Qi and Qingcheng Zeng and Yunze Xiao and Aosong Feng and Dairui Liu and Yun Xing and Junjue Wang and Fan Gao and Jinghui Lu and Yuang Jiang and Huitao Li and Xin Li and Kunyu Yu and Ruihai Dong and Shangding Gu and Yuekang Li and Xiaofei Xie and Felix Juefei-Xu and Foutse Khomh and Osamu Yoshie and Qingyu Chen and Douglas Teodoro and Nan Liu and Randy Goebel and Lei Ma and Edison Marrese-Taylor and Shijian Lu and Yusuke Iwasawa and Yutaka Matsuo and Irene Li},
      year={2025},
      eprint={2503.10497},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2503.10497}, 
}

Contact

For questions or feedback about MMLU-ProX, please open an issue.
