MMLU-ProX: A Multilingual Benchmark for Advanced Large Language Model Evaluation


Affiliations: The University of Tokyo, Japan; Duke-NUS Medical School, Singapore; Waseda University, Japan; Northwestern University, United States; Carnegie Mellon University, United States; Yale University, United States; University College Dublin, Ireland; Nanyang Technological University, Singapore; Smartor Inc, Japan; University of California, Berkeley, United States; University of New South Wales, Australia; Singapore Management University, Singapore; New York University, United States; Polytechnique Montreal, Canada; University of Geneva, Switzerland; University of Alberta, Canada

Overview

MMLU-ProX is a multilingual benchmark that extends MMLU-Pro to 29 typologically diverse languages, designed to evaluate large language models' reasoning capabilities across linguistic and cultural boundaries.

MMLU-ProX addresses critical limitations in existing multilingual benchmarks by:

  • Extending coverage to 29 typologically diverse languages
  • Building upon the challenging, reasoning-focused design of MMLU-Pro
  • Employing a rigorous semi-automatic translation process with expert validation
  • Ensuring conceptual accuracy, terminological consistency, and cultural relevance

News

  • [May 2025] 🎉 MMLU-ProX now contains 29 languages, all available on Hugging Face! We provide both a lite version and a full version (see the loading sketch after this list)! We will update the paper soon!
  • [March 2025] 🎉 MMLU-ProX's evaluation is now available on lm-evaluation-harness!
  • [March 2025] 🎉 MMLU-ProX is now available on Hugging Face!
  • [March 2025] We are still expanding this dataset to more languages! Stay tuned!
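
Both versions can be loaded directly with the Hugging Face datasets library. The sketch below is a minimal example, not an official loading recipe: the dataset ID, configuration name, split, and field names are placeholders or assumptions based on the MMLU-Pro format, so please check the dataset card on Hugging Face for the exact identifiers.

# Minimal sketch: load one language of MMLU-ProX via the Hugging Face
# datasets library. The dataset ID and config name are placeholders;
# see the dataset card for the exact values.
from datasets import load_dataset

dataset_id = "<mmlu-prox-dataset-id>"   # lite or full version on Hugging Face
lang = "<your-target-language>"         # e.g. a language config such as "ja"

ds = load_dataset(dataset_id, lang, split="test")

# Records follow the MMLU-Pro layout: a question, a list of up to ten
# answer options, and the gold answer. These field names are assumptions
# inherited from MMLU-Pro; verify them against the dataset card.
sample = ds[0]
print(sample["question"])
print(sample["options"])
print(sample["answer"])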

Usage

To reproduce the results reported in our paper, we support vLLM evaluation through lm-evaluation-harness with the following command:

model_id=<your-target-model>
tensor_parallel_size=<number-of-gpu-you-want-to-use>
lang=<your-target-language>

python -m lm_eval \
  --model vllm \
  --model_args pretrained=${model_id},tensor_parallel_size=${tensor_parallel_size},dtype=auto,gpu_memory_utilization=0.9 \
  --batch_size auto \
  --tasks mmlu_prox_${lang}

Please refer to lm-evaluation-harness for more details about how to set it up.

Note: Please install vllm==0.7.3 to reproduce our results, except for Llama-3.1-405B, which was evaluated with vllm==0.6.6.
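
If you prefer to drive the evaluation from Python, lm-evaluation-harness also exposes a simple_evaluate entry point whose arguments mirror the CLI flags above. The sketch below is a minimal, unofficial example; argument handling can differ between harness releases, so verify it against your installed version.

# Minimal sketch: the same vLLM evaluation through the
# lm-evaluation-harness Python API instead of the CLI.
import lm_eval

model_id = "<your-target-model>"
tensor_parallel_size = "<number-of-gpu-you-want-to-use>"
lang = "<your-target-language>"  # must match a registered mmlu_prox_* task

results = lm_eval.simple_evaluate(
    model="vllm",
    model_args=(
        f"pretrained={model_id},"
        f"tensor_parallel_size={tensor_parallel_size},"
        "dtype=auto,gpu_memory_utilization=0.9"
    ),
    tasks=[f"mmlu_prox_{lang}"],
    batch_size="auto",
)

# Per-task metrics are keyed by task name under results["results"].
print(results["results"][f"mmlu_prox_{lang}"])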

Citation

@misc{xuan2025mmluproxmultilingualbenchmarkadvanced,
      title={MMLU-ProX: A Multilingual Benchmark for Advanced Large Language Model Evaluation}, 
      author={Weihao Xuan and Rui Yang and Heli Qi and Qingcheng Zeng and Yunze Xiao and Aosong Feng and Dairui Liu and Yun Xing and Junjue Wang and Fan Gao and Jinghui Lu and Yuang Jiang and Huitao Li and Xin Li and Kunyu Yu and Ruihai Dong and Shangding Gu and Yuekang Li and Xiaofei Xie and Felix Juefei-Xu and Foutse Khomh and Osamu Yoshie and Qingyu Chen and Douglas Teodoro and Nan Liu and Randy Goebel and Lei Ma and Edison Marrese-Taylor and Shijian Lu and Yusuke Iwasawa and Yutaka Matsuo and Irene Li},
      year={2025},
      eprint={2503.10497},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2503.10497}, 
}

Contact

For questions or feedback about MMLU-ProX, please open an issue.
