MILU (Multi-task Indic Language Understanding Benchmark) is a comprehensive evaluation dataset designed to assess the performance of Large Language Models (LLMs) across 11 Indic languages. It spans 8 domains and 42 subjects, reflecting both general and culturally specific knowledge from India.
This repository contains code for evaluating language models on the MILU benchmark using the lm-eval-harness framework.
- Python 3.7+
- lm-eval-harness library
- HuggingFace Transformers
- vLLM (optional, for faster inference)
- Clone this repository:
git clone --depth 1 https://github.com/AI4Bharat/MILU.git
cd MILU
pip install -e .
- Request access to the HuggingFace 🤗 dataset here.
- Set up your environment variables:
export HF_HOME=/path/to/HF_CACHE/if/needed
export HF_TOKEN=YOUR_HUGGINGFACE_TOKEN
The following languages are supported for MILU:
- Bengali
- English
- Gujarati
- Hindi
- Kannada
- Malayalam
- Marathi
- Odia
- Punjabi
- Tamil
- Telugu
For HuggingFace models, you may use the following sample command:
lm_eval --model hf \
--model_args 'pretrained=google/gemma-2-27b-it,temperature=0.0,top_p=1.0,parallelize=True' \
--tasks milu \
--batch_size auto:40 \
--log_samples \
--output_path $EVAL_OUTPUT_PATH \
--max_batch_size 64 \
--num_fewshot 5 \
--apply_chat_template
For vLLM-compatible models, you may use the following sample command:
lm_eval --model vllm \
--model_args "pretrained=meta-llama/Llama-3.2-3B,tensor_parallel_size=$N_GPUS" \
--gen_kwargs 'temperature=0.0,top_p=1.0' \
--tasks milu \
--batch_size auto \
--log_samples \
--output_path $EVAL_OUTPUT_PATH
To evaluate your model on a specific language, modify the `--tasks` parameter:

--tasks milu_English

Replace `English` with any of the supported languages (e.g., `Odia`, `Hindi`, etc.).
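For instance, a complete single-language run for Hindi could look like the sketch below (adapted from the HuggingFace command above; the model, few-shot count, and output path are placeholders you should adjust):

```bash
lm_eval --model hf \
    --model_args 'pretrained=google/gemma-2-27b-it,parallelize=True' \
    --tasks milu_Hindi \
    --num_fewshot 5 \
    --batch_size auto \
    --apply_chat_template \
    --log_samples \
    --output_path $EVAL_OUTPUT_PATH
```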
- Make sure to use `--apply_chat_template` for instruction-fine-tuned models, so that the prompt is formatted correctly.
- vLLM generally works better with Llama models, while Gemma models work better with HuggingFace.
- If vLLM encounters out-of-memory errors, try reducing `gpu_memory_utilization`; otherwise switch to HuggingFace.
- For HuggingFace, use `--batch_size=auto:<n_batch_resize_tries>` to re-select the batch size multiple times.
- When using vLLM, pass generation kwargs via the `--gen_kwargs` flag. For HuggingFace, include them in `model_args`. Both placements are shown in the sketch below.
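A rough sketch of the last three tips follows; the model names and values are purely illustrative, and `gpu_memory_utilization` is assumed to be forwarded through `model_args` to the vLLM engine:

```bash
# vLLM: generation kwargs go in --gen_kwargs;
# lower gpu_memory_utilization if you hit out-of-memory errors
lm_eval --model vllm \
    --model_args "pretrained=meta-llama/Llama-3.2-3B,tensor_parallel_size=$N_GPUS,gpu_memory_utilization=0.8" \
    --gen_kwargs 'temperature=0.0,top_p=1.0' \
    --tasks milu

# HuggingFace: generation kwargs live inside --model_args;
# auto:4 re-runs the automatic batch-size search up to 4 times
lm_eval --model hf \
    --model_args 'pretrained=google/gemma-2-27b-it,temperature=0.0,top_p=1.0' \
    --tasks milu \
    --batch_size auto:4
```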
- 11 Indian Languages: Bengali, Gujarati, Hindi, Kannada, Malayalam, Marathi, Odia, Punjabi, Tamil, Telugu, and English
- Domains: 8 diverse domains including Arts & Humanities, Social Sciences, STEM, and more
- Subjects: 42 subjects covering a wide range of topics
- Questions: ~85,000 multiple-choice questions
- Cultural Relevance: Incorporates India-specific knowledge from regional and state-level examinations
Language | Total Questions | Translated Questions | Avg Words Per Question |
---|---|---|---|
Bengali | 7138 | 1601 | 15.72 |
Gujarati | 5327 | 2755 | 16.69 |
Hindi | 15450 | 115 | 20.63 |
Kannada | 6734 | 1522 | 12.83 |
Malayalam | 4670 | 1534 | 12.82 |
Marathi | 7424 | 1235 | 18.8 |
Odia | 5025 | 1452 | 15.63 |
Punjabi | 4363 | 2341 | 19.9 |
Tamil | 7059 | 1524 | 13.32 |
Telugu | 7847 | 1298 | 16.13 |
English | 14036 | - | 22.01 |
Total | 85073 | 15377 | 16.77 (avg) |
The test set consists of the MILU (Multi-task Indic Language Understanding) benchmark, which contains approximately 85,000 multiple-choice questions across 11 Indic languages.
The dataset includes a separate validation set of 9,157 samples that can be used for few-shot examples during evaluation. This validation set was created by sampling questions from each of the 42 subjects.
Domain | Subjects |
---|---|
Arts & Humanities | Architecture and Design, Arts and Culture, Education, History, Language Studies, Literature and Linguistics, Media and Communication, Music and Performing Arts, Religion and Spirituality |
Business Studies | Business and Management, Economics, Finance and Investment |
Engineering & Tech | Energy and Power, Engineering, Information Technology, Materials Science, Technology and Innovation, Transportation and Logistics |
Environmental Sciences | Agriculture, Earth Sciences, Environmental Science, Geography |
Health & Medicine | Food Science, Health and Medicine |
Law & Governance | Defense and Security, Ethics and Human Rights, Law and Ethics, Politics and Governance |
Math and Sciences | Astronomy and Astrophysics, Biology, Chemistry, Computer Science, Logical Reasoning, Mathematics, Physics |
Social Sciences | Anthropology, International Relations, Psychology, Public Administration, Social Welfare and Development, Sociology, Sports and Recreation |
If you use MILU in your work, please cite us:
@article{verma2024milu,
title = {MILU: A Multi-task Indic Language Understanding Benchmark},
author = {Sshubam Verma and Mohammed Safi Ur Rahman Khan and Vishwajeet Kumar and Rudra Murthy and Jaydeep Sen},
year = {2024},
journal = {arXiv preprint arXiv:2411.02538}
}
This dataset is released under the CC BY 4.0 license.
For any questions or feedback, please contact:
- Sshubam Verma ([email protected])
- Mohammed Safi Ur Rahman Khan ([email protected])
- Rudra Murthy ([email protected])
- Vishwajeet Kumar ([email protected])