This release continues our steady stream of community contributions with a batch of new benchmarks, expanded model support, and important fixes. A notable change: Python 3.10 is now the minimum required version.
New Benchmarks & Tasks
A big wave of new evaluation tasks this release:
- AIME and MATH500 math reasoning benchmarks by @jannalulu in #3248, #3311
- BabiLong and Longbench v2 for long-context evaluation by @jannalulu in #3287, #3338
- GraphWalks by @jannalulu in #3377
- ZhoBLiMP, BLiMP-NL, TurBLiMP, LM-SynEval, and BHS linguistic benchmarks by @jmichaelov in #3218, #3221, #3219, #3184, #3265
- Icelandic WinoGrande by @jmichaelov in #3277
- CLIcK Korean benchmark by @shing100 in #3173
- MMLU-Redux (generative) and Spanish translation by @luiscosio in #2705
- EsBBQ and CaBBQ bias benchmarks by @valleruizf in #3167
- EQBench in Spanish and Catalan by @priverabsc in #3168
- Anthropic discrim-eval by @Helw150 in #3091
- XNLI-VA by @FranValero97 in #3194
- Bangla MMLU (Titulm) by @Ismail-Hossain-1 in #3317
- HumanEval infilling by @its-alpesh in #3299
- CNN-DailyMail 3.0.0 by @preordinary in #3426
- Global PIQA and new `acc_norm_bytes` metric by @baberabb in #3368
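
The notes do not define `acc_norm_bytes`. As a rough illustration only, the sketch below assumes it scores each answer choice by its log-likelihood divided by the choice's UTF-8 byte length, where a character-normalized metric would divide by character count. The function name `pick_choice` and the sample values are hypothetical and not taken from the harness.

```python
def pick_choice(loglikelihoods, choices, unit):
    """Return the index of the best choice after length normalization.

    Assumption: "bytes" normalization divides by the UTF-8 byte length,
    "chars" normalization divides by the number of characters.
    """
    if unit == "chars":
        lengths = [len(choice) for choice in choices]
    elif unit == "bytes":
        lengths = [len(choice.encode("utf-8")) for choice in choices]
    else:
        raise ValueError(f"unknown unit: {unit!r}")
    scores = [ll / n for ll, n in zip(loglikelihoods, lengths)]
    return max(range(len(choices)), key=scores.__getitem__)


# Hypothetical example: the second choice uses multi-byte characters,
# so its byte length (15) is three times its character length (5).
choices = ["ok", "大丈夫です"]
loglikelihoods = [-3.0, -9.0]

by_chars = pick_choice(loglikelihoods, choices, "chars")
by_bytes = pick_choice(loglikelihoods, choices, "bytes")
```

Under this assumption the two normalizations can disagree on non-ASCII text, which is why a byte-based variant matters for a multilingual task.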
Fixes & Improvements
Core Changes:
- Python 3.10 minimum by @jannalulu in #3337
- Unpinned `datasets` library by @baberabb in #3316
- BOS token handling: Delegate to tokenizer; `add_bos_token` now defaults to `None` by @baberabb in #3347
- Renamed `LOGLEVEL` env var to `LMEVAL_LOG_LEVEL` to avoid conflicts by @fxmarty-amd in #3418
- Resolve duplicate task names with safeguards by @giuliolovisotto in #3394
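
If your own wrapper scripts read the log level from the environment, the rename may need a small migration. The sketch below is one way to do that; only the two variable names come from these notes. The helper `resolve_log_level` and its fallback to the old name are hypothetical conveniences for user code, not behavior of the harness itself.

```python
import logging
import os


def resolve_log_level(env=None, default="INFO"):
    """Return a numeric logging level, preferring LMEVAL_LOG_LEVEL.

    Falling back to the old LOGLEVEL name is a hypothetical shim for
    user scripts; the harness is not claimed to do this.
    """
    env = os.environ if env is None else env
    name = env.get("LMEVAL_LOG_LEVEL") or env.get("LOGLEVEL") or default
    level = logging.getLevelName(name.upper())
    # getLevelName returns an int for known level names and a string
    # such as "Level BOGUS" for unknown ones.
    return level if isinstance(level, int) else logging.INFO


if __name__ == "__main__":
    logging.basicConfig(level=resolve_log_level())
```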
Task Fixes:
- Fixed MMLU-Redux to exclude samples without `error_type="ok"` and display summary table by @fxmarty-amd in #3410, #3406
- Fixed AIME answer extraction by @jannalulu in #3353
- Fixed LongBench evaluation and group handling by @TimurAysin, @jannalulu in #3273, #3359, #3361
- Fixed `crows_pairs` dataset by @jannalulu in #3378
- Fixed Gemma tokenizer `add_bos_token` not updating by @DarkLight1337 in #3206
- Fixed `lambada_multilingual_stablelm` by @jmichaelov, @HallerPatrick in #3294, #3222
- Fixed CodeXGLUE by @gsaltintas in #3238
- Pinned correct MMLUSR version by @christinaexyou in #3350
- Updated `minerva_math` by @baberabb in #3259
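
The MMLU-Redux fix above amounts to a filter on the `error_type` field. A minimal sketch of that idea follows, assuming each sample is a dict. The field name and the `"ok"` value come from these notes, while `keep_ok_samples`, the other keys, and the sample rows are hypothetical.

```python
def keep_ok_samples(samples):
    """Keep only samples whose error_type is exactly "ok".

    Samples with a different error_type, or with no error_type at all,
    are dropped.
    """
    return [s for s in samples if s.get("error_type") == "ok"]


# Hypothetical rows for illustration only.
samples = [
    {"question": "q1", "error_type": "ok"},
    {"question": "q2", "error_type": "wrong_groundtruth"},
    {"question": "q3"},
]
filtered = keep_ok_samples(samples)
```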
Backend Fixes:
- Fixed vLLM import errors when not installed by @fxmarty-amd in #3292
- Fixed vLLM `data_parallel_size>1` issue by @Dornavineeth in #3303
- Resolved deprecated `vllm.utils.get_open_port` by @DarkLight1337 in #3398
- Fixed GPT series model bugs by @zinccat in #3348
- Fixed PIL image hashing to use actual bytes by @tboerstad in #3331
- Fixed `additional_config` parsing by @brian-dellabetta in #3393
- Fixed batch chunking seed handling with groupby by @slimfrkha in #3047
- Fixed no-output error handling by @Oseltamivir in #3395
- Replaced deprecated `torch_dtype` with `dtype` by @AbdulmalikDS in #3415
- Fixed custom task config reading by @SkyR0ver in #3425
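
To see why hashing an image by its object repr is fragile, consider the sketch below. It uses a hypothetical `StandInImage` class so that it runs without Pillow. The class mirrors only the `mode`, `size`, and `tobytes()` members of a PIL image, and neither hash function is the harness's actual implementation.

```python
import hashlib


class StandInImage:
    """Hypothetical stand-in exposing mode, size, and tobytes()."""

    def __init__(self, mode, size, data):
        self.mode = mode
        self.size = size
        self._data = data

    def tobytes(self):
        return self._data


def hash_by_repr(image):
    # The default repr embeds the object's memory address, so two
    # images with identical pixels hash differently.
    return hashlib.sha256(repr(image).encode("utf-8")).hexdigest()


def hash_by_bytes(image):
    # Hash the content: mode, dimensions, and raw pixel bytes.
    digest = hashlib.sha256()
    digest.update(image.mode.encode("utf-8"))
    digest.update(repr(image.size).encode("utf-8"))
    digest.update(image.tobytes())
    return digest.hexdigest()


a = StandInImage("RGB", (1, 1), b"\x00\x01\x02")
b = StandInImage("RGB", (1, 1), b"\x00\x01\x02")
```

Here `hash_by_bytes(a)` equals `hash_by_bytes(b)`, while the repr-based hashes differ, so a cache keyed on the repr would miss identical inputs.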
Model & Backend Support
- OpenAI GPT-5 support by @babyplutokurt in #3247
- Azure OpenAI support by @zinccat in #3349
- Fine-tuned Gemma3 evaluation support by @LearnerSXH in #3234
- OpenVINO text2text models by @nikita-savelyevv in #3101
- Intel XPU support for HFLM by @kaixuanliu in #3211
- Attention head steering support by @luciaquirke in #3279
- Leverage vLLM's `tokenizer_info` endpoint to avoid manual duplication by @m-misiura in #3185
What's Changed
- Remove `trust_remote_code: True` from updated datasets by @Avelina9X in #3213
- Add support for evaluating with fine-tuned Gemma3 by @LearnerSXH in #3234
- Fix `add_bos_token` not updated for Gemma tokenizer by @DarkLight1337 in #3206
- remove incomplete compilation instructions, solves #3233 by @ceferisbarov in #3242
- Update utils.py by @Anri-Lombard in #3246
- Adding support for OpenAI GPT-5 model by @babyplutokurt in #3247
- Add xnli_va dataset by @FranValero97 in #3194
- Add ZhoBLiMP benchmark by @jmichaelov in #3218
- Add BLiMP-NL by @jmichaelov in #3221
- Add TurBLiMP by @jmichaelov in #3219
- Add LM-SynEval Benchmark by @jmichaelov in #3184
- Fix unknown group key to tag in yaml config for `lambada_multilingual_stablelm` by @HallerPatrick in #3222
- update `minerva_math` by @baberabb in #3259
- feat: Add CLIcK task by @shing100 in #3173
- Adds Anthropic/discrim-eval to lm-evaluation-harness by @Helw150 in #3091
- Add support for OpenVINO text2text generation models by @nikita-savelyevv in #3101
- Update MMLU-ProX task by @weihao1115 in #3174
- Support for AIME dataset by @jannalulu in #3248
- feat(scrolls): delete chat_template from kwargs by @slimfrkha in #3267
- pacify pre-commit by @baberabb in #3268
- Fix codexglue by @gsaltintas in #3238
- Add BHS benchmark by @jmichaelov in #3265
- Add `acc_norm` metric to BLiMP-NL by @jmichaelov in #3272
- Add `acc_norm` metric to ZhoBLiMP by @jmichaelov in #3271
- Add EsBBQ and CaBBQ tasks by @valleruizf in #3167
- Add support for steering individual attention heads by @luciaquirke in #3279
- Add the Icelandic WinoGrande benchmark by @jmichaelov in #3277
- Ignore seed when splitting batch in chunks with groupby by @slimfrkha in #3047
- [fix][vllm] Avoid import errors in case vllm is not installed by @fxmarty-amd in #3292
- Fix LongBench Evaluation by @TimurAysin in #3273
- add intel xpu support for HFLM by @kaixuanliu in #3211
- feat: Add mmlu-redux and its Spanish translation as generative task definitions by @luiscosio in #2705
- Add BabiLong by @jannalulu in #3287
- Add AIME to task description by @jannalulu in #3296
- Add humaneval_infilling task by @its-alpesh in #3299
- Add eqbench tasks in Spanish and Catalan by @priverabsc in #3168
- [fix] add math and longbench to test dependencies by @jannalulu in #3321
- Fix: VLLM model when data_parallel_size>1 by @Dornavineeth in #3303
- unpin datasets; update pre-commit by @baberabb in #3316
- bump to python 3.10 by @jannalulu in #3337
- Longbench v2 by @jannalulu in #3338
- Leverage vllm's `tokenizer_info` endpoint to avoid manual duplication by @m-misiura in #3185
- Add support for Titulm Bangla MMLU dataset by @Ismail-Hossain-1 in #3317
- remove duplicate tags/groups by @baberabb in #3343
- Align `humaneval_64_instruct` task label in README to name in yaml file by @jmichaelov in #3344
- Fixes bugs when using gpt series model by @zinccat in #3348
- [fix] aime doesn't extract answers by @jannalulu in #3353
- add global_piqa; add acc_norm_bytes metric by @baberabb in #3368
- [fix] crows_pairs dataset by @jannalulu in #3378
- Fix issue 3355 assertion error by @marksverdhei in #3356
- fix(gsm8k): align README to yaml file by @neoheartbeats in #3388
- added azure openai support by @zinccat in #3349
- Delegate BOS to the tokenizer; `add_bos_token` defaults to `None` by @baberabb in #3347
- fix trust_remote_code=True for longbench by @jannalulu in #3361
- [feat] add graphwalks by @jannalulu in #3377
- Longbench group fix by @jannalulu in #3359
- Resolve deprecation of `vllm.utils.get_open_port` by @DarkLight1337 in #3398
- Trim whitespace in remove_whitespace filter by @ziqing-huang in #3408
- Fixes #3391 avoid error on no-output by @Oseltamivir in #3395
- Fix PIL image hashing to use actual bytes instead of object repr by @tboerstad in #3331
- [MMLU redux] Do not use samples which do not have `error_type="ok"` by @fxmarty-amd in #3410
- fix: resolve duplicate task names and add safeguards. by @giuliolovisotto in #3394
- Add MATH500 by @jannalulu in #3311
- [bugfix] additional_config parsing by @brian-dellabetta in #3393
- fix(tasks):pin correct MMLUSR version by @christinaexyou in #3350
- Fix `lambada_multilingual_stablelm` by @jmichaelov in #3294
- Fix descriptions in the Moral Stories and Histoires Morales tasks. by @upunaprosk in #3374
- Replace deprecated torch_dtype parameter with dtype by @AbdulmalikDS in #3415
- [fix] Fix mmlu_redux not displaying summary table + display to the user the tasks / yaml that are actually pulled by @fxmarty-amd in #3406
- Rename the conflicting environment variable `LOGLEVEL` to `LMEVAL_LOG_LEVEL` by @fxmarty-amd in #3418
- Update SGLang installation and documentation links by @Bobchenyx in #3422
- Fix reading custom task configs by @SkyR0ver in #3425
- New Task: Add CNN-DailyMail (3.0.0) by @preordinary in #3426
New Contributors
- @LearnerSXH made their first contribution in #3234
- @ceferisbarov made their first contribution in #3242
- @Anri-Lombard made their first contribution in #3246
- @babyplutokurt made their first contribution in #3247
- @FranValero97 made their first contribution in #3194
- @HallerPatrick made their first contribution in #3222
- @Helw150 made their first contribution in #3091
- @nikita-savelyevv made their first contribution in #3101
- @weihao1115 made their first contribution in #3174
- @jannalulu made their first contribution in #3248
- @slimfrkha made their first contribution in #3267
- @gsaltintas made their first contribution in #3238
- @valleruizf made their first contribution in #3167
- @TimurAysin made their first contribution in #3273
- @kaixuanliu made their first contribution in #3211
- @its-alpesh made their first contribution in #3299
- @priverabsc made their first contribution in #3168
- @Dornavineeth made their first contribution in #3303
- @m-misiura made their first contribution in #3185
- @Ismail-Hossain-1 made their first contribution in #3317
- @zinccat made their first contribution in #3348
- @marksverdhei made their first contribution in #3356
- @neoheartbeats made their first contribution in #3388
- @ziqing-huang made their first contribution in #3408
- @Oseltamivir made their first contribution in #3395
- @tboerstad made their first contribution in #3331
- @brian-dellabetta made their first contribution in #3393
- @christinaexyou made their first contribution in #3350
- @AbdulmalikDS made their first contribution in #3415
- @Bobchenyx made their first contribution in #3422
- @SkyR0ver made their first contribution in #3425
- @preordinary made their first contribution in #3426
Full Changelog: v0.4.9.1...v0.4.9.2