lm-eval v0.4.9.2 Release Notes · EleutherAI/lm-evaluation-harness

This release continues our steady stream of community contributions with a batch of new benchmarks, expanded model support, and important fixes. A notable change: Python 3.10 is now the minimum required version.
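
For scripts that wrap the harness, a small guard can surface the new version floor early. This is a minimal sketch, not code from the harness: `MIN_PYTHON` and `meets_minimum` are hypothetical names chosen for illustration.

```python
import sys

# Minimum interpreter version stated in these release notes.
MIN_PYTHON = (3, 10)


def meets_minimum(version_info=None) -> bool:
    """Return True if the given (or running) interpreter is at least MIN_PYTHON."""
    if version_info is None:
        version_info = sys.version_info
    # Compare only (major, minor); patch level does not matter here.
    return tuple(version_info[:2]) >= MIN_PYTHON


if not meets_minimum():
    print("Python 3.10 or newer is required for lm-eval v0.4.9.2", file=sys.stderr)
```

Calling `meets_minimum((3, 9, 18))` returns `False`, while `meets_minimum((3, 10, 0))` returns `True`.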

New Benchmarks & Tasks

A big wave of new evaluation tasks this release:

AIME and MATH500 math reasoning benchmarks by @jannalulu in #3248, #3311
BabiLong and Longbench v2 for long-context evaluation by @jannalulu in #3287, #3338
GraphWalks by @jannalulu in #3377
ZhoBLiMP, BLiMP-NL, TurBLiMP, LM-SynEval, and BHS linguistic benchmarks by @jmichaelov in #3218, #3221, #3219, #3184, #3265
Icelandic WinoGrande by @jmichaelov in #3277
CLIcK Korean benchmark by @shing100 in #3173
MMLU-Redux (generative) and Spanish translation by @luiscosio in #2705
EsBBQ and CaBBQ bias benchmarks by @valleruizf in #3167
EQBench in Spanish and Catalan by @priverabsc in #3168
Anthropic discrim-eval by @Helw150 in #3091
XNLI-VA by @FranValero97 in #3194
Bangla MMLU (Titulm) by @Ismail-Hossain-1 in #3317
HumanEval infilling by @its-alpesh in #3299
CNN-DailyMail 3.0.0 by @preordinary in #3426
Global PIQA and new acc_norm_bytes metric by @baberabb in #3368
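
The notes do not describe how acc_norm_bytes is computed. The sketch below assumes, from the name and by analogy with length-normalized accuracy, that each choice's log-likelihood is divided by the UTF-8 byte length of the choice text before the best choice is picked. `pick_choice` is a hypothetical helper, not the harness's implementation.

```python
def pick_choice(loglikelihoods, choices, normalize_by_bytes=True):
    """Return the index of the best-scoring choice.

    Assumption: byte normalization divides each log-likelihood by the
    UTF-8 byte length of the corresponding choice string.
    """
    scores = []
    for ll, text in zip(loglikelihoods, choices):
        # Guard against empty strings so we never divide by zero.
        denom = max(len(text.encode("utf-8")), 1) if normalize_by_bytes else 1
        scores.append(ll / denom)
    return max(range(len(scores)), key=scores.__getitem__)


choices = ["no", "certainly not"]
lls = [-4.0, -9.0]
raw_pick = pick_choice(lls, choices, normalize_by_bytes=False)
norm_pick = pick_choice(lls, choices, normalize_by_bytes=True)
```

With these toy numbers the raw score favors the short answer (index 0), while the byte-normalized score favors the longer one (index 1), since -9.0 / 13 is larger than -4.0 / 2. Byte length differs from character length for non-ASCII text, which is presumably why a byte-based variant is useful for multilingual tasks.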

Fixes & Improvements

Core Changes:

Python 3.10 minimum by @jannalulu in #3337
Unpinned datasets library by @baberabb in #3316
BOS token handling: Delegate to tokenizer; add_bos_token now defaults to None by @baberabb in #3347
Renamed LOGLEVEL env var to LMEVAL_LOG_LEVEL to avoid conflicts by @fxmarty-amd in #3418
Resolve duplicate task names with safeguards by @giuliolovisotto in #3394
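
For wrapper scripts that previously exported LOGLEVEL, the rename can be handled with a small helper like the one below. This is a sketch under stated assumptions: `resolve_log_level` and `legacy_only` are hypothetical names, and how the harness itself parses LMEVAL_LOG_LEVEL is not covered by these notes.

```python
import logging
import os


def resolve_log_level(environ=None, default="INFO") -> int:
    """Read LMEVAL_LOG_LEVEL and map it to a numeric logging level.

    Unknown level names fall back to logging.INFO.
    """
    env = os.environ if environ is None else environ
    name = env.get("LMEVAL_LOG_LEVEL", default)
    level = logging.getLevelName(name.upper())
    # getLevelName returns a string such as "Level FOO" for unknown names.
    return level if isinstance(level, int) else logging.INFO


def legacy_only(environ=None) -> bool:
    """Return True if only the old LOGLEVEL variable is set (needs migration)."""
    env = os.environ if environ is None else environ
    return "LOGLEVEL" in env and "LMEVAL_LOG_LEVEL" not in env


if legacy_only():
    print("LOGLEVEL is set but LMEVAL_LOG_LEVEL is not; update your environment.")
```

The helper deliberately does not fall back to LOGLEVEL, since the rename exists to avoid conflicts with other tools that read that variable.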

Task Fixes:

Fixed MMLU-Redux to exclude samples without error_type="ok" and display summary table by @fxmarty-amd in #3410, #3406
Fixed AIME answer extraction by @jannalulu in #3353
Fixed LongBench evaluation and group handling by @TimurAysin, @jannalulu in #3273, #3359, #3361
Fixed crows_pairs dataset by @jannalulu in #3378
Fixed Gemma tokenizer add_bos_token not updating by @DarkLight1337 in #3206
Fixed lambada_multilingual_stablelm by @jmichaelov, @HallerPatrick in #3294, #3222
Fixed CodeXGLUE by @gsaltintas in #3238
Pinned correct MMLUSR version by @christinaexyou in #3350
Updated minerva_math by @baberabb in #3259

Backend Fixes:

Fixed vLLM import errors when not installed by @fxmarty-amd in #3292
Fixed vLLM data_parallel_size>1 issue by @Dornavineeth in #3303
Resolved deprecated vllm.utils.get_open_port by @DarkLight1337 in #3398
Fixed GPT series model bugs by @zinccat in #3348
Fixed PIL image hashing to use actual bytes by @tboerstad in #3331
Fixed additional_config parsing by @brian-dellabetta in #3393
Fixed batch chunking seed handling with groupby by @slimfrkha in #3047
Fixed no-output error handling by @Oseltamivir in #3395
Replaced deprecated torch_dtype with dtype by @AbdulmalikDS in #3415
Fixed custom task config reading by @SkyR0ver in #3425
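
To illustrate why hashing an image's actual bytes matters, the sketch below contrasts the two approaches. `StubImage` is a hypothetical stand-in that exposes a `tobytes()` method like PIL images do, so the example runs without Pillow. It is not the harness's code, and the real fix may hash additional fields.

```python
import hashlib


class StubImage:
    """Hypothetical stand-in for an image object exposing tobytes()."""

    def __init__(self, pixels: bytes):
        self._pixels = pixels

    def tobytes(self) -> bytes:
        return self._pixels


def hash_by_repr(img) -> str:
    # The default repr embeds the object's memory address, so two images
    # with identical content produce different digests.
    return hashlib.sha256(repr(img).encode("utf-8")).hexdigest()


def hash_by_bytes(img) -> str:
    # Hashing the raw pixel bytes depends only on the content.
    return hashlib.sha256(img.tobytes()).hexdigest()


a = StubImage(b"\x00\x01\x02")
b = StubImage(b"\x00\x01\x02")
same_by_bytes = hash_by_bytes(a) == hash_by_bytes(b)
same_by_repr = hash_by_repr(a) == hash_by_repr(b)
```

Here `same_by_bytes` is `True` and `same_by_repr` is `False`, which shows how a repr-based key can miss cache hits for identical images.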

Model & Backend Support

OpenAI GPT-5 support by @babyplutokurt in #3247
Azure OpenAI support by @zinccat in #3349
Fine-tuned Gemma3 evaluation support by @LearnerSXH in #3234
OpenVINO text2text models by @nikita-savelyevv in #3101
Intel XPU support for HFLM by @kaixuanliu in #3211
Attention head steering support by @luciaquirke in #3279
Leverage vLLM's tokenizer_info endpoint to avoid manual duplication by @m-misiura in #3185

What's Changed

Remove trust_remote_code: True from updated datasets by @Avelina9X in #3213
Add support for evaluating with fine-tuned Gemma3 by @LearnerSXH in #3234
Fix add_bos_token not updated for Gemma tokenizer by @DarkLight1337 in #3206
remove incomplete compilation instructions, solves #3233 by @ceferisbarov in #3242
Update utils.py by @Anri-Lombard in #3246
Adding support for OpenAI GPT-5 model by @babyplutokurt in #3247
Add xnli_va dataset by @FranValero97 in #3194
Add ZhoBLiMP benchmark by @jmichaelov in #3218
Add BLiMP-NL by @jmichaelov in #3221
Add TurBLiMP by @jmichaelov in #3219
Add LM-SynEval Benchmark by @jmichaelov in #3184
Fix unknown group key to tag in yaml config for lambada_multilingual_stablelm by @HallerPatrick in #3222
update minerva_math by @baberabb in #3259
feat: Add CLIcK task by @shing100 in #3173
Adds Anthropic/discrim-eval to lm-evaluation-harness by @Helw150 in #3091
Add support for OpenVINO text2text generation models by @nikita-savelyevv in #3101
Update MMLU-ProX task by @weihao1115 in #3174
Support for AIME dataset by @jannalulu in #3248
feat(scrolls): delete chat_template from kwargs by @slimfrkha in #3267
pacify pre-commit by @baberabb in #3268
Fix codexglue by @gsaltintas in #3238
Add BHS benchmark by @jmichaelov in #3265
Add acc_norm metric to BLiMP-NL by @jmichaelov in #3272
Add acc_norm metric to ZhoBLiMP by @jmichaelov in #3271
Add EsBBQ and CaBBQ tasks by @valleruizf in #3167
Add support for steering individual attention heads by @luciaquirke in #3279
Add the Icelandic WinoGrande benchmark by @jmichaelov in #3277
Ignore seed when splitting batch in chunks with groupby by @slimfrkha in #3047
[fix][vllm] Avoid import errors in case vllm is not installed by @fxmarty-amd in #3292
Fix LongBench Evaluation by @TimurAysin in #3273
add intel xpu support for HFLM by @kaixuanliu in #3211
feat: Add mmlu-redux and its Spanish translation as generative task definitions by @luiscosio in #2705
Add BabiLong by @jannalulu in #3287
Add AIME to task description by @jannalulu in #3296
Add humaneval_infilling task by @its-alpesh in #3299
Add eqbench tasks in Spanish and Catalan by @priverabsc in #3168
[fix] add math and longbench to test dependencies by @jannalulu in #3321
Fix: VLLM model when data_parallel_size>1 by @Dornavineeth in #3303
unpin datasets; update pre-commit by @baberabb in #3316
bump to python 3.10 by @jannalulu in #3337
Longbench v2 by @jannalulu in #3338
Leverage vllm's tokenizer_info endpoint to avoid manual duplication by @m-misiura in #3185
Add support for Titulm Bangla MMLU dataset by @Ismail-Hossain-1 in #3317
remove duplicate tags/groups by @baberabb in #3343
Align humaneval_64_instruct task label in README to name in yaml file by @jmichaelov in #3344
Fixes bugs when using gpt series model by @zinccat in #3348
[fix] aime doesn't extract answers by @jannalulu in #3353
add global_piqa; add acc_norm_bytes metric by @baberabb in #3368
[fix] crows_pairs dataset by @jannalulu in #3378
Fix issue 3355 assertion error by @marksverdhei in #3356
fix(gsm8k): align README to yaml file by @neoheartbeats in #3388
added azure openai support by @zinccat in #3349
Delegate BOS to the tokenizer; add_bos_token defaults to None by @baberabb in #3347
fix trust_remote_code=True for longbench by @jannalulu in #3361
[feat] add graphwalks by @jannalulu in #3377
Longbench group fix by @jannalulu in #3359
Resolve deprecation of vllm.utils.get_open_port by @DarkLight1337 in #3398
Trim whitespace in remove_whitespace filter by @ziqing-huang in #3408
Fixes #3391 avoid error on no-output by @Oseltamivir in #3395
Fix PIL image hashing to use actual bytes instead of object repr by @tboerstad in #3331
[MMLU redux] Do not use samples which do not have error_type="ok" by @fxmarty-amd in #3410
fix: resolve duplicate task names and add safeguards. by @giuliolovisotto in #3394
Add MATH500 by @jannalulu in #3311
[bugfix] additional_config parsing by @brian-dellabetta in #3393
fix(tasks):pin correct MMLUSR version by @christinaexyou in #3350
Fix lambada_multilingual_stablelm by @jmichaelov in #3294
Fix descriptions in the Moral Stories and Histoires Morales tasks. by @upunaprosk in #3374
Replace deprecated torch_dtype parameter with dtype by @AbdulmalikDS in #3415
[fix] Fix mmlu_redux not displaying summary table + display to the user the tasks / yaml that are actually pulled by @fxmarty-amd in #3406
Rename the conflicting environment variable LOGLEVEL to LMEVAL_LOG_LEVEL by @fxmarty-amd in #3418
Update SGLang installation and documentation links by @Bobchenyx in #3422
Fix reading custom task configs by @SkyR0ver in #3425
New Task: Add CNN-DailyMail (3.0.0) by @preordinary in #3426

New Contributors

@LearnerSXH made their first contribution in #3234
@ceferisbarov made their first contribution in #3242
@Anri-Lombard made their first contribution in #3246
@babyplutokurt made their first contribution in #3247
@FranValero97 made their first contribution in #3194
@HallerPatrick made their first contribution in #3222
@Helw150 made their first contribution in #3091
@nikita-savelyevv made their first contribution in #3101
@weihao1115 made their first contribution in #3174
@jannalulu made their first contribution in #3248
@slimfrkha made their first contribution in #3267
@gsaltintas made their first contribution in #3238
@valleruizf made their first contribution in #3167
@TimurAysin made their first contribution in #3273
@kaixuanliu made their first contribution in #3211
@its-alpesh made their first contribution in #3299
@priverabsc made their first contribution in #3168
@Dornavineeth made their first contribution in #3303
@m-misiura made their first contribution in #3185
@Ismail-Hossain-1 made their first contribution in #3317
@zinccat made their first contribution in #3348
@marksverdhei made their first contribution in #3356
@neoheartbeats made their first contribution in #3388
@ziqing-huang made their first contribution in #3408
@Oseltamivir made their first contribution in #3395
@tboerstad made their first contribution in #3331
@brian-dellabetta made their first contribution in #3393
@christinaexyou made their first contribution in #3350
@AbdulmalikDS made their first contribution in #3415
@Bobchenyx made their first contribution in #3422
@SkyR0ver made their first contribution in #3425
@preordinary made their first contribution in #3426

Full Changelog: v0.4.9.1...v0.4.9.2
