Releases: EleutherAI/lm-evaluation-harness
v0.4.1
Release Notes
This release contains all changes since v0.4.0, and also serves as a partial test of our release automation, provided by @anjor .
At a high level, some of the changes include:
- Data-parallel inference using vLLM (contributed by @baberabb )
- A major fix to Hugging Face model generation: previously, in v0.4.0, a bug in stop-sequence handling sometimes caused generations to be cut off too early.
- Miscellaneous documentation updates
- A number of new tasks, and bugfixes to old tasks!
- Support for OpenAI-like API models using `local-completions` or `local-chat-completions` (thanks to @veekaybee @mgoin @anjor and others for this)!
- Integration with tools for visualization of results, such as with Zeno, and WandB coming soon!
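As a sketch of the new API-model support, evaluating against a locally served OpenAI-compatible endpoint might look like the following (the URL, model name, and `model_args` keys here are illustrative; check the README for the exact arguments your server setup requires):

```shell
# Evaluate a model behind an OpenAI-style /v1/completions endpoint.
lm_eval --model local-completions \
    --model_args model=my-served-model,base_url=http://localhost:8000/v1/completions \
    --tasks hellaswag \
    --batch_size 16
```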
More frequent (minor) version releases may be done in the future, to make it easier for PyPI users!
We're very pleased by the uptick in interest in LM Evaluation Harness recently, and we hope to continue to improve the library as time goes on. We're grateful to everyone who's contributed, and are excited by how many new contributors this version brings! If you have feedback for us, or would like to help out developing the library, please let us know.
In the next version release, we hope to include
- Chat Templating + System Prompt support, for locally-run models
- Improved Answer Extraction for many generative tasks, making them more easily run zero-shot and less dependent on model output formatting
- General speedups and QoL fixes to the non-inference portions of LM-Evaluation-Harness, including drastically reduced startup times / faster non-inference processing steps especially when num_fewshot is large!
- A new `TaskManager` object and the deprecation of `lm_eval.tasks.initialize_tasks()`, making it easier to register many tasks and configure new groups of tasks
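Based on that description, usage of the planned `TaskManager` object might look roughly like this (a sketch of an upcoming API, not final code; the `include_path` argument and keyword names are assumptions):

```python
import lm_eval
from lm_eval.tasks import TaskManager

# Index the built-in task YAMLs plus any custom task configs
# in a local directory, replacing initialize_tasks().
task_manager = TaskManager(include_path="./my_custom_tasks")

results = lm_eval.simple_evaluate(
    model="hf",
    model_args="pretrained=EleutherAI/pythia-160m",
    tasks=["lambada_openai"],
    task_manager=task_manager,
)
```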
What's Changed
- Announce v0.4.0 in README by @haileyschoelkopf in #1061
- remove commented planned samplers in `lm_eval/api/samplers.py` by @haileyschoelkopf in #1062
- Confirming links in docs work (WIP) by @haileyschoelkopf in #1065
- Set actual version to v0.4.0 by @haileyschoelkopf in #1064
- Updating docs hyperlinks by @haileyschoelkopf in #1066
- Fiddling with READMEs, Reenable CI tests on `main` by @haileyschoelkopf in #1063
- Update _cot_fewshot_template_yaml by @lintangsutawika in #1074
- Patch scrolls by @lintangsutawika in #1077
- Update template of qqp dataset by @shiweijiezero in #1097
- Change the sub-task name from sst to sst2 in glue by @shiweijiezero in #1099
- Add kmmlu evaluation to tasks by @h-albert-lee in #1089
- Fix stderr by @lintangsutawika in #1106
- Simplified `evaluator.py` by @lintangsutawika in #1104
- [Refactor] vllm data parallel by @baberabb in #1035
- Unpack group in `write_out` by @baberabb in #1113
- Revert "Simplified `evaluator.py`" by @lintangsutawika in #1116
- `qqp`, `mnli_mismatch`: remove unlabeled test sets by @baberabb in #1114
- fix: bug of BBH_cot_fewshot by @Momo-Tori in #1118
- Bump BBH version by @haileyschoelkopf in #1120
- Refactor `hf` modeling code by @haileyschoelkopf in #1096
- Additional process for doc_to_choice by @lintangsutawika in #1093
- doc_to_decontamination_query can use function by @lintangsutawika in #1082
- Fix vllm `batch_size` type by @xTayEx in #1128
- fix: passing max_length to vllm engine args by @NanoCode012 in #1124
- Fix Loading Local Dataset by @lintangsutawika in #1127
- place model onto `mps` by @baberabb in #1133
- Add benchmark FLD by @MorishT in #1122
- fix typo in README.md by @lennijusten in #1136
- add correct openai api key to README.md by @lennijusten in #1138
- Update Linter CI Job by @haileyschoelkopf in #1130
- add utils.clear_torch_cache() to model_comparator by @baberabb in #1142
- Enabling OpenAI completions via gooseai by @veekaybee in #1141
- vllm clean up tqdm by @baberabb in #1144
- openai nits by @baberabb in #1139
- Add IFEval / Instruction-Following Eval by @wiskojo in #1087
- set `--gen_kwargs` arg to None by @baberabb in #1145
- Add shorthand flags by @baberabb in #1149
- fld bugfix by @baberabb in #1150
- Remove GooseAI docs and change no-commit-to-branch precommit hook by @veekaybee in #1154
- Add docs on adding a multiple choice metric by @polm-stability in #1147
- Simplify evaluator by @lintangsutawika in #1126
- Generalize Qwen tokenizer fix by @haileyschoelkopf in #1146
- self.device in huggingface.py line 210 treated as torch.device but might be a string by @pminervini in #1172
- Fix Column Naming and Dataset Naming Conventions in K-MMLU Evaluation by @seungduk-yanolja in #1171
- feat: add option to upload results to Zeno by @Sparkier in #990
- Switch Linting to `ruff` by @baberabb in #1166
- Error in --num_fewshot option for K-MMLU Evaluation Harness by @guijinSON in #1178
- Implementing local OpenAI API-style chat completions on any given inference server by @veekaybee in #1174
- Update README.md by @anjor in #1184
- Update README.md by @anjor in #1183
- Add tokenizer backend by @anjor in #1186
- Correctly Print Task Versioning by @haileyschoelkopf in #1173
- update Zeno example and reference in README by @Sparkier in #1190
- Remove tokenizer for openai chat completions by @anjor in #1191
- Update README.md by @anjor in #1181
- disable `mypy` by @baberabb in #1193
- Generic decorator for handling rate limit errors by @zachschillaci27 in #1109
- Refer in README to main branch by @BramVanroy in #1200
- Hardcode 0-shot for fewshot Minerva Math tasks by @haileyschoelkopf in #1189
- Upstream Mamba Support (`mamba_ssm`) by @haileyschoelkopf in #1110
- Update cuda handling by @anjor in #1180
- Fix documentation in API table by @haileyschoelkopf in #1203
- Consolidate batching by @baberabb in #1197
- Add remove_whitespace to FLD benchmark by @MorishT in #1206
- Fix the argument order in `utils.divide` doc by @xTayEx in #1208
- [Fix #1211] pin vllm at < 0.2.6 by @haileyschoelkopf in #1212
- fix unbounded local variable by @onnoo in #1218
- nits + fix siqa by @baberabb in #1216
- add length of strings and answer options to Zeno met...
v0.4.0
What's Changed
- Replace stale `triviaqa` dataset link by @jon-tow in #364
- Update `actions/setup-python` in CI workflows by @jon-tow in #365
- Bump `triviaqa` version by @jon-tow in #366
- Update `lambada_openai` multilingual data source by @jon-tow in #370
- Update Pile Test/Val Download URLs by @fattorib in #373
- Added ToxiGen task by @Thartvigsen in #377
- Added CrowSPairs by @aflah02 in #379
- Add accuracy metric to crows-pairs by @haileyschoelkopf in #380
- hotfix(gpt2): Remove vocab-size logits slice by @jon-tow in #384
- Enable "low_cpu_mem_usage" to reduce the memory usage of HF models by @sxjscience in #390
- Upstream `hf-causal` and `hf-seq2seq` model implementations by @haileyschoelkopf in #381
- Hosting arithmetic dataset on HuggingFace by @fattorib in #391
- Hosting wikitext on HuggingFace by @fattorib in #396
- Change device parameter to cuda:0 to avoid runtime error by @Jeffwan in #403
- Update README installation instructions by @haileyschoelkopf in #407
- feat: evaluation using peft models with CLM by @zanussbaum in #414
- Update setup.py dependencies by @ret2libc in #416
- fix: add seq2seq peft by @zanussbaum in #418
- Add support for load_in_8bit and trust_remote_code model params by @philwee in #422
- Hotfix: patch issues with the `huggingface.py` model classes by @haileyschoelkopf in #427
- Continuing work on refactor [WIP] by @haileyschoelkopf in #425
- Document task name wildcard support in README by @haileyschoelkopf in #435
- Add non-programmatic BIG-bench-hard tasks by @yurodiviy in #406
- Updated handling for device in lm_eval/models/gpt2.py by @nikhilpinnaparaju in #447
- [WIP, Refactor] Staging more changes by @haileyschoelkopf in #465
- [Refactor, WIP] Multiple Choice + loglikelihood_rolling support for YAML tasks by @haileyschoelkopf in #467
- Configurable-Tasks by @lintangsutawika in #438
- single GPU automatic batching logic by @fattorib in #394
- Fix bugs introduced in #394 #406 and max length bug by @juletx in #472
- Sort task names to keep the same order always by @juletx in #474
- Set PAD token to EOS token by @nikhilpinnaparaju in #448
- [Refactor] Add decorator for registering YAMLs as tasks by @haileyschoelkopf in #486
- fix adaptive batch crash when there are no new requests by @jquesnelle in #490
- Add multilingual datasets (XCOPA, XStoryCloze, XWinograd, PAWS-X, XNLI, MGSM) by @juletx in #426
- Create output path directory if necessary by @janEbert in #483
- Add results of various models in json and md format by @juletx in #477
- Update config by @lintangsutawika in #501
- P3 prompt task by @lintangsutawika in #493
- Evaluation Against Portion of Benchmark Data by @kenhktsui in #480
- Add option to dump prompts and completions to a JSON file by @juletx in #492
- Add perplexity task on arbitrary JSON data by @janEbert in #481
- Update config by @lintangsutawika in #520
- Data Parallelism by @fattorib in #488
- Fix mgpt fewshot by @lintangsutawika in #522
- Extend `dtype` command line flag to `HFLM` by @haileyschoelkopf in #523
- Add support for loading GPTQ models via AutoGPTQ by @gakada in #519
- Change type signature of `quantized` and its default value for python < 3.11 compatibility by @passaglia in #532
- Fix LLaMA tokenization issue by @gakada in #531
- [Refactor] Make promptsource an extra / not required for installation by @haileyschoelkopf in #542
- Move spaces from context to continuation by @gakada in #546
- Use max_length in AutoSeq2SeqLM by @gakada in #551
- Fix typo by @kwikiel in #557
- Add load_in_4bit and fix peft loading by @gakada in #556
- Update task_guide.md by @haileyschoelkopf in #564
- [Refactor] Non-greedy generation ; WIP GSM8k yaml by @haileyschoelkopf in #559
- Dataset metric log [WIP] by @lintangsutawika in #560
- Add Anthropic support by @zphang in #562
- Add MultipleChoiceExactTask by @gakada in #537
- Revert "Add MultipleChoiceExactTask" by @StellaAthena in #568
- [Refactor] [WIP] New YAML advanced docs by @haileyschoelkopf in #567
- Remove the registration of "GPT2" as a model type by @StellaAthena in #574
- [Refactor] Docs update by @haileyschoelkopf in #577
- Better docs by @lintangsutawika in #576
- Update evaluator.py cache_db argument str if model is not str by @poedator in #575
- Add --max_batch_size and --batch_size auto:N by @gakada in #572
- [Refactor] ALL_TASKS now maintained (not static) by @haileyschoelkopf in #581
- Fix seqlen issues for bloom, remove extraneous OPT tokenizer check by @haileyschoelkopf in #582
- Fix non-callable attributes in CachingLM by @gakada in #584
- Add error handling for calling `.to(device)` by @haileyschoelkopf in #585
- fixes some minor issues on tasks. by @lintangsutawika in #580
- Add - 4bit-related args by @SONG-WONHO in #579
- Fix triviaqa task by @seopbo in #525
- [Refactor] Addressing Feedback on new docs pages by @haileyschoelkopf in #578
- Logging Samples by @farzanehnakhaee70 in #563
- Merge master into big-refactor by @gakada in #590
- [Refactor] Package YAMLs alongside pip installations of lm-eval by @haileyschoelkopf in #596
- fixes for multiple_choice by @lintangsutawika in #598
- add openbookqa config by @farzanehnakhaee70 in #600
- [Refactor] Model guide docs by @haileyschoelkopf in #606
- [Refactor] More MCQA fixes by @haileyschoelkopf in #599
- [Refactor] Hellaswag by @nopperl in #608
- [Refactor] Seq2Seq Models with Multi-Device Support ...
v0.3.0
HuggingFace Datasets Integration
This release integrates HuggingFace datasets as the core dataset management interface, removing previous custom downloaders.
What's Changed
- Refactor `Task` downloading to use `HuggingFace.datasets` by @jon-tow in #300
- Add templates and update docs by @jon-tow in #308
- Add dataset features to `TriviaQA` by @jon-tow in #305
- Add `SWAG` by @jon-tow in #306
- Fixes for using lm_eval as a library by @dirkgr in #309
- Researcher2 by @researcher2 in #261
- Suggested updates for the task guide by @StephenHogg in #301
- Add pre-commit by @Mistobaan in #317
- Decontam import fix by @jon-tow in #321
- Add bootstrap_iters kwarg by @Muennighoff in #322
- Update decontamination.md by @researcher2 in #331
- Fix key access in squad evaluation metrics by @konstantinschulz in #333
- Fix make_disjoint_window for tail case by @richhankins in #336
- Manually concat tokenizer revision with subfolder by @jon-tow in #343
- [deps] Use minimum versioning for `numexpr` by @jon-tow in #352
- Remove custom datasets that are in HF by @jon-tow in #330
- Add `TextSynth` API by @jon-tow in #299
- Add the original `LAMBADA` dataset by @jon-tow in #357
New Contributors
- @dirkgr made their first contribution in #309
- @Mistobaan made their first contribution in #317
- @konstantinschulz made their first contribution in #333
- @richhankins made their first contribution in #336
Full Changelog: v0.2.0...v0.3.0
v0.2.0
Major changes since 0.1.0:
- added blimp (#237)
- added qasper (#264)
- added asdiv (#244)
- added truthfulqa (#219)
- added gsm (#260)
- implemented description dict and deprecated `provide_description` (#226)
- new `--check_integrity` flag to run integrity unit tests at eval time (#290)
- positional arguments to `evaluate` and `simple_evaluate` are now deprecated
- `_CITATION` attribute on task modules (#292)
- lots of bug fixes and task fixes (always remember to report task versions for comparability!)
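The `--check_integrity` flag above can be added to an ordinary evaluation run; as a sketch (the model and task shown are illustrative, and this assumes the `main.py` entry point used in this release):

```shell
# Run an eval and also execute the integrity unit tests for the selected tasks.
python main.py --model gpt2 --tasks blimp --check_integrity
```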