Releases: EleutherAI/lm-evaluation-harness
v0.4.1
Release Notes
This release contains all changes since v0.4.0, and also serves as a partial test of our release automation, provided by @anjor .
At a high level, some of the changes include:
- Data-parallel inference using vLLM (contributed by @baberabb )
- A major fix to Hugging Face model generation: previously, in v0.4.0, a bug in stop-sequence handling sometimes caused generations to be cut off too early.
- Miscellaneous documentation updates
- A number of new tasks, and bugfixes to old tasks!
- Support for OpenAI-like API models using `local-completions` or `local-chat-completions` (thanks to @veekaybee @mgoin @anjor and others for this)!
- Integration with tools for visualization of results, such as with Zeno, and WandB coming soon!
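As a sketch of the new API-model support, evaluating against a locally served OpenAI-compatible endpoint might look like the following (the URL, model name, and `model_args` keys here are illustrative; check the README for the exact arguments your server setup requires):

```shell
# Evaluate a model behind an OpenAI-style /v1/completions endpoint.
lm_eval --model local-completions \
    --model_args model=my-served-model,base_url=http://localhost:8000/v1/completions \
    --tasks hellaswag \
    --batch_size 16
```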
More frequent (minor) version releases may be done in the future, to make it easier for PyPI users!
We're very pleased by the uptick in interest in LM Evaluation Harness recently, and we hope to continue to improve the library as time goes on. We're grateful to everyone who's contributed, and are excited by how many new contributors this version brings! If you have feedback for us, or would like to help out developing the library, please let us know.
In the next version release, we hope to include
- Chat Templating + System Prompt support, for locally-run models
- Improved Answer Extraction for many generative tasks, making them more easily run zero-shot and less dependent on model output formatting
- General speedups and QoL fixes to the non-inference portions of LM-Evaluation-Harness, including drastically reduced startup times / faster non-inference processing steps especially when num_fewshot is large!
- A new `TaskManager` object and the deprecation of `lm_eval.tasks.initialize_tasks()`, making it easier to register many tasks and configure new groups of tasks
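Based on that description, usage of the planned `TaskManager` object might look roughly like this (a sketch of an upcoming API, not final code; the `include_path` argument and keyword names are assumptions):

```python
import lm_eval
from lm_eval.tasks import TaskManager

# Index the built-in task YAMLs plus any custom task configs
# in a local directory, replacing initialize_tasks().
task_manager = TaskManager(include_path="./my_custom_tasks")

results = lm_eval.simple_evaluate(
    model="hf",
    model_args="pretrained=EleutherAI/pythia-160m",
    tasks=["lambada_openai"],
    task_manager=task_manager,
)
```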
What's Changed
- Announce v0.4.0 in README by @haileyschoelkopf in #1061
- remove commented planned samplers in `lm_eval/api/samplers.py` by @haileyschoelkopf in #1062
- Confirming links in docs work (WIP) by @haileyschoelkopf in #1065
- Set actual version to v0.4.0 by @haileyschoelkopf in #1064
- Updating docs hyperlinks by @haileyschoelkopf in #1066
- Fiddling with READMEs, Reenable CI tests on `main` by @haileyschoelkopf in #1063
- Update _cot_fewshot_template_yaml by @lintangsutawika in #1074
- Patch scrolls by @lintangsutawika in #1077
- Update template of qqp dataset by @shiweijiezero in #1097
- Change the sub-task name from sst to sst2 in glue by @shiweijiezero in #1099
- Add kmmlu evaluation to tasks by @h-albert-lee in #1089
- Fix stderr by @lintangsutawika in #1106
- Simplified `evaluator.py` by @lintangsutawika in #1104
- [Refactor] vllm data parallel by @baberabb in #1035
- Unpack group in `write_out` by @baberabb in #1113
- Revert "Simplified `evaluator.py`" by @lintangsutawika in #1116
- `qqp`, `mnli_mismatch`: remove unlabeled test sets by @baberabb in #1114
- fix: bug of BBH_cot_fewshot by @Momo-Tori in #1118
- Bump BBH version by @haileyschoelkopf in #1120
- Refactor `hf` modeling code by @haileyschoelkopf in #1096
- Additional process for doc_to_choice by @lintangsutawika in #1093
- doc_to_decontamination_query can use function by @lintangsutawika in #1082
- Fix vllm `batch_size` type by @xTayEx in #1128
- fix: passing max_length to vllm engine args by @NanoCode012 in #1124
- Fix Loading Local Dataset by @lintangsutawika in #1127
- place model onto `mps` by @baberabb in #1133
- Add benchmark FLD by @MorishT in #1122
- fix typo in README.md by @lennijusten in #1136
- add correct openai api key to README.md by @lennijusten in #1138
- Update Linter CI Job by @haileyschoelkopf in #1130
- add utils.clear_torch_cache() to model_comparator by @baberabb in #1142
- Enabling OpenAI completions via gooseai by @veekaybee in #1141
- vllm clean up tqdm by @baberabb in #1144
- openai nits by @baberabb in #1139
- Add IFEval / Instruction-Following Eval by @wiskojo in #1087
- set `--gen_kwargs` arg to None by @baberabb in #1145
- Add shorthand flags by @baberabb in #1149
- fld bugfix by @baberabb in #1150
- Remove GooseAI docs and change no-commit-to-branch precommit hook by @veekaybee in #1154
- Add docs on adding a multiple choice metric by @polm-stability in #1147
- Simplify evaluator by @lintangsutawika in #1126
- Generalize Qwen tokenizer fix by @haileyschoelkopf in #1146
- self.device in huggingface.py line 210 treated as torch.device but might be a string by @pminervini in #1172
- Fix Column Naming and Dataset Naming Conventions in K-MMLU Evaluation by @seungduk-yanolja in #1171
- feat: add option to upload results to Zeno by @Sparkier in #990
- Switch Linting to `ruff` by @baberabb in #1166
- Error in --num_fewshot option for K-MMLU Evaluation Harness by @guijinSON in #1178
- Implementing local OpenAI API-style chat completions on any given inference server by @veekaybee in #1174
- Update README.md by @anjor in #1184
- Update README.md by @anjor in #1183
- Add tokenizer backend by @anjor in #1186
- Correctly Print Task Versioning by @haileyschoelkopf in #1173
- update Zeno example and reference in README by @Sparkier in #1190
- Remove tokenizer for openai chat completions by @anjor in #1191
- Update README.md by @anjor in #1181
- disable `mypy` by @baberabb in #1193
- Generic decorator for handling rate limit errors by @zachschillaci27 in #1109
- Refer in README to main branch by @BramVanroy in #1200
- Hardcode 0-shot for fewshot Minerva Math tasks by @haileyschoelkopf in #1189
- Upstream Mamba Support (`mamba_ssm`) by @haileyschoelkopf in #1110
- Update cuda handling by @anjor in #1180
- Fix documentation in API table by @haileyschoelkopf in #1203
- Consolidate batching by @baberabb in #1197
- Add remove_whitespace to FLD benchmark by @MorishT in #1206
- Fix the argument order in `utils.divide` doc by @xTayEx in #1208
- [Fix #1211] pin vllm at < 0.2.6 by @haileyschoelkopf in #1212
- fix unbounded local variable by @onnoo in #1218
- nits + fix siqa by @baberabb in #1216
- add length of strings and answer options to Zeno met...
v0.4.0
What's Changed
- Replace stale `triviaqa` dataset link by @jon-tow in #364
- Update `actions/setup-python` in CI workflows by @jon-tow in #365
- Bump `triviaqa` version by @jon-tow in #366
- Update `lambada_openai` multilingual data source by @jon-tow in #370
- Update Pile Test/Val Download URLs by @fattorib in #373
- Added ToxiGen task by @Thartvigsen in #377
- Added CrowSPairs by @aflah02 in #379
- Add accuracy metric to crows-pairs by @haileyschoelkopf in #380
- hotfix(gpt2): Remove vocab-size logits slice by @jon-tow in #384
- Enable "low_cpu_mem_usage" to reduce the memory usage of HF models by @sxjscience in #390
- Upstream `hf-causal` and `hf-seq2seq` model implementations by @haileyschoelkopf in #381
- Hosting arithmetic dataset on HuggingFace by @fattorib in #391
- Hosting wikitext on HuggingFace by @fattorib in #396
- Change device parameter to cuda:0 to avoid runtime error by @Jeffwan in #403
- Update README installation instructions by @haileyschoelkopf in #407
- feat: evaluation using peft models with CLM by @zanussbaum in #414
- Update setup.py dependencies by @ret2libc in #416
- fix: add seq2seq peft by @zanussbaum in #418
- Add support for load_in_8bit and trust_remote_code model params by @philwee in #422
- Hotfix: patch issues with the `huggingface.py` model classes by @haileyschoelkopf in #427
- Continuing work on refactor [WIP] by @haileyschoelkopf in #425
- Document task name wildcard support in README by @haileyschoelkopf in #435
- Add non-programmatic BIG-bench-hard tasks by @yurodiviy in #406
- Updated handling for device in lm_eval/models/gpt2.py by @nikhilpinnaparaju in #447
- [WIP, Refactor] Staging more changes by @haileyschoelkopf in #465
- [Refactor, WIP] Multiple Choice + loglikelihood_rolling support for YAML tasks by @haileyschoelkopf in #467
- Configurable-Tasks by @lintangsutawika in #438
- single GPU automatic batching logic by @fattorib in #394
- Fix bugs introduced in #394 #406 and max length bug by @juletx in #472
- Sort task names to keep the same order always by @juletx in #474
- Set PAD token to EOS token by @nikhilpinnaparaju in #448
- [Refactor] Add decorator for registering YAMLs as tasks by @haileyschoelkopf in #486
- fix adaptive batch crash when there are no new requests by @jquesnelle in #490
- Add multilingual datasets (XCOPA, XStoryCloze, XWinograd, PAWS-X, XNLI, MGSM) by @juletx in #426
- Create output path directory if necessary by @janEbert in #483
- Add results of various models in json and md format by @juletx in #477
- Update config by @lintangsutawika in #501
- P3 prompt task by @lintangsutawika in #493
- Evaluation Against Portion of Benchmark Data by @kenhktsui in #480
- Add option to dump prompts and completions to a JSON file by @juletx in #492
- Add perplexity task on arbitrary JSON data by @janEbert in #481
- Update config by @lintangsutawika in #520
- Data Parallelism by @fattorib in #488
- Fix mgpt fewshot by @lintangsutawika in #522
- Extend `dtype` command line flag to `HFLM` by @haileyschoelkopf in #523
- Add support for loading GPTQ models via AutoGPTQ by @gakada in #519
- Change type signature of `quantized` and its default value for python < 3.11 compatibility by @passaglia in #532
- Fix LLaMA tokenization issue by @gakada in #531
- [Refactor] Make promptsource an extra / not required for installation by @haileyschoelkopf in #542
- Move spaces from context to continuation by @gakada in #546
- Use max_length in AutoSeq2SeqLM by @gakada in #551
- Fix typo by @kwikiel in #557
- Add load_in_4bit and fix peft loading by @gakada in #556
- Update task_guide.md by @haileyschoelkopf in #564
- [Refactor] Non-greedy generation ; WIP GSM8k yaml by @haileyschoelkopf in #559
- Dataset metric log [WIP] by @lintangsutawika in #560
- Add Anthropic support by @zphang in #562
- Add MultipleChoiceExactTask by @gakada in #537
- Revert "Add MultipleChoiceExactTask" by @StellaAthena in #568
- [Refactor] [WIP] New YAML advanced docs by @haileyschoelkopf in #567
- Remove the registration of "GPT2" as a model type by @StellaAthena in #574
- [Refactor] Docs update by @haileyschoelkopf in #577
- Better docs by @lintangsutawika in #576
- Update evaluator.py cache_db argument str if model is not str by @poedator in #575
- Add --max_batch_size and --batch_size auto:N by @gakada in #572
- [Refactor] ALL_TASKS now maintained (not static) by @haileyschoelkopf in #581
- Fix seqlen issues for bloom, remove extraneous OPT tokenizer check by @haileyschoelkopf in #582
- Fix non-callable attributes in CachingLM by @gakada in #584
- Add error handling for calling `.to(device)` by @haileyschoelkopf in #585
- fixes some minor issues on tasks. by @lintangsutawika in #580
- Add - 4bit-related args by @SONG-WONHO in #579
- Fix triviaqa task by @seopbo in #525
- [Refactor] Addressing Feedback on new docs pages by @haileyschoelkopf in #578
- Logging Samples by @farzanehnakhaee70 in #563
- Merge master into big-refactor by @gakada in #590
- [Refactor] Package YAMLs alongside pip installations of lm-eval by @haileyschoelkopf in #596
- fixes for multiple_choice by @lintangsutawika in #598
- add openbookqa config by @farzanehnakhaee70 in #600
- [Refactor] Model guide docs by @haileyschoelkopf in #606
- [Refactor] More MCQA fixes by @haileyschoelkopf in #599
- [Refactor] Hellaswag by @nopperl in #608
- [Refactor] Seq2Seq Models with Multi-Device Support ...
v0.3.0
HuggingFace Datasets Integration
This release integrates HuggingFace datasets as the core dataset management interface, removing previous custom downloaders.
What's Changed
- Refactor `Task` downloading to use `HuggingFace.datasets` by @jon-tow in #300
- Add templates and update docs by @jon-tow in #308
- Add dataset features to `TriviaQA` by @jon-tow in #305
- Add `SWAG` by @jon-tow in #306
- Fixes for using lm_eval as a library by @dirkgr in #309
- Researcher2 by @researcher2 in #261
- Suggested updates for the task guide by @StephenHogg in #301
- Add pre-commit by @Mistobaan in #317
- Decontam import fix by @jon-tow in #321
- Add bootstrap_iters kwarg by @Muennighoff in #322
- Update decontamination.md by @researcher2 in #331
- Fix key access in squad evaluation metrics by @konstantinschulz in #333
- Fix make_disjoint_window for tail case by @richhankins in #336
- Manually concat tokenizer revision with subfolder by @jon-tow in #343
- [deps] Use minimum versioning for `numexpr` by @jon-tow in #352
- Remove custom datasets that are in HF by @jon-tow in #330
- Add `TextSynth` API by @jon-tow in #299
- Add the original `LAMBADA` dataset by @jon-tow in #357
New Contributors
- @dirkgr made their first contribution in #309
- @Mistobaan made their first contribution in #317
- @konstantinschulz made their first contribution in #333
- @richhankins made their first contribution in #336
Full Changelog: v0.2.0...v0.3.0
v0.2.0
Major changes since 0.1.0:
- added blimp (#237)
- added qasper (#264)
- added asdiv (#244)
- added truthfulqa (#219)
- added gsm (#260)
- implemented description dict and deprecated `provide_description` (#226)
- new `--check_integrity` flag to run integrity unit tests at eval time (#290)
- positional arguments to `evaluate` and `simple_evaluate` are now deprecated
- `_CITATION` attribute on task modules (#292)
- lots of bug fixes and task fixes (always remember to report task versions for comparability!)
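The `--check_integrity` flag above can be added to an ordinary evaluation run; as a sketch (the model and task shown are illustrative, and this assumes the `main.py` entry point used in this release):

```shell
# Run an eval and also execute the integrity unit tests for the selected tasks.
python main.py --model gpt2 --tasks blimp --check_integrity
```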