Releases: modelscope/evalscope
v1.4.1
Benchmark Datasets
- Named Entity Recognition: Added 12 NER datasets
- Speech Recognition: Added the TORGO dataset for dysarthric speech recognition with SemScore evaluation
- Multimodal Evaluation: Added RefCOCO referring expression comprehension benchmark
- Code Evaluation: Added Terminal-bench for terminal command capability assessment
Feature Enhancements
- Performance Testing: Added SLA auto-tuning to streamline performance testing
- Service Mode: Added asynchronous service support and a Gradio UI
- Data Loading: Improved local JSONL dataset loading (see the sketch below)
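For illustration, a minimal sketch of evaluating against a local JSONL dataset via the Python API follows. `TaskConfig` and `run_task` are EvalScope's documented entry points; the `general_qa` adapter name and the `local_path`/`subset_list` keys are assumptions based on earlier documentation, so verify them against the current docs.

```python
# Minimal sketch: evaluate a model on a local JSONL dataset.
# Assumption: the `general_qa` adapter and its `local_path`/`subset_list`
# keys follow earlier EvalScope docs; verify against the current docs.
from evalscope import TaskConfig, run_task

task_cfg = TaskConfig(
    model='Qwen/Qwen2.5-0.5B-Instruct',
    datasets=['general_qa'],
    dataset_args={
        'general_qa': {
            'local_path': 'custom_eval/text/qa',  # directory with *.jsonl files
            'subset_list': ['example'],           # evaluates example.jsonl
        }
    },
)
run_task(task_cfg=task_cfg)
```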
Bug Fixes
- Fixed HallusionBench data loading issues
- Fixed SSE chunk handling in streaming response parsing (illustrated in the sketch below)
- Fixed SemScore computation errors
- Fixed eval_config loading issues
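For context on the streaming fix, the sketch below shows the kind of normalization involved: a single read buffer may carry several SSE events and may use `\r\n` line endings. This is an illustrative parser, not EvalScope's actual implementation.

```python
# Illustrative SSE parser (not EvalScope's code): normalizes \r\n line
# endings and handles buffers containing multiple events.
def iter_sse_payloads(raw: bytes):
    text = raw.decode('utf-8').replace('\r\n', '\n')
    for block in text.split('\n\n'):  # SSE events are separated by a blank line
        data_lines = [line[len('data:'):].strip()
                      for line in block.split('\n') if line.startswith('data:')]
        if data_lines:
            yield '\n'.join(data_lines)

# One buffer carrying two events with \r\n endings:
buf = b'data: {"delta": "Hel"}\r\n\r\ndata: {"delta": "lo"}\r\n\r\n'
print(list(iter_sse_payloads(buf)))  # ['{"delta": "Hel"}', '{"delta": "lo"}']
```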
What's Changed
- [Fix] hallusion_bench load data by @Yunnglin in #1092
- [Feature] Add perf SLA auto tune by @Yunnglin in #1095
- [Feature] add service async and gradio ui by @Yunnglin in #1103
- fix(streaming): Robust parsing of SSE chunks with multiple events and \r\n normalization by @amumu96 in #1102
- Add 12 NER Datasets by @penguinwang96825 in #1106
- [Benchmark] Add TORGO Dataset for Dysarthria Speech Recognition with SemScore Evaluation by @penguinwang96825 in #1107
- [Benchmark] Add RefCOCO by @mushenL in #1109
- [Fix] computation error in SemScore by @penguinwang96825 in #1110
- [Feature] Update load local jsonl by @Yunnglin in #1111
- [Fix] eval_config load by @Yunnglin in #1116
- [Benchmark] Add terminal-bench by @Yunnglin in #1114
Full Changelog: v1.4.0...v1.4.1
v1.4.0
Benchmark Datasets
- General Evaluation: Added EQ-Bench, ZebraLogicBench for reasoning and logic evaluation
- Code Evaluation: Added MultiplE-MBPP, MBPP for code capability assessment
- Speech Evaluation: Added FLEURS, LibriSpeech for speech recognition benchmarks
Feature Enhancements
- Performance Visualization: Added ClearML visualization support for performance (perf) monitoring
- Service API: Added a service API for more flexible service invocation (see the documentation)
- Lazy Model Loading: Added lazy model support to defer model loading until needed
- Retry Mechanism: Added a retry function to improve evaluation stability (a generic sketch of the pattern follows this list)
- Sandbox Optimization: Updated the sandbox with connection-pool support and MultiplE (multiple-humaneval) multilingual code evaluation
- Random Algorithm: Updated the performance-testing random algorithm for improved accuracy
- UI Enhancement: The dashboard now supports configuring HTTP request parameters
- Progress Bar: Updated the tqdm progress display
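As promised above, here is a generic sketch of the retry-with-backoff pattern that such a mechanism typically implements; it is not EvalScope's actual API.

```python
# Generic retry-with-backoff decorator (illustrative, not EvalScope's API).
import time
from functools import wraps

def with_retries(max_attempts: int = 3, backoff_s: float = 1.0):
    def decorator(fn):
        @wraps(fn)
        def wrapper(*args, **kwargs):
            for attempt in range(1, max_attempts + 1):
                try:
                    return fn(*args, **kwargs)
                except Exception:
                    if attempt == max_attempts:
                        raise  # retries exhausted; surface the error
                    time.sleep(backoff_s * attempt)  # linear backoff
        return wrapper
    return decorator

@with_retries(max_attempts=3)
def query_model(prompt: str) -> str:
    ...  # call the model endpoint; may raise on transient failures
```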
Documentation
- Updated custom VQA documentation
- Updated parameter configuration documentation
- Updated benchmarks documentation
- Updated service documentation
- Updated MTEB related links
Bug Fixes
- Fixed command-line parameter issues (--analysis-report, --dataset-dir, etc.)
- Fixed token throughput calculation at concurrency 1
- Fixed benchmark loading issues (ChartQA, TAU2, OmniDocBench, etc.)
- Fixed SWE-bench image build and MRCR leading newline support
- Fixed NLTK resource checking issues
What's Changed
- [Feature] Add perf ClearML visualization by @Yunnglin in #1032
- [Doc] update custom vqa by @Yunnglin in #1036
- [Benchmark] Add eq bench by @Yunnglin in #1037
- Feature/zebralogicbench by @nhes in #1035
- [Fix] Update tau2 by @Yunnglin in #1039
- [Feature] Add service api by @Yunnglin in #1042
- fix --analysis-report=true bug by @pumpkin12135 in #1046
- [feature] add lazy model by @Secbone in #1045
- [Doc] update parameter by @Yunnglin in #1048
- [Fix] update default work dir by @Yunnglin in #1049
- [Feature] Update perf random Algorithm by @Yunnglin in #1050
- [Feature] add retry function by @Yunnglin in #1051
- Fix --dataset-dir parameter to work correctly by @gbdjxgp in #1053
- [Fix] chartqa prompt by @Yunnglin in #1054
- [Benchmark] Add fleurs, librispeech by @Yunnglin in #1059
- [Fix] multi-if load by @Yunnglin in #1062
- [Benchmark] Add MultiplE-mbpp, MBPP by @Yunnglin in #1066
- Update mteb link by @Samoed in #1065
- UI dashboard supports HTTP params parameters by @pumpkin12135 in #1060
- [Feature] update sandbox with pool and multiple-humaneval by @Yunnglin in #1073
- check_nltk_data does not accept a parameter by @Zhaoyi-Yan in #1071
- [Fix] update check nltk resource by @Yunnglin in #1078
- [Fix] omni doc bench load by @Yunnglin in #1079
- [Doc] update benchmarks doc by @Yunnglin in #1081
- [Doc] update service and doc by @Yunnglin in #1085
- fix: Output token throughput and Total token throughput on Concurrency 1 by @cdpath in #1083
- [Fix] SWE build image by @Yunnglin in #1087
- small fixes to mrcr to support leading \n characters by @sophies-cerebras in #1086
- [Feature] Update tqdm process by @Yunnglin in #1089
New Contributors
- @nhes made their first contribution in #1035
- @pumpkin12135 made their first contribution in #1046
- @Secbone made their first contribution in #1045
- @gbdjxgp made their first contribution in #1053
- @Samoed made their first contribution in #1065
- @Zhaoyi-Yan made their first contribution in #1071
- @cdpath made their first contribution in #1083
Full Changelog: v1.3.0...v1.4.0
v1.3.0
Benchmark Datasets
- Multimodal Evaluation: Added A_OKVQA, CMMU, CMMMU, ScienceQA, V*Bench, MicroVQA and other multimodal benchmarks
- Code Evaluation: Added SWE-bench_Verified, SWE-bench_Lite, SWE-bench_Verified_mini, SciCode for code capability assessment
- General Evaluation: Added GSM8K-V, MGSM, IFBench, OpenAI MRCR and other benchmarks
Feature Enhancements
- Custom Function-Call Evaluation: Added support for custom function-call evaluation (see the documentation; an illustrative test-case shape follows this list)
- Custom VQA: Added support for custom Visual Question Answering (VQA) evaluation (see the documentation)
- Parameter Extension: Added extra_param_spec for more flexible parameter configuration
- Aggregate Scoring: Updated aggregation (agg) parameters to improve the score aggregation mechanism
- Performance Testing: Improved performance (perf) parameter configuration
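To make the custom function-call item concrete, the dictionary below sketches what one test sample typically looks like with an OpenAI-style tool schema. The field names (`messages`, `tools`, `expected`) are illustrative assumptions; see the EvalScope documentation for the exact format.

```python
# Hypothetical shape of one custom function-call test sample; field names
# are illustrative assumptions, not EvalScope's exact schema.
sample = {
    "messages": [
        {"role": "user", "content": "What's the weather in Hangzhou tomorrow?"}
    ],
    "tools": [{
        "type": "function",
        "function": {
            "name": "get_weather",
            "description": "Query the weather forecast for a city.",
            "parameters": {
                "type": "object",
                "properties": {
                    "city": {"type": "string"},
                    "date": {"type": "string", "description": "ISO date"},
                },
                "required": ["city"],
            },
        },
    }],
    # The call the model is expected to emit, used for scoring.
    "expected": {"name": "get_weather", "arguments": {"city": "Hangzhou"}},
}
```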
Documentation
- Updated eval_type related documentation
- Updated collection documentation, including support for building a custom evaluation index
Bug Fixes
- Fixed perf completion endpoint streaming issues
- Fixed error log display for judge model
- Fixed --no-test-connection parameter action issue
- Fixed error handling for function-call test cases (Issue #1005)
- Fixed model args related issues
What's Changed
- [Doc] Update doc eval_type by @Yunnglin in #970
- [Benchmark] Add A_OKVQA, CMMU, ScienceQ, V*Bench by @mushenL in #973
- [Benchmark] Add SWE-bench_Verified, SWE-bench_Lite, SWE-bench_Verified_mini by @Yunnglin in #976
- [Feature] Add custom function-call eval by @Yunnglin in #982
- [Fix] perf completion endpoint streaming by @Yunnglin in #983
- [Fix] fix error log of judge model by @Yunnglin in #986
- add openai mrcr by @sophies-cerebras in #987
- [Feature] Add extra param spec by @Yunnglin in #990
- Add gsm8k_v,mgsm and micro_vqa benchmarks by @mushenL in #995
- fix: fix --no-test-connection args action by @ljwh in #999
- Update collection doc by @Yunnglin in #997
- [Benchmark] Add IFBench by @Yunnglin in #1001
- Resolve Issue #1005: handle errors in function-call test cases by @hougedengwo in #1007
- [Benchmark] Add SciCode by @Yunnglin in #1011
- [Fix] update perf args by @Yunnglin in #1013
- [Fix] model args by @Yunnglin in #1014
- [Feature] update agg args by @Yunnglin in #1016
- [Feature] Add custom VQA by @Yunnglin in #1019
- [Benchmark] add CMMMU by @Yunnglin in #1020
New Contributors
- @ljwh made their first contribution in #999
- @hougedengwo made their first contribution in #1007
Full Changelog: v1.2.0...v1.3.0
v1.2.0
Benchmark Datasets
- Added multiple MCQA (Multiple Choice Question Answering) datasets
- Added Drivelology benchmark
- Updated BFCL-v3 and added support for BFCL-v4 benchmark
- Updated tau-bench and added support for tau2-bench
- Added support for WMT machine translation evaluation and related metrics
Feature Enhancements
- Optimized the answer extraction mechanism, making extraction more explicit and controllable
- Added support for batch metric computation, e.g. BERTScore
- Updated aggregate scoring with new metric aggregations including pass@k, vote@k, and pass^k (an illustrative pass@k estimator follows this list)
- Updated OpenAI API parameters, improving API call configuration
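pass@k is commonly computed with the unbiased estimator from Chen et al. (2021): with n samples generated and c of them passing, pass@k = 1 - C(n-c, k) / C(n, k). A minimal sketch follows (EvalScope's aggregation may differ in details):

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator: n samples generated, c passed, budget k."""
    if n - c < k:
        return 1.0  # every size-k subset contains at least one passing sample
    return 1.0 - comb(n - c, k) / comb(n, k)

# 10 samples, 3 passing: pass@1 = 0.3, pass@5 ~= 0.917
print(pass_at_k(10, 3, 1), pass_at_k(10, 3, 5))
```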
Data Source Updates
- Updated SimpleQA data source - using the latest SimpleQA data
- Aligned AIME to AA standard - unified evaluation standards
- Updated MMLU-Pro - using the latest MMLU-Pro data
Bug Fixes
- Fixed the issue with DROP dataset when few_shot_num=3
- Fixed buffer decoding errors
What's Changed
- [Benchmark] Add MCQA datasets by @penguinwang96825 in #923
- [Benchmark] Add Drivelology benchmark by @penguinwang96825 in #927
- [Benchmark] Add more MCQA datasets by @penguinwang96825 in #928
- [Benchmark] Add BFCL-v4 by @Yunnglin in #934
- fix (DROP: allow few_shot_num=3 in dataset args; previously few_shot_num=3 would…) by @yuhuan0311 in #940
- [Fix] update DROP metric by @Yunnglin in #941
- [Feature] Update Bertscore for DrivelologyNarrativeWriting by @Yunnglin in #935
- Update SimpleQA source by @Yunnglin in #948
- [Feature] Update OpenAI API parameter by @Yunnglin in #949
- [Fix] decode buffer error by @Yunnglin in #954
- feat: update WMT adapters and related metrics by @Epsilon617 in #938
- [Benchmark] Update tau-bench and tau2-bench by @Yunnglin in #959
- [Fix] Update mmlu-pro by @Yunnglin in #960
- [Doc] Fixed the configuration error in BFCL-v4 documentation example (#962) by @Tsumugii24 in #963
- align aime to AA by @sophies-cerebras in #965
- [ADD] Implement metric aggregation pass@k and vote@k #387 by @xin8coder in #964
- [Feature] make extract answer explicit by @Yunnglin in #966
- [Feature] Update aggregate_scores by @Yunnglin in #967
New Contributors
- @yuhuan0311 made their first contribution in #940
- @Epsilon617 made their first contribution in #938
- @Tsumugii24 made their first contribution in #963
- @xin8coder made their first contribution in #964
Full Changelog: v1.1.1...v1.2.0
v1.1.1
Updates
- Benchmark Extensions
- Vision/Multimodal Evaluation: HallusionBench, POPE, PolyMath, MathVerse, MathVision, SimpleVQA, SeedBench2_plus
- Document Understanding: OmniDocBench
- NLP Tasks: CoNLL2003, NER Task Collection (9 tasks), AA-LCR
- Logic Reasoning: VisuLogic, ZeroBench
- Feature Enhancements
- Optimized perf functionality to achieve results comparable to vLLM benchmarking; see the documentation (a programmatic sketch follows this list)
- Enhanced sandbox usage in code evaluation, supporting both local and remote execution modes; see the documentation
- Performance and Stability Improvements
- Fixed prompt tokens calculation issues in datasets
- Added heartbeat detection mechanism during evaluation process
- Fixed GSM8K accuracy calculation and enhanced logging
- System Requirements Update
- Python Version Requirement: Upgraded to ≥3.10 (no dependency updates)
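Following up on the perf item above, a programmatic run against an OpenAI-compatible endpoint looks roughly like the sketch below; the argument names follow the perf documentation at the time of writing, so verify them with `evalscope perf --help` for your installed version.

```python
# Sketch of a programmatic perf run; argument names are taken from the
# EvalScope perf docs and may change between versions -- verify locally.
from evalscope.perf.main import run_perf_benchmark

task_cfg = {
    'url': 'http://127.0.0.1:8000/v1/chat/completions',
    'api': 'openai',
    'model': 'qwen2.5',
    'dataset': 'openqa',
    'parallel': 8,   # concurrent requests
    'number': 100,   # total requests to send
    'stream': True,
}
run_perf_benchmark(task_cfg)
```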
What's Changed
- Datasets: prompt tokens count bug fixed by @Aktsvigun in #873
- [Benchmark] Add HallusionBench and POPE by @Yunnglin in #875
- [Feature] Add inflight process by @Yunnglin in #880
- [Benchmark] Add PloyMath by @Yunnglin in #882
- add math_verse math_vision simple_vqa by @mushenL in #881
- fix: update Python version requirement to >=3.10 by @nowang6 in #890
- [Feature] Update perf throughput by @Yunnglin in #894
- [Feature] Add extra query by @Yunnglin in #895
- add AA-LCR benchmark to evalscope by @sophies-cerebras in #897
- [feature] add --visualizer parameter instead of --XXX_api_key in stress test by @ShaohonChen in #878
- [Feature] Add sandbox doc by @Yunnglin in #899
- fix gsm8k acc and add more log by @ms-cs in #903
- [Doc] Update writing by @Yunnglin in #904
- [Benchmark] Add OmniDocBench by @Yunnglin in #908
- [Benchmark] Add CoNLL2003 benchmark by @penguinwang96825 in #912
- add seed_bench_2_plus,visu_logic_adapter,zerobench by @mushenL in #916
- [Benchmark] Add NER suite by @penguinwang96825 in #921
- [Feature] Add pred heartbeat by @ms-cs in #922
New Contributors
- @Aktsvigun made their first contribution in #873
- @nowang6 made their first contribution in #890
- @sophies-cerebras made their first contribution in #897
- @ms-cs made their first contribution in #903
- @penguinwang96825 made their first contribution in #912
Full Changelog: v1.1.0...v1.1.1
v1.1.0
Update
- The platform now supports OCRBench, OCRBench-v2, DocVQA, InfoVQA, ChartQA, BLINK, and other multimodal evaluation benchmarks. For a comprehensive list of supported datasets, please refer to the documentation.
- Developed best practice guidelines for evaluating models with Qwen3-Omni and Qwen3-VL.
- Installation via pyproject.toml is now supported.
What's Changed
- [Doc] Add qwen omni doc by @Yunnglin in #854
- [Fix] Fix bfcl_v3 validation by @Yunnglin in #858
- [Feature] Add pyproject.toml by @Yunnglin in #857
- [Benchmark] Add ChartQA and BLINK by @Yunnglin in #861
- [Benchmark] Add DocVQA and InfoVQA by @Yunnglin in #862
- [Fix] transformers import by @Yunnglin in #865
- [Benchmark] Add OCRBench and OCRBench-v2 by @Yunnglin in #869
- [Fix] None string error by @Yunnglin in #871
Full Changelog: v1.0.2...v1.1.0
v1.0.2
New Features
- Code evaluation benchmarks (HumanEval, LiveCodeBench) now support execution in a sandbox environment; to use this feature, first install ms-enclave (a minimal sketch follows this list).
- Added support for various image-text multimodal evaluation benchmarks such as RealWorldQA, AI2D, MMStar, MMBench, OmniBench, as well as pure text evaluation benchmarks like Multi-IF, HealthBench, and AMC.
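As a minimal sketch of the sandboxed code-evaluation flow: install ms-enclave, then run a code benchmark through the standard entry points. The `use_sandbox` flag shown is hypothetical; check the documentation for the actual switch.

```python
# Minimal sketch: run a code benchmark after `pip install ms-enclave`.
# The `use_sandbox` flag below is hypothetical -- consult the docs for
# the real sandbox configuration option.
from evalscope import TaskConfig, run_task

task_cfg = TaskConfig(
    model='Qwen/Qwen2.5-Coder-7B-Instruct',
    datasets=['humaneval'],
    # use_sandbox=True,  # hypothetical name; see documentation
)
run_task(task_cfg=task_cfg)
```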
What's Changed
- [Benchmark] add Multi-IF by @Yunnglin in #822
- Add ai2d_adapter and real_world_qa_adapter by @mushenL in #824
- [Benchmark] Add health bench by @Yunnglin in #826
- fix: make _temp_run top-level to resolve M1 pickle error by @MemoryIt in #827
- [Fix] vlm tokenize by @Yunnglin in #829
- [Doc] update qwen next doc by @Yunnglin in #832
- [Fix] fix bfcl-v3 score by @Yunnglin in #833
- [Benchmark] Add MMBench and MMStar by @mushenL in #834
- [Benchmark] Add Omnibench by @Yunnglin in #837
- [Fix] Fix bfcl validation error by @Yunnglin in #838
- [Feature] add docker sandbox by @Yunnglin in #835
- [Fix] Fix thread pool error by @Yunnglin in #841
- [Benchmark] Add amc23 and OlympiadBench by @mushenL in #840
- [Benchmark] add minerva-math by @Yunnglin in #846
Full Changelog: v1.0.1...v1.0.2
v1.0.1
Update
- Evaluation tasks for vision-language multimodal large models are now supported, including MathVista and MMMU. For the full list of supported datasets, see the documentation.
- Image editing task evaluation is now supported, with the GEdit-Bench evaluation benchmark available. For usage instructions, see the documentation.
- The core dependency on torch has been removed; it is now an optional dependency under the rag and aigc extras.
What's Changed
- [DOC] Update 1.0 custom doc by @Yunnglin in #793
- [Fix] Fix reasoning content by @Yunnglin in #797
- [Fix] Change old collection to new version by @Yunnglin in #798
- Reduce dataset loading time by @mmdbhs in #805
- [Fix] fix reranker pad token and embedding max tokens by @Yunnglin in #806
- [Feature] Add image edit task by @Yunnglin in #804
- [Benchmark] Add mmmu by @Yunnglin in #812
- add math_vista by @mushenL in #813
- [Fix] tau-bench zero scores by @Yunnglin in #814
- [Fix] collection eval by @Yunnglin in #816
- [Feature] add vlm adapter by @Yunnglin in #817
- [Feature] remove torch from framework by @Yunnglin in #818
- add MMMU_Pro by @mushenL in #819
Full Changelog: v1.0.0...v1.0.1
v1.0.0
New version
Version 1.0 introduces a major overhaul of the evaluation framework, establishing a new, more modular and extensible API layer under evalscope/api. Key improvements include standardized data models for benchmarks, samples, and results; a registry-based design for components such as benchmarks and metrics; and a rewritten core evaluator that orchestrates the new architecture. Existing benchmark adapters have been migrated to this API, resulting in cleaner, more consistent, and easier-to-maintain implementations. For incompatible (breaking) changes, please refer to the documentation.
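As a generic illustration of the registry-based design mentioned above (the actual decorators and base classes live under evalscope/api and differ in detail), component registration typically looks like this:

```python
# Generic registry pattern (illustrative only; not EvalScope's actual API).
_BENCHMARKS: dict[str, type] = {}

def register_benchmark(name: str):
    """Class decorator that records a benchmark adapter under a name."""
    def wrap(cls):
        _BENCHMARKS[name] = cls
        return cls
    return wrap

@register_benchmark('my_bench')
class MyBenchAdapter:
    def load(self):
        ...  # return standardized samples

    def score(self, prediction, reference):
        ...  # return a standardized result

# The evaluator looks adapters up by name instead of importing them directly:
adapter = _BENCHMARKS['my_bench']()
```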
What's Changed
- [Feature] Add image edit evaluation by @Yunnglin in #725
- [Doc] add tau-bench doc by @Yunnglin in #730
- [Fix] ragas local model by @Yunnglin in #732
- [Doc] Add qwen-code best practice doc by @Yunnglin in #734
- Fix: Incorrect keyword argument in call to csv_to_list() by @Zhuzhenghao in #745
- Add SECURITY.md by @wangxingjun778 in #750
- Update SECURITY.md by @wangxingjun778 in #752
- update faq file by @mushenL in #744
- [Refactor] v1.0 by @Yunnglin in #739
New Contributors
- @Zhuzhenghao made their first contribution in #745
Full Changelog: v0.17.1...v1.0.0
v0.17.1
New Features
- Model stress testing now supports randomly generated image-text data for multimodal stress testing; for usage instructions, see the documentation (an illustrative payload sketch follows this list).
- Support for τ-bench has been added, enabling evaluation of AI agent performance and reliability in realistic environments with dynamic user and tool interactions; for usage instructions, see the documentation.
- Support for "Humanity's Last Exam", a high-difficulty evaluation benchmark, has been added; for usage instructions, see the documentation.
What's Changed
- [Feat] add perf sleep interval by @Yunnglin in #699
- [Benchmark] Add HLE by @Yunnglin in #705
- [Benchmark] Add tau-bench by @Yunnglin in #711
- [Feature] Update perf random generation by @Yunnglin in #713
- [Fix] Eval parser: humaneval, mmlu by @Yunnglin in #718
Full Changelog: v0.17.0...v0.17.1