Releases: modelscope/evalscope

v1.4.1

05 Jan 07:28

Benchmark Datasets

  • Named Entity Recognition: Added 12 NER datasets
  • Speech Recognition: Added the TORGO dataset for dysarthric speech recognition, evaluated with SemScore (see the sketch after this list)
  • Multimodal Evaluation: Added RefCOCO referring expression comprehension benchmark
  • Code Evaluation: Added Terminal-bench for terminal command capability assessment
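
SemScore, as commonly defined (Aynetdinov & Akbik, 2024), is the cosine similarity between sentence-embedding vectors of the reference and the hypothesis. The snippet below is a minimal sketch of that general idea using the sentence-transformers model from the SemScore paper; how evalscope wires this metric into the TORGO benchmark may differ.

    # Illustrative only: SemScore as embedding cosine similarity between a reference
    # transcript and a hypothesis. The model name is the one used in the SemScore
    # paper; evalscope's internal implementation may differ.
    from sentence_transformers import SentenceTransformer, util

    model = SentenceTransformer("all-mpnet-base-v2")

    reference = "the quick brown fox jumps over the lazy dog"
    hypothesis = "a quick brown fox jumped over the lazy dog"

    emb = model.encode([reference, hypothesis], convert_to_tensor=True)
    sem_score = util.cos_sim(emb[0], emb[1]).item()  # in [-1, 1], higher is better
    print(f"SemScore: {sem_score:.3f}")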

Feature Enhancements

  • Performance Testing: Added SLA auto-tuning to the performance testing (perf) workflow
  • Service Mode: Added asynchronous service support and a Gradio UI
  • Data Loading: Optimized local JSONL dataset loading (see the sketch below)
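
For context, a local dataset is usually wired in through dataset_args on a TaskConfig. The sketch below follows the custom-dataset documentation; the general_qa adapter name and the local_path/subset_list keys are taken from those docs and may vary by version, and the sketch is not specific to the optimization shipped in this release.

    # A minimal sketch of evaluating against a local JSONL dataset.
    # Key names under dataset_args follow the custom-dataset docs and may vary by version.
    from evalscope import TaskConfig, run_task

    task_cfg = TaskConfig(
        model="Qwen/Qwen2.5-7B-Instruct",   # illustrative model id
        datasets=["general_qa"],            # assumption: the general_qa adapter for custom QA data
        dataset_args={
            "general_qa": {
                "local_path": "data/my_eval",  # directory containing the local JSONL file(s)
                "subset_list": ["my_subset"],  # assumption: maps to my_subset.jsonl under local_path
            }
        },
        limit=50,                           # evaluate only the first 50 samples
    )

    run_task(task_cfg=task_cfg)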

Bug Fixes

  • Fixed HallusionBench data loading issues
  • Fixed SSE chunk handling in streaming response parsing
  • Fixed SemScore computation errors
  • Fixed eval_config loading issues

Full Changelog: v1.4.0...v1.4.1

v1.4.0

16 Dec 09:17

Benchmark Datasets

  • General Evaluation: Added EQ-Bench, ZebraLogicBench for reasoning and logic evaluation
  • Code Evaluation: Added MultiplE-MBPP, MBPP for code capability assessment
  • Speech Evaluation: Added FLEURS, LibriSpeech for speech recognition benchmarks

Feature Enhancements

  • Performance Visualization: Added ClearML visualization support for performance testing (perf) monitoring (see the sketch after this list)
  • Service API: Added service API functionality for more flexible service invocation
  • Lazy Model Loading: Added lazy model support to optimize the model loading mechanism
  • Retry Mechanism: Added a retry function to improve evaluation stability
  • Sandbox Optimization: Updated the sandbox with connection pool support and multiple-humaneval evaluation
  • Random Algorithm: Updated the random algorithm used in performance testing for improved accuracy
  • UI Enhancement: Dashboard now supports HTTP params configuration
  • Progress Bar: Updated tqdm progress display mechanism
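
As a reference point for the perf items above, the sketch below shows a minimal performance-test run through evalscope's documented Python entry point. Parameter names follow the perf documentation and may vary by version; the new ClearML reporting option is not shown because its exact flag name is not given in these notes.

    # A minimal perf-run sketch against an OpenAI-compatible endpoint.
    # Parameter names follow the perf documentation and may vary by version.
    from evalscope.perf.main import run_perf_benchmark

    task_cfg = {
        "url": "http://127.0.0.1:8000/v1/chat/completions",  # illustrative endpoint
        "api": "openai",
        "model": "qwen2.5-7b-instruct",
        "dataset": "openqa",   # built-in prompt source used for load generation
        "parallel": 8,         # concurrent requests
        "number": 100,         # total requests to send
        "stream": True,
    }

    run_perf_benchmark(task_cfg)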

Documentation

  • Updated custom VQA documentation
  • Updated parameter configuration documentation
  • Updated benchmarks documentation
  • Updated service documentation
  • Updated MTEB related links

Bug Fixes

  • Fixed command-line parameter issues (--analysis-report, --dataset-dir, etc.)
  • Fixed token throughput calculation at concurrency 1
  • Fixed benchmark loading issues (ChartQA, TAU2, OmniDocBench, etc.)
  • Fixed issues with SWE-bench image building and MRCR leading-newline support
  • Fixed NLTK resource checking issues

Full Changelog: v1.3.0...v1.4.0

v1.3.0

28 Nov 07:51

Benchmark Datasets

  • Multimodal Evaluation: Added A_OKVQA, CMMU, CMMMU, ScienceQA, V*Bench, MicroVQA and other multimodal benchmarks
  • Code Evaluation: Added SWE-bench_Verified, SWE-bench_Lite, SWE-bench_Verified_mini, SciCode for code capability assessment
  • General Evaluation: Added GSM8K-V, MGSM, IFBench, OpenAI MRCR and other benchmarks

Feature Enhancements

  • Custom Evaluation: Added support for custom function-call evaluation
  • Custom VQA: Added support for custom Visual Question Answering (VQA) evaluation
  • Parameter Extension: Added extra_param_spec functionality for more flexible parameter configuration
  • Aggregate Scoring: Updated the aggregation (agg) parameters to improve the scoring aggregation mechanism
  • Performance Testing: Optimized performance testing (perf) parameter configuration

Documentation

  • Updated eval_type related documentation
  • Updated collection documentation, including building a custom evaluation index

Bug Fixes

  • Fixed perf completion endpoint streaming issues
  • Fixed error log display for judge model
  • Fixed --no-test-connection parameter action issue
  • Fixed error handling for function-call test cases (Issue #1005)
  • Fixed model args related issues

Full Changelog: v1.2.0...v1.3.0

v1.2.0

11 Nov 04:58

Benchmark Datasets

  • Added multiple MCQA (Multiple Choice Question Answering) datasets
  • Added Drivelology benchmark
  • Updated BFCL-v3 and added support for BFCL-v4 benchmark
  • Updated tau-bench and added support for tau2-bench
  • Added support for WMT machine translation evaluation and related metrics

Feature Enhancements

  • Optimized the answer extraction mechanism, making it more explicit and controllable
  • Added support for batch metric computation, such as BERTScore
  • Updated aggregate scoring functionality, adding metric aggregations including pass@k, vote@k, and pass^k (see the sketch after this list)
  • Updated OpenAI API parameters - optimized API call parameter configuration
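
For reference, pass@k is commonly computed with the unbiased estimator from the HumanEval paper (Chen et al., 2021). The snippet below shows that standard formula; it is not necessarily evalscope's exact implementation.

    # Standard unbiased pass@k estimator (Chen et al., 2021), shown for reference.
    from math import comb

    def pass_at_k(n: int, c: int, k: int) -> float:
        """n = total samples generated, c = number that passed, k = budget."""
        if n - c < k:
            return 1.0
        # 1 - C(n - c, k) / C(n, k)
        return 1.0 - comb(n - c, k) / comb(n, k)

    # Example: 10 samples per problem, 3 correct, estimate pass@5
    print(pass_at_k(n=10, c=3, k=5))  # ~0.917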

Data Source Updates

  • Updated SimpleQA data source - using the latest SimpleQA data
  • Aligned AIME to AA standard - unified evaluation standards
  • Updated MMLU-Pro - using the latest MMLU-Pro data

Bug Fixes

  • Fixed the issue with DROP dataset when few_shot_num=3
  • Fixed decode-buffer related errors

Full Changelog: v1.1.1...v1.2.0

v1.1.1

27 Oct 09:11

Updates

  1. Benchmark Extensions
  • Vision/Multimodal Evaluation: HallusionBench, POPE, PolyMath, MathVerse, MathVision, SimpleVQA, SeedBench2_plus
  • Document Understanding: OmniDocBench
  • NLP Tasks: CoNLL2003, NER Task Collection (9 tasks), AA-LCR
  • Logic Reasoning: VisuLogic, ZeroBench
  2. Feature Enhancements
  • Optimized perf functionality to achieve results comparable to vLLM benchmarking; see the documentation
  • Enhanced sandbox environment support in code evaluation, with both local and remote execution modes for better code-execution security and flexibility; see the documentation
  3. Performance and Stability Improvements
  • Fixed prompt tokens calculation issues in datasets
  • Added a heartbeat detection mechanism during evaluation
  • Fixed GSM8K accuracy calculation and enhanced logging
  4. System Requirements Update
  • Python Version Requirement: Raised to ≥3.10 (no dependency updates)

Full Changelog: v1.1.0...v1.1.1

v1.1.0

14 Oct 09:20

Update

  • The platform now supports OCRBench, OCRBench-v2, DocVQA, InfoVQA, ChartQA, BLINK, and other multimodal evaluation benchmarks. For a comprehensive list of supported datasets, please refer to the documentation.
  • Added best-practice guides for evaluating the Qwen3-Omni and Qwen3-VL models.
  • Installation via pyproject.toml is now supported.

Full Changelog: v1.0.2...v1.1.0

v1.0.2

23 Sep 09:30

New Features

  • Code evaluation benchmarks (HumanEval, LiveCodeBench) now support execution in a sandbox environment. To utilize this feature, you must first install ms-enclave.
  • Added support for various image-text multimodal evaluation benchmarks such as RealWorldQA, AI2D, MMStar, MMBench, OmniBench, as well as pure text evaluation benchmarks like Multi-IF, HealthBench, and AMC.

Full Changelog: v1.0.1...v1.0.2

v1.0.1

05 Sep 09:11

Update

  • Evaluation tasks for vision-language multimodal large models are now supported, including MathVista and MMMU. For more information on the supported datasets, please refer to the documentation.
  • Image editing task evaluation is now supported, with the GEdit-Bench evaluation benchmark available. For usage instructions, please refer to the documentation.
  • The core dependency on torch has been removed and is now an optional dependency under rag and aigc.

Full Changelog: v1.0.0...v1.0.1

v1.0.0

25 Aug 06:50

New version

Version 1.0 introduces a major overhaul of the evaluation framework, establishing a new, more modular and extensible API layer under evalscope/api. Key improvements include standardized data models for benchmarks, samples, and results; a registry-based design for components such as benchmarks and metrics; and a rewritten core evaluator that orchestrates the new architecture. Existing benchmark adapters have been migrated to this API, resulting in cleaner, more consistent, and easier-to-maintain implementations.

For incompatible (breaking) changes, please refer to the documentation.
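
The registry-based design can be pictured with a small, purely illustrative sketch. The decorator, class, and field names below (register_benchmark, Sample, MyBenchmarkAdapter) are hypothetical stand-ins rather than evalscope's actual identifiers; the real interfaces live under evalscope/api.

    # Purely illustrative: a registry-based benchmark registration in the spirit of
    # the new API layer. All names below are hypothetical, not evalscope's API.
    from dataclasses import dataclass, field

    _BENCHMARK_REGISTRY: dict[str, type] = {}

    def register_benchmark(name: str):
        """Decorator that records a benchmark adapter class under a string key."""
        def wrapper(cls):
            _BENCHMARK_REGISTRY[name] = cls
            return cls
        return wrapper

    @dataclass
    class Sample:
        """Standardized data model for a single evaluation sample."""
        input: str
        target: str
        metadata: dict = field(default_factory=dict)

    @register_benchmark("my_benchmark")
    class MyBenchmarkAdapter:
        def load(self) -> list[Sample]:
            return [Sample(input="2 + 2 = ?", target="4")]

        def score(self, prediction: str, sample: Sample) -> float:
            return float(prediction.strip() == sample.target)

    # The evaluator can then look adapters up by name:
    adapter = _BENCHMARK_REGISTRY["my_benchmark"]()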

Full Changelog: v0.17.1...v1.0.0

v0.17.1

21 Jul 02:10

New Features

  • Model stress testing now supports randomly generated image-text data for stress testing multimodal models. For usage instructions, see the documentation.
  • Added support for τ-bench, enabling evaluation of AI agent performance and reliability in realistic environments with dynamic user and tool interactions. For usage instructions, see the documentation.
  • Added support for "Humanity's Last Exam", a high-difficulty evaluation benchmark. For usage instructions, see the documentation.

Full Changelog: v0.17.0...v0.17.1