Releases: modelscope/evalscope

v1.4.1

05 Jan 07:28

Benchmark Datasets

  • Named Entity Recognition: Added 12 NER datasets
  • Speech Recognition: Added the TORGO dataset for dysarthric speech recognition, evaluated with SemScore (see the sketch after this list)
  • Multimodal Evaluation: Added RefCOCO referring expression comprehension benchmark
  • Code Evaluation: Added Terminal-bench for terminal command capability assessment
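
SemScore, as commonly defined (Aynetdinov & Akbik, 2024), is the cosine similarity between sentence-embedding vectors of the reference and the hypothesis. The snippet below is a minimal sketch of that general idea using the sentence-transformers model from the SemScore paper; how evalscope wires this metric into the TORGO benchmark may differ.

    # Illustrative only: SemScore as embedding cosine similarity between a reference
    # transcript and a hypothesis. The model name is the one used in the SemScore
    # paper; evalscope's internal implementation may differ.
    from sentence_transformers import SentenceTransformer, util

    model = SentenceTransformer("all-mpnet-base-v2")

    reference = "the quick brown fox jumps over the lazy dog"
    hypothesis = "a quick brown fox jumped over the lazy dog"

    emb = model.encode([reference, hypothesis], convert_to_tensor=True)
    sem_score = util.cos_sim(emb[0], emb[1]).item()  # in [-1, 1], higher is better
    print(f"SemScore: {sem_score:.3f}")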

Feature Enhancements

  • Performance Testing: Added SLA auto-tuning to the performance testing (perf) workflow
  • Service Mode: Added asynchronous service support and a Gradio UI
  • Data Loading: Optimized local JSONL dataset loading (see the sketch below)
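
For context, a local dataset is usually wired in through dataset_args on a TaskConfig. The sketch below follows the custom-dataset documentation; the general_qa adapter name and the local_path/subset_list keys are taken from those docs and may vary by version, and the sketch is not specific to the optimization shipped in this release.

    # A minimal sketch of evaluating against a local JSONL dataset.
    # Key names under dataset_args follow the custom-dataset docs and may vary by version.
    from evalscope import TaskConfig, run_task

    task_cfg = TaskConfig(
        model="Qwen/Qwen2.5-7B-Instruct",   # illustrative model id
        datasets=["general_qa"],            # assumption: the general_qa adapter for custom QA data
        dataset_args={
            "general_qa": {
                "local_path": "data/my_eval",  # directory containing the local JSONL file(s)
                "subset_list": ["my_subset"],  # assumption: maps to my_subset.jsonl under local_path
            }
        },
        limit=50,                           # evaluate only the first 50 samples
    )

    run_task(task_cfg=task_cfg)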

Bug Fixes

  • Fixed HallusionBench data loading issues
  • Fixed SSE chunk handling in streaming response parsing
  • Fixed SemScore computation errors
  • Fixed eval_config loading issues

Full Changelog: v1.4.0...v1.4.1

v1.4.0

16 Dec 09:17

Benchmark Datasets

  • General Evaluation: Added EQ-Bench, ZebraLogicBench for reasoning and logic evaluation
  • Code Evaluation: Added MultiplE-MBPP, MBPP for code capability assessment
  • Speech Evaluation: Added FLEURS, LibriSpeech for speech recognition benchmarks

Feature Enhancements

  • Performance Visualization: Added ClearML visualization support for performance testing (perf) monitoring (see the sketch after this list)
  • Service API: Added service API functionality for more flexible service invocation
  • Lazy Model Loading: Added lazy model support to optimize the model loading mechanism
  • Retry Mechanism: Added a retry function to improve evaluation stability
  • Sandbox Optimization: Updated the sandbox with connection pool support and multiple-humaneval evaluation
  • Random Algorithm: Updated the random algorithm used in performance testing for improved accuracy
  • UI Enhancement: Dashboard now supports HTTP params configuration
  • Progress Bar: Updated tqdm progress display mechanism
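
As a reference point for the perf items above, the sketch below shows a minimal performance-test run through evalscope's documented Python entry point. Parameter names follow the perf documentation and may vary by version; the new ClearML reporting option is not shown because its exact flag name is not given in these notes.

    # A minimal perf-run sketch against an OpenAI-compatible endpoint.
    # Parameter names follow the perf documentation and may vary by version.
    from evalscope.perf.main import run_perf_benchmark

    task_cfg = {
        "url": "http://127.0.0.1:8000/v1/chat/completions",  # illustrative endpoint
        "api": "openai",
        "model": "qwen2.5-7b-instruct",
        "dataset": "openqa",   # built-in prompt source used for load generation
        "parallel": 8,         # concurrent requests
        "number": 100,         # total requests to send
        "stream": True,
    }

    run_perf_benchmark(task_cfg)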

Documentation

  • Updated custom VQA documentation
  • Updated parameter configuration documentation
  • Updated benchmarks documentation
  • Updated service documentation
  • Updated MTEB related links

Bug Fixes

  • Fixed command-line parameter issues (--analysis-report, --dataset-dir, etc.)
  • Fixed token throughput calculation at concurrency 1
  • Fixed benchmark loading issues (ChartQA, TAU2, OmniDocBench, etc.)
  • Fixed issues with SWE-bench image building and MRCR leading-newline support
  • Fixed NLTK resource checking issues

Full Changelog: v1.3.0...v1.4.0

v1.3.0

28 Nov 07:51

Benchmark Datasets

  • Multimodal Evaluation: Added A_OKVQA, CMMU, CMMMU, ScienceQA, V*Bench, MicroVQA and other multimodal benchmarks
  • Code Evaluation: Added SWE-bench_Verified, SWE-bench_Lite, SWE-bench_Verified_mini, SciCode for code capability assessment
  • General Evaluation: Added GSM8K-V, MGSM, IFBench, OpenAI MRCR and other benchmarks

Feature Enhancements

  • Custom Evaluation: Added support for custom function-call evaluation
  • Custom VQA: Added support for custom Visual Question Answering (VQA) evaluation
  • Parameter Extension: Added extra_param_spec functionality for more flexible parameter configuration
  • Aggregate Scoring: Updated the aggregation (agg) parameters to improve the scoring aggregation mechanism
  • Performance Testing: Optimized performance testing (perf) parameter configuration

Documentation

  • Updated eval_type related documentation
  • Updated collection documentation, including building a custom evaluation index

Bug Fixes

  • Fixed perf completion endpoint streaming issues
  • Fixed error log display for judge model
  • Fixed --no-test-connection parameter action issue
  • Fixed error handling for function-call test cases (Issue #1005)
  • Fixed model args related issues

Full Changelog: v1.2.0...v1.3.0

v1.2.0

11 Nov 04:58

Benchmark Datasets

  • Added multiple MCQA (Multiple Choice Question Answering) datasets
  • Added Drivelology benchmark
  • Updated BFCL-v3 and added support for BFCL-v4 benchmark
  • Updated tau-bench and added support for tau2-bench
  • Added support for WMT machine translation evaluation and related metrics

Feature Enhancements

  • Optimized the answer extraction mechanism, making it more explicit and controllable
  • Added support for batch metric computation, such as BERTScore
  • Updated aggregate scoring functionality, adding metric aggregations including pass@k, vote@k, and pass^k (see the sketch after this list)
  • Updated OpenAI API parameters - optimized API call parameter configuration
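
For reference, pass@k is commonly computed with the unbiased estimator from the HumanEval paper (Chen et al., 2021). The snippet below shows that standard formula; it is not necessarily evalscope's exact implementation.

    # Standard unbiased pass@k estimator (Chen et al., 2021), shown for reference.
    from math import comb

    def pass_at_k(n: int, c: int, k: int) -> float:
        """n = total samples generated, c = number that passed, k = budget."""
        if n - c < k:
            return 1.0
        # 1 - C(n - c, k) / C(n, k)
        return 1.0 - comb(n - c, k) / comb(n, k)

    # Example: 10 samples per problem, 3 correct, estimate pass@5
    print(pass_at_k(n=10, c=3, k=5))  # ~0.917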

Data Source Updates

  • Updated SimpleQA data source - using the latest SimpleQA data
  • Aligned AIME to AA standard - unified evaluation standards
  • Updated MMLU-Pro - using the latest MMLU-Pro data

Bug Fixes

  • Fixed the issue with DROP dataset when few_shot_num=3
  • Fixed decode-buffer related errors

Full Changelog: v1.1.1...v1.2.0

v1.1.1

27 Oct 09:11

Updates

  1. Benchmark Extensions
  • Vision/Multimodal Evaluation: HallusionBench, POPE, PolyMath, MathVerse, MathVision, SimpleVQA, SeedBench2_plus
  • Document Understanding: OmniDocBench
  • NLP Tasks: CoNLL2003, NER Task Collection (9 tasks), AA-LCR
  • Logic Reasoning: VisuLogic, ZeroBench
  2. Feature Enhancements
  • Optimized perf functionality to achieve results comparable to vLLM benchmarking; see the documentation
  • Enhanced sandbox environment support in code evaluation, with both local and remote execution modes for better code-execution security and flexibility; see the documentation
  3. Performance and Stability Improvements
  • Fixed prompt tokens calculation issues in datasets
  • Added a heartbeat detection mechanism during evaluation
  • Fixed GSM8K accuracy calculation and enhanced logging
  4. System Requirements Update
  • Python Version Requirement: Raised to ≥3.10 (no dependency updates)

Full Changelog: v1.1.0...v1.1.1

v1.1.0

14 Oct 09:20

Update

  • The platform now supports OCRBench, OCRBench-v2, DocVQA, InfoVQA, ChartQA, BLINK, and other multimodal evaluation benchmarks. For a comprehensive list of supported datasets, please refer to the documentation.
  • Added best-practice guides for evaluating the Qwen3-Omni and Qwen3-VL models.
  • Installation via pyproject.toml is now supported.

Full Changelog: v1.0.2...v1.1.0

v1.0.2

23 Sep 09:30

New Features

  • Code evaluation benchmarks (HumanEval, LiveCodeBench) now support execution in a sandbox environment. To utilize this feature, you must first install ms-enclave.
  • Added support for various image-text multimodal evaluation benchmarks such as RealWorldQA, AI2D, MMStar, MMBench, OmniBench, as well as pure text evaluation benchmarks like Multi-IF, HealthBench, and AMC.

Full Changelog: v1.0.1...v1.0.2

v1.0.1

05 Sep 09:11

Update

  • Evaluation tasks for vision-language multimodal large models are now supported, including MathVista and MMMU. For more information on the supported datasets, please refer to the documentation.
  • Image editing task evaluation is now supported, with the GEdit-Bench evaluation benchmark available. For usage instructions, please refer to the documentation.
  • The core dependency on torch has been removed and is now an optional dependency under rag and aigc.

Full Changelog: v1.0.0...v1.0.1

v1.0.0

25 Aug 06:50

New version

Version 1.0 introduces a major overhaul of the evaluation framework, establishing a new, more modular and extensible API layer under evalscope/api. Key improvements include standardized data models for benchmarks, samples, and results; a registry-based design for components such as benchmarks and metrics; and a rewritten core evaluator that orchestrates the new architecture. Existing benchmark adapters have been migrated to this API, resulting in cleaner, more consistent, and easier-to-maintain implementations.

For incompatible (breaking) changes, please refer to the documentation.
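
The registry-based design can be pictured with a small, purely illustrative sketch. The decorator, class, and field names below (register_benchmark, Sample, MyBenchmarkAdapter) are hypothetical stand-ins rather than evalscope's actual identifiers; the real interfaces live under evalscope/api.

    # Purely illustrative: a registry-based benchmark registration in the spirit of
    # the new API layer. All names below are hypothetical, not evalscope's API.
    from dataclasses import dataclass, field

    _BENCHMARK_REGISTRY: dict[str, type] = {}

    def register_benchmark(name: str):
        """Decorator that records a benchmark adapter class under a string key."""
        def wrapper(cls):
            _BENCHMARK_REGISTRY[name] = cls
            return cls
        return wrapper

    @dataclass
    class Sample:
        """Standardized data model for a single evaluation sample."""
        input: str
        target: str
        metadata: dict = field(default_factory=dict)

    @register_benchmark("my_benchmark")
    class MyBenchmarkAdapter:
        def load(self) -> list[Sample]:
            return [Sample(input="2 + 2 = ?", target="4")]

        def score(self, prediction: str, sample: Sample) -> float:
            return float(prediction.strip() == sample.target)

    # The evaluator can then look adapters up by name:
    adapter = _BENCHMARK_REGISTRY["my_benchmark"]()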

Full Changelog: v0.17.1...v1.0.0

v0.17.1

21 Jul 02:10

New Features

  • Model stress testing now supports randomly generated image-text data for stress testing multimodal models. For usage instructions, see the documentation.
  • Added support for τ-bench, enabling evaluation of AI agent performance and reliability in realistic environments with dynamic user and tool interactions. For usage instructions, see the documentation.
  • Added support for "Humanity's Last Exam", a high-difficulty evaluation benchmark. For usage instructions, see the documentation.

Full Changelog: v0.17.0...v0.17.1