Benchmark Datasets
- Added multiple MCQA (Multiple Choice Question Answering) datasets
- Added Drivelology benchmark
- Updated BFCL-v3 and added support for BFCL-v4 benchmark
- Updated tau-bench and added support for tau2-bench
- Added support for WMT machine translation evaluation and related metrics
Feature Enhancements
- Reworked answer extraction - the extraction step is now explicit and controllable
- Added support for batch metric computation, for example BERTScore
- Updated aggregate scoring - added pass@k, vote@k, and pass^k metric aggregation
- Updated OpenAI API parameters - adjusted the configuration of API call parameters
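
For readers unfamiliar with the aggregation metrics named above, here is a minimal sketch of how pass@k, pass^k, and vote@k are commonly defined. It is not the project's implementation: the function names `pass_at_k`, `pass_hat_k`, and `vote_at_k` are hypothetical, the first two use the usual combinatorial estimators over n samples with c correct, and the majority-vote definition of vote@k (including first-seen tie-breaking) is an assumption.

```python
from collections import Counter
from math import comb


def _check(n: int, c: int, k: int) -> None:
    # Guard against inputs for which the estimators are undefined.
    if not (0 <= c <= n and 1 <= k <= n):
        raise ValueError("require 0 <= c <= n and 1 <= k <= n")


def pass_at_k(n: int, c: int, k: int) -> float:
    """Chance that at least one of k samples, drawn without
    replacement from n samples of which c are correct, is correct."""
    _check(n, c, k)
    return 1.0 - comb(n - c, k) / comb(n, k)


def pass_hat_k(n: int, c: int, k: int) -> float:
    """Chance that all k samples, drawn without replacement from
    n samples of which c are correct, are correct (pass^k)."""
    _check(n, c, k)
    return comb(c, k) / comb(n, k)


def vote_at_k(predictions: list[str], reference: str, k: int) -> float:
    """Assumed definition: 1.0 if the most frequent answer among the
    first k predictions equals the reference, else 0.0. Ties go to
    the answer that appeared first."""
    if not 1 <= k <= len(predictions):
        raise ValueError("require 1 <= k <= len(predictions)")
    winner, _ = Counter(predictions[:k]).most_common(1)[0]
    return 1.0 if winner == reference else 0.0


if __name__ == "__main__":
    # Four samples for one problem, two of them correct.
    print(pass_at_k(n=4, c=2, k=2))
    print(pass_hat_k(n=4, c=2, k=2))
    print(vote_at_k(["A", "B", "A", "C"], reference="A", k=3))
```

pass@k rewards a model that succeeds at least once in k attempts, while pass^k rewards one that succeeds on every attempt, so the two diverge sharply for unreliable models.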
Data Source Updates
- Updated SimpleQA data source - using the latest SimpleQA data
- Aligned AIME to AA standard - unified evaluation standards
- Updated MMLU-Pro - using the latest MMLU-Pro data
Bug Fixes
- Fixed an issue with the DROP dataset when few_shot_num=3
- Fixed a decode buffer error
What's Changed
- [Benchmark] Add MCQA datasets by @penguinwang96825 in #923
- [Benchmark] Add Drivelology benchmark by @penguinwang96825 in #927
- [Benchmark] Add more MCQA datasets by @penguinwang96825 in #928
- [Benchmark] Add BFCL-v4 by @Yunnglin in #934
- fix (DROP: allow few_shot_num=3 in dataset args) Currently, when few_shot_num=3, … by @yuhuan0311 in #940
- [Fix] update DROP metric by @Yunnglin in #941
- [Feature] Update Bertscore for DrivelologyNarrativeWriting by @Yunnglin in #935
- Update SimpleQA source by @Yunnglin in #948
- [Feature] Update OpenAI API parameter by @Yunnglin in #949
- [Fix] decode buffer error by @Yunnglin in #954
- feat: update WMT adapters and related metrics by @Epsilon617 in #938
- [Benchmark] Update tau-bench and tau2-bench by @Yunnglin in #959
- [Fix] Update mmlu-pro by @Yunnglin in #960
- [Doc] Fixed the configuration error in BFCL-v4 documentation example (#962) by @Tsumugii24 in #963
- align aime to AA by @sophies-cerebras in #965
- [ADD] Implement metric aggregation pass@k and vote@k #387 by @xin8coder in #964
- [Feature] make extract answer explicit by @Yunnglin in #966
- [Feature] Update aggregate_scores by @Yunnglin in #967
New Contributors
- @yuhuan0311 made their first contribution in #940
- @Epsilon617 made their first contribution in #938
- @Tsumugii24 made their first contribution in #963
- @xin8coder made their first contribution in #964
Full Changelog: v1.1.1...v1.2.0