v1.2.0

Latest

Yunnglin released this 11 Nov 04:58

9623ce2

English Version

Benchmark Datasets

Added multiple MCQA (Multiple Choice Question Answering) datasets
Added Drivelology benchmark
Updated BFCL-v3 and added support for BFCL-v4 benchmark
Updated tau-bench and added support for tau2-bench
Added support for WMT machine translation evaluation and related metrics

Feature Enhancements

Reworked the answer extraction mechanism - the extraction step is now more explicit and controllable
Added support for batch metric computation, for example BERTScore
Updated score aggregation - added metric aggregations such as pass@k, vote@k, and pass^k
Updated OpenAI API parameters - improved the configuration of API call parameters
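
For readers unfamiliar with the aggregation metrics named above, the following is a minimal sketch of their commonly used definitions. It is not taken from this release's code: the function names (`pass_at_k`, `pass_hat_k`, `vote_at_k`) and signatures are hypothetical, and the project's own implementation may differ in details such as tie-breaking. The sketch assumes `n` sampled answers per problem, of which `c` are correct.

```python
from collections import Counter
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """pass@k: chance that at least one of k samples, drawn without
    replacement from n samples with c correct, is correct."""
    if not (0 <= c <= n and 1 <= k <= n):
        raise ValueError("require 0 <= c <= n and 1 <= k <= n")
    # comb(n - c, k) counts the draws that contain no correct sample.
    return 1.0 - comb(n - c, k) / comb(n, k)


def pass_hat_k(n: int, c: int, k: int) -> float:
    """pass^k: chance that all k samples, drawn without replacement
    from n samples with c correct, are correct."""
    if not (0 <= c <= n and 1 <= k <= n):
        raise ValueError("require 0 <= c <= n and 1 <= k <= n")
    return comb(c, k) / comb(n, k)


def vote_at_k(answers: list[str], reference: str, k: int) -> float:
    """vote@k (assumed definition): majority vote over the first k answers.
    Ties go to the answer seen first, following Counter.most_common order."""
    if not 1 <= k <= len(answers):
        raise ValueError("require 1 <= k <= len(answers)")
    winner, _ = Counter(answers[:k]).most_common(1)[0]
    return 1.0 if winner == reference else 0.0


if __name__ == "__main__":
    # Hypothetical example: 5 samples for one problem, 2 of them correct.
    print(pass_at_k(n=5, c=2, k=2))
    print(pass_hat_k(n=5, c=2, k=2))
    print(vote_at_k(["B", "A", "B", "C"], reference="B", k=3))
```

pass@k rewards a model that succeeds at least once in k attempts, while pass^k rewards one that succeeds on every attempt, so for k > 1 pass^k is never larger than pass@k for the same `n` and `c`.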

Data Source Updates

Updated SimpleQA data source - now uses the latest SimpleQA data
Aligned AIME to the AA standard - unified evaluation standards
Updated MMLU-Pro - now uses the latest MMLU-Pro data

Bug Fixes

Fixed an issue with the DROP dataset when few_shot_num=3
Fixed a decode buffer error

What's Changed

[Benchmark] Add MCQA datasets by @penguinwang96825 in #923
[Benchmark] Add Drivelology benchmark by @penguinwang96825 in #927
[Benchmark] Add more MCQA datasets by @penguinwang96825 in #928
[Benchmark] Add BFCL-v4 by @Yunnglin in #934
fix (DROP allow few_shot_num=3 in dataset args) Currently, when few_shot_num=3, it will… by @yuhuan0311 in #940
[Fix] update DROP metric by @Yunnglin in #941
[Feature] Update Bertscore for DrivelologyNarrativeWriting by @Yunnglin in #935
Update SimpleQA source by @Yunnglin in #948
[Feature] Update OpenAI API parameter by @Yunnglin in #949
[Fix] decode buffer error by @Yunnglin in #954
feat: update WMT adapters and related metrics by @Epsilon617 in #938
[Benchmark] Update tau-bench and tau2-bench by @Yunnglin in #959
[Fix] Update mmlu-pro by @Yunnglin in #960
[Doc] Fixed the configuration error in BFCL-v4 documentation example (#962) by @Tsumugii24 in #963
align aime to AA by @sophies-cerebras in #965
[ADD] Implement metric aggregation pass@k and vote@k #387 by @xin8coder in #964
[Feature] make extract answer explicit by @Yunnglin in #966
[Feature] Update aggregate_scores by @Yunnglin in #967

New Contributors

@yuhuan0311 made their first contribution in #940
@Epsilon617 made their first contribution in #938
@Tsumugii24 made their first contribution in #963
@xin8coder made their first contribution in #964

Full Changelog: v1.1.1...v1.2.0

Contributors

penguinwang96825, Yunnglin, and 5 other contributors
