v1.2.0

@Yunnglin released this on 11 Nov at 04:58

English Version

Benchmark Datasets

  • Added multiple MCQA (Multiple Choice Question Answering) datasets
  • Added Drivelology benchmark
  • Updated BFCL-v3 and added support for the BFCL-v4 benchmark
  • Updated tau-bench and added support for tau2-bench
  • Added support for WMT machine translation evaluation and its related metrics

Feature Enhancements

  • Improved the answer extraction mechanism, making the extraction process more explicit and controllable
  • Added support for batch metric computation, for metrics such as BERTScore
  • Updated aggregate scoring - added pass@k, vote@k, and pass^k metric aggregations
  • Updated OpenAI API parameters - optimized the API call parameter configuration
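
For readers unfamiliar with the three aggregations named above, the following is a minimal sketch of their commonly used definitions. It is not taken from this project's code: the function names `pass_at_k`, `pass_hat_k`, and `vote_at_k` are hypothetical, and the release notes do not state which exact formulas the project uses. The sketch assumes n sampled answers per problem, of which c are correct.

```python
from collections import Counter
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Chance that at least one of k samples drawn from n (c correct) is correct."""
    if not (0 < k <= n and 0 <= c <= n):
        raise ValueError("require 0 < k <= n and 0 <= c <= n")
    # comb(n - c, k) is 0 when fewer than k incorrect samples exist.
    return 1.0 - comb(n - c, k) / comb(n, k)


def pass_hat_k(n: int, c: int, k: int) -> float:
    """pass^k: chance that all k samples drawn from n (c correct) are correct."""
    if not (0 < k <= n and 0 <= c <= n):
        raise ValueError("require 0 < k <= n and 0 <= c <= n")
    return comb(c, k) / comb(n, k)


def vote_at_k(answers: list[str], reference: str, k: int) -> float:
    """Majority vote over the first k answers; ties go to the earliest answer seen."""
    if not 0 < k <= len(answers):
        raise ValueError("require 0 < k <= len(answers)")
    winner, _ = Counter(answers[:k]).most_common(1)[0]
    return float(winner == reference)


if __name__ == "__main__":
    # 4 samples, 2 of them correct, aggregated at k = 2.
    print(pass_at_k(4, 2, 2))
    print(pass_hat_k(4, 2, 2))
    print(vote_at_k(["A", "B", "A", "C"], "A", 3))
```

Under these definitions pass@k rewards a model that succeeds at least once, while pass^k rewards one that succeeds every time, so the two diverge sharply for inconsistent models.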

Data Source Updates

  • Updated the SimpleQA data source to use the latest SimpleQA data
  • Aligned AIME with the AA standard to unify evaluation criteria
  • Updated MMLU-Pro to use the latest MMLU-Pro data

Bug Fixes

  • Fixed an issue with the DROP dataset when few_shot_num=3
  • Fixed a buffer decoding error

Full Changelog: v1.1.1...v1.2.0