Releases · HAE-RAE/haerae-evaluation-toolkit
v0.1.0pre
What's Changed
- do_sample add by @tryumanshow in #142
- fix typo in 01-quick-start.md by @csy1204 in #144
- major update by @h-albert-lee in #145
- add distributed inference by @h-albert-lee in #146
- add benchhub (testing phase) by @h-albert-lee in #147
- add benchhub info class by @h-albert-lee in #148
- add benchhub info into result class by @h-albert-lee in #149
- add benchhub features and get ready for pypi deploy by @h-albert-lee in #150
- updated docs by @h-albert-lee in #151
New Contributors
- @tryumanshow made their first contribution in #142
- @csy1204 made their first contribution in #144
Full Changelog: v0.1.0dev...v0.1.0pre
v0.1.0dev
What's Changed
- Update README.md by @Dasol-Choi in #141
- Bugfix: OpenAI-compatible API backends
Full Changelog: v0.1.0...v0.1.0dev
v0.1.0beta
HRET (Haerae-Evaluation-Toolkit) v0.1.0 Release Notes
Release Date: April 6, 2025
Announcing the first public release of the Haerae-Evaluation-Toolkit (HRET) v0.1.0! HRET is an open-source Python library designed to streamline and standardize the evaluation of Large Language Models (LLMs), with a particular focus on Korean.
This initial release provides core functionalities for LLM evaluation, including diverse benchmarks, flexible backends, and advanced inference techniques.
✨ Highlights
- Standardized Evaluation Pipeline: Unified workflow for dataset loading, model inference, scaling, and evaluation (`PipelineRunner`, `Evaluator`); see the sketch after this list.
- Versatile Evaluation Metrics: Supports String Match, Partial Match, Log Likelihood, Math Equivalence, and LLM-as-a-Judge.
- Advanced Inference Techniques: Includes test-time scaling methods like Best-of-N, Beam Search, and Self-Consistency.
- Flexible Backend Integration: Supports Hugging Face Transformers, OpenAI-compatible APIs (e.g., vLLM), and LiteLLM (over 30 providers).
- Korean Benchmarks Included: Built-in loaders for HAE-RAE Bench, KMMLU, KUDGE, HRM8K, CLIcK, K2-Eval, and more.
- LLM-as-a-Judge & Reward Models: Integrate separate Judge/Reward models using `MultiModel`.
- Korean Language Focus: Includes a language penalizer for evaluating response consistency.
- CLI & Python API: Offers both a command-line interface and a Python API for ease of use.
- Custom CoT Parser: Supports integration of custom Chain-of-Thought parsers.
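
To make the pipeline concrete, here is a minimal Python sketch of running a single evaluation. The `Evaluator` entry point is named above; the module path, the `run()` method, and its keyword arguments (`model`, `model_params`, `dataset`, `evaluation_method`) are assumptions modeled on the quick-start tutorial rather than a verified API surface, so check the shipped docs for the exact signatures.

```python
# Minimal sketch; module path and keyword names are assumptions, not the verified API.
from llm_eval.evaluator import Evaluator  # import path assumed

evaluator = Evaluator()
results = evaluator.run(
    model="huggingface",                              # backend: "huggingface", "openai", "litellm", ...
    model_params={"model_name_or_path": "Qwen/Qwen2.5-1.5B-Instruct"},  # illustrative model
    dataset="haerae_bench",                           # any built-in Korean benchmark loader
    evaluation_method="string_match",                 # or "partial_match", "llm_judge", ...
)
print(results)                                        # per-sample outputs plus aggregate metrics
```

The same run can also be launched from the command-line interface mentioned above; the Python API is the more convenient route when you want to post-process results programmatically.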
🚀 What's Included
- Evaluation Methods (`llm_eval.evaluation`):
  - `StringMatchEvaluator`, `PartialMatchEvaluator`
  - `MathMatchEvaluator` (using `math_verify`)
  - `LogProbEvaluator`
  - `LLMJudgeEvaluator`
- Scaling Methods (`llm_eval.scaling_methods`): `BestOfN`, `BeamSearch`, `SelfConsistencyScalingMethod` (composed with a backend and an evaluator in the sketch after this list).
- Model Backends (`llm_eval.models`):
  - Hugging Face (`HuggingFaceModel`, `HuggingFaceJudge`, `HuggingFaceReward`)
  - OpenAI-Compatible (`OpenAIModel`, `OpenAIJudge`)
  - LiteLLM (`LiteLLMBackend`, `LiteLLMJudge`)
  - `MultiModel` for combining backends
- Dataset Loaders (`llm_eval.datasets`):
  - HAE-RAE Bench, KMMLU, CLIcK, HRM8K, K2-Eval, KUDGE
  - `GenericFileDataset` for local files (CSV, Parquet, XLSX)
- Utilities (`llm_eval.utils`): Logging, language penalizer, prompt templates, CoT parser utilities.
- Documentation & Tutorials: Korean/English tutorials and Sphinx API docs.
- Installation: Support for both `pip` and `uv`.
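
To show how the backends, scaling methods, and evaluators above are meant to compose, here is an illustrative `PipelineRunner` sketch. The constructor keywords (`dataset_name`, `model_backend_name`, `scaling_method_name`, `evaluation_method_name`, and their `*_params` counterparts) are assumptions inferred from the component names listed above, not a verbatim copy of the released API; the Sphinx docs have the authoritative interface.

```python
# Illustrative composition sketch; import path and parameter names are assumptions.
from llm_eval.runner import PipelineRunner  # import path assumed

runner = PipelineRunner(
    dataset_name="kmmlu",                      # built-in Korean benchmark loader
    model_backend_name="openai",               # OpenAI-compatible endpoint, e.g. a local vLLM server
    model_backend_params={
        "api_base": "http://localhost:8000/v1",
        "model_name": "my-served-model",
    },
    scaling_method_name="self_consistency",    # test-time scaling: BestOfN / BeamSearch / SelfConsistency
    evaluation_method_name="string_match",     # any evaluator from llm_eval.evaluation
)
results = runner.run()
print(results)
```

Targeting an OpenAI-compatible backend here keeps the same pipeline usable against a locally served vLLM endpoint or a hosted API without code changes, which is the point of the flexible backend layer described above.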