Releases · HAE-RAE/haerae-evaluation-toolkit
v0.1.0pre
What's Changed
- do_sample add by @tryumanshow in #142
- fix typo in 01-quick-start.md by @csy1204 in #144
- major update by @h-albert-lee in #145
- add distributed inference by @h-albert-lee in #146
- add benchhub (testing phase) by @h-albert-lee in #147
- add benchhub info class by @h-albert-lee in #148
- add benchhub info into result class by @h-albert-lee in #149
- add benchhub features and get ready for pypi deploy by @h-albert-lee in #150
- updated docs by @h-albert-lee in #151
New Contributors
- @tryumanshow made their first contribution in #142
- @csy1204 made their first contribution in #144
Full Changelog: v0.1.0dev...v0.1.0pre
v0.1.0dev
What's Changed
- Update README.md by @Dasol-Choi in #141
- Bugfix: OpenAI-compatible API backends
Full Changelog: v0.1.0...v0.1.0dev
v0.1.0beta
HRET (Haerae-Evaluation-Toolkit) v0.1.0 Release Notes
Release Date: April 6, 2025
Announcing the first public release of the Haerae-Evaluation-Toolkit (HRET) v0.1.0! HRET is an open-source Python library designed to streamline and standardize the evaluation of Large Language Models (LLMs), with a particular focus on Korean.
This initial release provides core functionalities for LLM evaluation, including diverse benchmarks, flexible backends, and advanced inference techniques.
✨ Highlights
- Standardized Evaluation Pipeline: Unified workflow for dataset loading, model inference, scaling, and evaluation (`PipelineRunner`, `Evaluator`); see the sketch after this list.
- Versatile Evaluation Metrics: Supports String Match, Partial Match, Log Likelihood, Math Equivalence, and LLM-as-a-Judge.
- Advanced Inference Techniques: Includes test-time scaling methods like Best-of-N, Beam Search, and Self-Consistency.
- Flexible Backend Integration: Supports Hugging Face Transformers, OpenAI-compatible APIs (e.g., vLLM), and LiteLLM (over 30 providers).
- Korean Benchmarks Included: Built-in loaders for HAE-RAE Bench, KMMLU, KUDGE, HRM8K, CLIcK, K2-Eval, and more.
- LLM-as-a-Judge & Reward Models: Integrate separate Judge/Reward models using `MultiModel`.
- Korean Language Focus: Includes a language penalizer for evaluating response consistency.
- CLI & Python API: Offers both a command-line interface and a Python API for ease of use.
- Custom CoT Parser: Supports integration of custom Chain-of-Thought parsers.
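
To make the pipeline concrete, here is a minimal Python sketch of running a single evaluation. The `Evaluator` entry point is named above; the module path, the `run()` method, and its keyword arguments (`model`, `model_params`, `dataset`, `evaluation_method`) are assumptions modeled on the quick-start tutorial rather than a verified API surface, so check the shipped docs for the exact signatures.

```python
# Minimal sketch; module path and keyword names are assumptions, not the verified API.
from llm_eval.evaluator import Evaluator  # import path assumed

evaluator = Evaluator()
results = evaluator.run(
    model="huggingface",                              # backend: "huggingface", "openai", "litellm", ...
    model_params={"model_name_or_path": "Qwen/Qwen2.5-1.5B-Instruct"},  # illustrative model
    dataset="haerae_bench",                           # any built-in Korean benchmark loader
    evaluation_method="string_match",                 # or "partial_match", "llm_judge", ...
)
print(results)                                        # per-sample outputs plus aggregate metrics
```

The same run can also be launched from the command-line interface mentioned above; the Python API is the more convenient route when you want to post-process results programmatically.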
🚀 What's Included
- Evaluation Methods (`llm_eval.evaluation`):
  - `StringMatchEvaluator`, `PartialMatchEvaluator`
  - `MathMatchEvaluator` (using `math_verify`)
  - `LogProbEvaluator`
  - `LLMJudgeEvaluator`
- Scaling Methods (`llm_eval.scaling_methods`): `BestOfN`, `BeamSearch`, `SelfConsistencyScalingMethod` (composed with a backend and an evaluator in the sketch after this list).
- Model Backends (`llm_eval.models`):
  - Hugging Face (`HuggingFaceModel`, `HuggingFaceJudge`, `HuggingFaceReward`)
  - OpenAI-Compatible (`OpenAIModel`, `OpenAIJudge`)
  - LiteLLM (`LiteLLMBackend`, `LiteLLMJudge`)
  - `MultiModel` for combining backends
- Dataset Loaders (`llm_eval.datasets`):
  - HAE-RAE Bench, KMMLU, CLIcK, HRM8K, K2-Eval, KUDGE
  - `GenericFileDataset` for local files (CSV, Parquet, XLSX)
- Utilities (`llm_eval.utils`): Logging, language penalizer, prompt templates, CoT parser utilities.
- Documentation & Tutorials: Korean/English tutorials and Sphinx API docs.
- Installation: Support for both `pip` and `uv`.
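
To show how the backends, scaling methods, and evaluators above are meant to compose, here is an illustrative `PipelineRunner` sketch. The constructor keywords (`dataset_name`, `model_backend_name`, `scaling_method_name`, `evaluation_method_name`, and their `*_params` counterparts) are assumptions inferred from the component names listed above, not a verbatim copy of the released API; the Sphinx docs have the authoritative interface.

```python
# Illustrative composition sketch; import path and parameter names are assumptions.
from llm_eval.runner import PipelineRunner  # import path assumed

runner = PipelineRunner(
    dataset_name="kmmlu",                      # built-in Korean benchmark loader
    model_backend_name="openai",               # OpenAI-compatible endpoint, e.g. a local vLLM server
    model_backend_params={
        "api_base": "http://localhost:8000/v1",
        "model_name": "my-served-model",
    },
    scaling_method_name="self_consistency",    # test-time scaling: BestOfN / BeamSearch / SelfConsistency
    evaluation_method_name="string_match",     # any evaluator from llm_eval.evaluation
)
results = runner.run()
print(results)
```

Targeting an OpenAI-compatible backend here keeps the same pipeline usable against a locally served vLLM endpoint or a hosted API without code changes, which is the point of the flexible backend layer described above.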