
Releases: HAE-RAE/haerae-evaluation-toolkit

v0.1.0pre

07 Jun 12:19
099b37a

What's Changed

New Contributors

Full Changelog: v0.1.0dev...v0.1.0pre

v0.1.0dev

08 Apr 01:49

What's Changed

  • Update README.md by @Dasol-Choi in #141
  • Bug fix: OpenAI-compatible API backends

Full Changelog: v0.1.0...v0.1.0dev

v0.1.0beta

06 Apr 02:26
af50ac8
Pre-release

HRET (Haerae-Evaluation-Toolkit) v0.1.0 Release Notes

Release Date: April 6, 2025

Announcing the first public release of the Haerae-Evaluation-Toolkit (HRET) v0.1.0! HRET is an open-source Python library designed to streamline and standardize the evaluation of Large Language Models (LLMs), with a particular focus on the Korean language.

This initial release provides core functionalities for LLM evaluation, including diverse benchmarks, flexible backends, and advanced inference techniques.

✨ Highlights

  • Standardized Evaluation Pipeline: Unified workflow for dataset loading, model inference, scaling, and evaluation (PipelineRunner, Evaluator).
  • Versatile Evaluation Metrics: Supports String Match, Partial Match, Log Likelihood, Math Equivalence, and LLM-as-a-Judge.
  • Advanced Inference Techniques: Includes test-time scaling methods like Best-of-N, Beam Search, and Self-Consistency.
  • Flexible Backend Integration: Supports Hugging Face Transformers, OpenAI-compatible APIs (e.g., vLLM), and LiteLLM (over 30 providers).
  • Korean Benchmarks Included: Built-in loaders for HAE-RAE Bench, KMMLU, KUDGE, HRM8K, CLIcK, K2-Eval, and more.
  • LLM-as-a-Judge & Reward Models: Integrate separate Judge/Reward models using MultiModel.
  • Korean Language Focus: Includes a language penalizer for evaluating response consistency.
  • CLI & Python API: Offers both a command-line interface and a Python API for ease of use (see the usage sketch after this list).
  • Custom CoT Parser: Supports integration of custom Chain-of-Thought parsers.
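
To give a feel for the Python API, here is a minimal usage sketch. The Evaluator entry point and the backend/dataset/method names come from the lists in these notes, but the import path, keyword arguments, and values shown below are illustrative assumptions rather than confirmed v0.1.0 signatures; see the tutorials and Sphinx API docs for the authoritative interface.

```python
# Minimal usage sketch. The import path, keyword arguments, and values below are
# illustrative assumptions, not the confirmed v0.1.0 signatures.
from llm_eval import Evaluator  # entry point named in the Highlights (path assumed)

evaluator = Evaluator()
results = evaluator.run(
    dataset="haerae_bench",             # one of the built-in Korean benchmark loaders
    model="huggingface",                # backend key (assumed): huggingface / openai / litellm
    model_params={"model_name_or_path": "EleutherAI/polyglot-ko-1.3b"},  # example checkpoint
    scaling_method="self_consistency",  # optional test-time scaling
    evaluation_method="string_match",   # or partial_match, log_likelihood, llm_judge, ...
)
print(results)
```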

🚀 What's Included

  • Evaluation Methods (llm_eval.evaluation):
    • StringMatchEvaluator, PartialMatchEvaluator
    • MathMatchEvaluator (using math_verify)
    • LogProbEvaluator
    • LLMJudgeEvaluator
  • Scaling Methods (llm_eval.scaling_methods):
    • BestOfN, BeamSearch, SelfConsistencyScalingMethod (a conceptual sketch of self-consistency follows this list)
  • Model Backends (llm_eval.models):
    • Hugging Face (HuggingFaceModel, HuggingFaceJudge, HuggingFaceReward)
    • OpenAI-Compatible (OpenAIModel, OpenAIJudge)
    • LiteLLM (LiteLLMBackend, LiteLLMJudge)
    • MultiModel for combining backends
  • Dataset Loaders (llm_eval.datasets):
    • HAE-RAE Bench, KMMLU, CLIcK, HRM8K, K2-Eval, KUDGE
    • GenericFileDataset for local files (CSV, Parquet, XLSX)
  • Utilities (llm_eval.utils): Logging, language penalizer, prompt templates, CoT parser utilities.
  • Documentation & Tutorials: Korean/English tutorials and Sphinx API docs.
  • Installation: Support for both pip and uv.
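
For readers new to test-time scaling, the sketch below shows the idea behind self-consistency in plain Python: sample several candidate answers and keep the most frequent one. The generate_fn callable is a hypothetical stand-in for any sampling-enabled backend; this is a conceptual illustration of the technique, not the toolkit's SelfConsistencyScalingMethod implementation.

```python
from collections import Counter
from typing import Callable, List

def self_consistency(generate_fn: Callable[[str], str], prompt: str, n: int = 5) -> str:
    """Conceptual self-consistency: sample n candidate answers and majority-vote.

    `generate_fn` is a hypothetical stand-in for a sampling-enabled backend call;
    this illustrates the technique itself, not HRET's internal implementation.
    """
    answers: List[str] = [generate_fn(prompt) for _ in range(n)]
    # Keep the answer that appears most often across the n samples.
    best_answer, _count = Counter(answers).most_common(1)[0]
    return best_answer

# Example with a toy generator (replace with a real backend call):
# best = self_consistency(lambda p: my_model.sample(p), "13 + 29 = ?", n=8)
```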