Implement leaderboard as a benchmark #234

RobotSail · 2025-03-17T13:43:19Z

This PR adds the Open LLM Leaderboard v2 as an evaluation exposed by instructlab/eval.

In particular, users can run the full leaderboard or select a subset of its tasks.

In addition, the benchmark runs each subtask on the inference backend best suited to that task.

Specifically, MCQ-style tasks (GPQA, MUSR, MMLU-Pro, and BBH) run directly through regular HF transformers, whereas generative tasks (IFEval and MATH-Hard) run through vLLM.
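
The routing described above can be sketched roughly as follows. This is a minimal illustration, not the code in this PR: `Backend`, `TASK_BACKENDS`, `group_tasks_by_backend`, and the short task keys are hypothetical names chosen for the example. Only the task-to-backend assignment comes from the description.

```python
from enum import Enum


class Backend(str, Enum):
    """Hypothetical backend labels; not the names used in instructlab/eval."""

    HF = "hf"
    VLLM = "vllm"


# Task-to-backend assignment taken from the PR description.
# The keys are illustrative short names, not actual task identifiers.
TASK_BACKENDS = {
    "gpqa": Backend.HF,
    "musr": Backend.HF,
    "mmlu_pro": Backend.HF,
    "bbh": Backend.HF,
    "ifeval": Backend.VLLM,
    "math_hard": Backend.VLLM,
}


def group_tasks_by_backend(selected=None):
    """Group the selected tasks (default: all of them) by backend."""
    tasks = list(TASK_BACKENDS) if selected is None else list(selected)
    unknown = [t for t in tasks if t not in TASK_BACKENDS]
    if unknown:
        raise ValueError(f"unknown leaderboard tasks: {unknown}")
    groups = {}
    for task in tasks:
        # Each backend collects the tasks it should run, in selection order.
        groups.setdefault(TASK_BACKENDS[task], []).append(task)
    return groups


# A user-selected subset is split into one batch per backend.
plan = group_tasks_by_backend(["ifeval", "bbh", "gpqa"])
for backend, tasks in plan.items():
    print(backend.value, tasks)
```

Grouping first means each backend would only need to be loaded once per run, which is one plausible reason to structure the evaluation this way.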

bbrowning

Assuming the failing unit test around the StrEnum import gets figured out, this looks good. The new code is purely additive, and the extra requirements are in a separate requirements.txt file, so it shouldn't cause any issues with downstream builds.

mergify bot added the ci-failure label Mar 17, 2025

RobotSail force-pushed the add-leaderboard branch from 121a86f to 4806e00 Compare March 17, 2025 13:44

mergify bot added dependencies, ci-failure, documentation and removed ci-failure labels Mar 17, 2025

bbrowning approved these changes Apr 15, 2025

mergify bot added the one-approval label Apr 15, 2025

RobotSail force-pushed the add-leaderboard branch from 1cb9078 to 0d5a983 Compare April 16, 2025 03:40

mergify bot added the testing label Apr 16, 2025

RobotSail added 9 commits April 16, 2025 03:40

initial implementation of leaderboard. Lots of stuff can be improved …

1892a79

…but this brings the core idea Signed-off-by: Oleg Silkin <[email protected]>

formatting

15e9f75

Signed-off-by: Oleg Silkin <[email protected]>

fix saving and add test script

e2b41bd

Signed-off-by: Oleg Silkin <[email protected]>

fix mypy errors

b43f697

Signed-off-by: Oleg Silkin <[email protected]>

enable users to override the default vLLM + HF settings, as well as o…

bd95672

…ptions for the `simple_evaluate` function Signed-off-by: Oleg Silkin <[email protected]>

make cache_requests be a eval_config

a257e92

Signed-off-by: Oleg Silkin <[email protected]>

enable leaderboard to run with a remote openai provider

aa573d9

Signed-off-by: Oleg Silkin <[email protected]>

make the leaderboard dependencies into an optional target under instr…

ce38464

…uctlab-eval[leaderboard] Signed-off-by: Oleg Silkin <[email protected]>

push up evaluation script

cd47eaa

Signed-off-by: Oleg Silkin <[email protected]>

RobotSail force-pushed the add-leaderboard branch from 0d5a983 to 6651188 Compare April 16, 2025 03:40

mergify bot added ci-failure and removed ci-failure labels Apr 16, 2025

RobotSail force-pushed the add-leaderboard branch from 6651188 to 1e40fc6 Compare April 16, 2025 04:01

mergify bot added CI/CD, ci-failure and removed ci-failure labels Apr 16, 2025

add requirements file for leaderboard

66fb8bb

Signed-off-by: Oleg Silkin <[email protected]>

RobotSail force-pushed the add-leaderboard branch from 1e40fc6 to 66fb8bb Compare April 16, 2025 04:24

mergify bot removed the ci-failure label Apr 16, 2025

RobotSail merged commit cea8acd into instructlab:main Apr 16, 2025
18 checks passed
