Add LongBench V2 benchmark #249

Status: Open. Wants to merge 12 commits into base: main.

Conversation

@eshwarprasadS commented Apr 30, 2025

Adding LongBench V2 to the eval options.

Install the extras with:

pip install 'instructlab-eval[longbench]'

Uses the vLLM backend to serve the model for generation.

Runs like so:

# Assuming the evaluator is importable from the package's longbench module
# (exact import path not shown in the PR):
from instructlab.eval.longbench import LongBenchEvaluator

evaluator = LongBenchEvaluator(
    model_path="path/to/model",
    num_gpus=N,  # number of GPUs for the vLLM server
    output_file="path/to/results.json",
    eval_config={"batch_size": "auto"},
    vllm_config={"max_model_len": max_len},
)

results = evaluator.run()  # Returns LongBenchResult

The output JSON looks like this:

{
  "en_multidoc": 0.5424139838230786,
  "zh_multidoc": 0.24335639081098673,
  "en_singledoc": 0.4233139199560039,
  "zh_singledoc": 0.46157875457875464,
  "en_summ": 0.27244809337990245,
  "zh_summ": 0.1359562304911904,
  "en_fewshot": 0.45692449627485754,
  "zh_fewshot": 0.24416666666666667,
  "en_synthetic": 0.3799285714285714,
  "zh_synthetic": 0.4775,
  "code_avg": 0.30225,
  "overall_score": 0.3581670097645466
}
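
Since the scores are also written to output_file, here is a minimal sketch for loading and inspecting them afterwards (file path and key names taken from the example above):

import json

# Load the per-category scores written by the evaluator.
with open("path/to/results.json") as f:
    scores = json.load(f)

# Each value is a float in [0, 1]; "overall_score" averages the categories.
for category, score in sorted(scores.items()):
    print(f"{category}: {score:.4f}")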

@RobotSail (Member) left a comment:

Thanks for the PR @eshwarprasadS !

The PR has all of the right ideas; there are just a few minor changes you'll want to make, which I've outlined in this review. Once those are addressed, this should be good to merge.

Inline comment on this diff:

) / 2

# Calculate overall score
all_scores = [v for k, v in eval_results.items() if k != "overall_score"]

@RobotSail (Member): Why do we check if k != "overall_score"? We shouldn't have set this key yet.
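
A minimal sketch of the simplification this implies, assuming eval_results holds only the per-category scores at this point:

# With no "overall_score" key present yet, the filter is unnecessary:
all_scores = list(eval_results.values())
eval_results["overall_score"] = sum(all_scores) / len(all_scores)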

@RobotSail (Member) commented:

@eshwarprasadS It looks like you may need to rebase your changes.

@RobotSail (Member) commented:

@mergify rebase

mergify bot commented Jun 2, 2025 (in reply to "rebase"):

✅ Branch has been successfully rebased

mergify bot commented Jun 2, 2025:

This pull request has merge conflicts that must be resolved before it can be merged. @eshwarprasadS please rebase it. See https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

@RobotSail (Member) commented:

@eshwarprasadS It looks like you have a few merge conflicts that need to be fixed. Once those are resolved, we can merge this.

New commit pushed. Signed-off-by: Eshwar Prasad Sivaramakrishnan <[email protected]>