Fix a few benchmarks so that importing any of them works properly #127


Merged: 6 commits, Jun 5, 2025

Conversation

@jmercat (Collaborator) commented Jun 3, 2025

While trying to load all benchmarks, I found a few issues that this PR should fix.

Here is an import script that reveals the issues this PR fixes:

import importlib.util
import inspect
import os
import sys
from typing import Dict, Type

# Store all benchmark classes
benchmark_classes: Dict[str, Type] = {}


def import_benchmark(benchmark_name: str):
    """
    Dynamically import a benchmark class by temporarily adding its directory to sys.path.
    This mimics how the TaskManager handles imports to avoid relative import issues.
    """
    benchmarks_dir = "eval/chat_benchmarks"
    benchmark_path = os.path.join(benchmarks_dir, benchmark_name)
    eval_instruct_path = os.path.join(benchmark_path, "eval_instruct.py")

    if not os.path.exists(eval_instruct_path):
        print(f"Warning: eval_instruct.py not found in {benchmark_name}")
        return None

    try:
        # Temporarily add the benchmark directory to sys.path
        sys.path.insert(0, benchmark_path)
        try:
            # Import the module
            spec = importlib.util.spec_from_file_location(
                f"eval.chat_benchmarks.{benchmark_name}.eval_instruct", eval_instruct_path
            )
            module = importlib.util.module_from_spec(spec)
            spec.loader.exec_module(module)
        finally:
            # Remove the path we added, even if the import failed
            sys.path.pop(0)

        # Find benchmark classes in the module
        from eval.task import BaseBenchmark

        benchmark_classes_found = [
            cls
            for _, cls in inspect.getmembers(module, inspect.isclass)
            if (
                issubclass(cls, BaseBenchmark)
                and cls != BaseBenchmark
                and cls.__module__.replace(".", "/") in eval_instruct_path
            )
        ]

        if benchmark_classes_found:
            benchmark_class = benchmark_classes_found[0]
            benchmark_classes[benchmark_name] = benchmark_class
            # print(f"Successfully imported {benchmark_class.__name__} from {benchmark_name}")
            return benchmark_class
        else:
            print(f"Warning: No BaseBenchmark subclass found in {benchmark_name}")
            return None

    except Exception as e:
        print(f"Error importing {benchmark_name}: {str(e)}")
        return None


# List of all benchmark directories
benchmark_names = [
    "AIME24",
    "AIME25",
    "AIW",
    "alpaca_eval",
    "AMC23",
    "BigCodeBench",
    "CodeElo",
    "CodeForces",
    "CruxEval",
    "GPQADiamond",
    "HLE",
    "HMMT",
    "HumanEval",
    "HumanEvalPlus",
    "IFEval",
    "JEEBench",
    "LiveBench",
    "LiveCodeBench",
    "LiveCodeBenchv5",
    "MATH500",
    "MBPP",
    "MBPPPlus",
    "MixEval",
    "MMLUPro",
    "MTBench",
    "MultiPLE",
    "RepoBench",
    "SWEbench",
    "WildBench",
    "zeroeval",
]

# Import all benchmarks
print("Importing all benchmarks...")
for benchmark_name in benchmark_names:
    import_benchmark(benchmark_name)

print(f"\nSuccessfully imported {len(benchmark_classes)} benchmarks:")
for name, cls in benchmark_classes.items():
    print(f"  {name}: {cls.__name__}")
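The sys.path trick in import_benchmark above is what lets each benchmark's eval_instruct.py use bare, directory-relative imports. Here is a minimal, self-contained sketch of the same pattern; the file names and module contents are illustrative, not taken from the repo:

```python
import importlib.util
import os
import sys
import tempfile

# Build a throwaway "benchmark" directory: eval_instruct.py does a bare
# `import helper`, which only resolves if its own directory is on sys.path.
with tempfile.TemporaryDirectory() as d:
    with open(os.path.join(d, "helper.py"), "w") as f:
        f.write("VALUE = 42\n")
    with open(os.path.join(d, "eval_instruct.py"), "w") as f:
        f.write("import helper\nRESULT = helper.VALUE\n")

    path = os.path.join(d, "eval_instruct.py")
    spec = importlib.util.spec_from_file_location("demo.eval_instruct", path)
    module = importlib.util.module_from_spec(spec)

    sys.path.insert(0, d)  # make `import helper` resolvable
    try:
        spec.loader.exec_module(module)
    finally:
        sys.path.pop(0)  # always restore sys.path, even on failure

    print(module.RESULT)
```

Without the sys.path.insert, exec_module would raise ModuleNotFoundError for helper, which is exactly the class of failure this PR addresses.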

@jmercat jmercat requested a review from neginraoof June 3, 2025 01:45
@jmercat (Collaborator, Author) commented Jun 3, 2025

The linter failed on some files I didn't change. I linted them, but that now obscures the PR diff a bit; I recommend reviewing the first commit only.

@neginraoof (Collaborator) commented: Amazing! Thanks @jmercat!

@neginraoof (Collaborator) commented Jun 3, 2025

@jmercat I think max_tokens could still be None because of this line:

parser.add_argument(

And we cannot really set a default value here

@@ -67,7 +66,7 @@ def __init__(
         """
         super().__init__(logger=logger, system_instruction=system_instruction)
         self.debug = debug
-        self.max_new_tokens = max_tokens if max_tokens is not None else 32768  # set higher to avoid truncation for reasoning models
+        self.max_new_tokens = max_tokens
@neginraoof (Collaborator) commented:

Don't we still need to check if max_tokens is not None?

@jmercat (Collaborator, Author) commented:

Oh, OK, that's not the best way to handle it, I think. I'll revert for now, but we should probably not pass an argument as None if we don't want it to be None.
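For reference, a minimal sketch of the None-guard being discussed; the names and the default value here mirror the removed line but are otherwise hypothetical, not the repo's actual code:

```python
# Large default to avoid truncating output from reasoning models
# (value taken from the line removed in the diff above).
DEFAULT_MAX_NEW_TOKENS = 32768

def resolve_max_new_tokens(max_tokens=None):
    """Return max_tokens, falling back to the default when None is passed.

    Keeping the guard at the point of use means an argparse option with no
    default (which yields None) can't silently propagate None downstream.
    """
    return DEFAULT_MAX_NEW_TOKENS if max_tokens is None else max_tokens
```

The alternative raised in the comment is to never pass None in the first place, e.g. by giving the argparse option an explicit default.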

@jmercat jmercat mentioned this pull request Jun 3, 2025
@neginraoof neginraoof merged commit cc75611 into main Jun 5, 2025
2 checks passed