Fix a few benchmarks so that importing any of them works properly #127


Merged: 6 commits, Jun 5, 2025

Conversation

@jmercat (Collaborator) commented Jun 3, 2025

While trying to load all benchmarks, I found a few issues that this PR should fix.

Here is an import script that reveals the issues this PR fixes:

import importlib.util
import inspect
import os
import sys
from typing import Dict, Type

# Store all benchmark classes
benchmark_classes: Dict[str, Type] = {}


def import_benchmark(benchmark_name: str):
    """
    Dynamically import a benchmark class by temporarily adding its directory to sys.path.
    This mimics how the TaskManager handles imports to avoid relative import issues.
    """
    benchmarks_dir = "eval/chat_benchmarks"
    benchmark_path = os.path.join(benchmarks_dir, benchmark_name)
    eval_instruct_path = os.path.join(benchmark_path, "eval_instruct.py")

    if not os.path.exists(eval_instruct_path):
        print(f"Warning: eval_instruct.py not found in {benchmark_name}")
        return None

    try:
        # Temporarily add the benchmark directory to sys.path
        sys.path.insert(0, benchmark_path)
        try:
            # Import the module
            spec = importlib.util.spec_from_file_location(
                f"eval.chat_benchmarks.{benchmark_name}.eval_instruct", eval_instruct_path
            )
            module = importlib.util.module_from_spec(spec)
            spec.loader.exec_module(module)
        finally:
            # Remove the path we added, even if the import failed
            sys.path.pop(0)

        # Find benchmark classes in the module
        from eval.task import BaseBenchmark

        benchmark_classes_found = [
            cls
            for _, cls in inspect.getmembers(module, inspect.isclass)
            if (
                issubclass(cls, BaseBenchmark)
                and cls != BaseBenchmark
                and cls.__module__.replace(".", "/") in eval_instruct_path
            )
        ]

        if benchmark_classes_found:
            benchmark_class = benchmark_classes_found[0]
            benchmark_classes[benchmark_name] = benchmark_class
            # print(f"Successfully imported {benchmark_class.__name__} from {benchmark_name}")
            return benchmark_class
        else:
            print(f"Warning: No BaseBenchmark subclass found in {benchmark_name}")
            return None

    except Exception as e:
        print(f"Error importing {benchmark_name}: {str(e)}")
        return None


# List of all benchmark directories
benchmark_names = [
    "AIME24",
    "AIME25",
    "AIW",
    "alpaca_eval",
    "AMC23",
    "BigCodeBench",
    "CodeElo",
    "CodeForces",
    "CruxEval",
    "GPQADiamond",
    "HLE",
    "HMMT",
    "HumanEval",
    "HumanEvalPlus",
    "IFEval",
    "JEEBench",
    "LiveBench",
    "LiveCodeBench",
    "LiveCodeBenchv5",
    "MATH500",
    "MBPP",
    "MBPPPlus",
    "MixEval",
    "MMLUPro",
    "MTBench",
    "MultiPLE",
    "RepoBench",
    "SWEbench",
    "WildBench",
    "zeroeval",
]

# Import all benchmarks
print("Importing all benchmarks...")
for benchmark_name in benchmark_names:
    import_benchmark(benchmark_name)

print(f"\nSuccessfully imported {len(benchmark_classes)} benchmarks:")
for name, cls in benchmark_classes.items():
    print(f"  {name}: {cls.__name__}")
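The sys.path trick in import_benchmark above is what lets each benchmark's eval_instruct.py use bare, directory-relative imports. Here is a minimal, self-contained sketch of the same pattern; the file names and module contents are illustrative, not taken from the repo:

```python
import importlib.util
import os
import sys
import tempfile

# Build a throwaway "benchmark" directory: eval_instruct.py does a bare
# `import helper`, which only resolves if its own directory is on sys.path.
with tempfile.TemporaryDirectory() as d:
    with open(os.path.join(d, "helper.py"), "w") as f:
        f.write("VALUE = 42\n")
    with open(os.path.join(d, "eval_instruct.py"), "w") as f:
        f.write("import helper\nRESULT = helper.VALUE\n")

    path = os.path.join(d, "eval_instruct.py")
    spec = importlib.util.spec_from_file_location("demo.eval_instruct", path)
    module = importlib.util.module_from_spec(spec)

    sys.path.insert(0, d)  # make `import helper` resolvable
    try:
        spec.loader.exec_module(module)
    finally:
        sys.path.pop(0)  # always restore sys.path, even on failure

    print(module.RESULT)
```

Without the sys.path.insert, exec_module would raise ModuleNotFoundError for helper, which is exactly the class of failure this PR addresses.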

@jmercat jmercat requested a review from neginraoof June 3, 2025 01:45
@jmercat (Collaborator, Author) commented Jun 3, 2025

The linter failed on some files I didn't change. I linted them, but that now obscures the PR diff a bit; I recommend reviewing the first commit only.

@neginraoof (Collaborator) commented: Amazing! Thanks @jmercat!

@neginraoof (Collaborator) commented Jun 3, 2025

@jmercat I think max_tokens could still be None because of this line:

parser.add_argument(

And we cannot really set a default value here

@@ -67,7 +66,7 @@ def __init__(
         """
         super().__init__(logger=logger, system_instruction=system_instruction)
         self.debug = debug
-        self.max_new_tokens = max_tokens if max_tokens is not None else 32768  # set higher to avoid truncation for reasoning models
+        self.max_new_tokens = max_tokens
@neginraoof (Collaborator) commented:

Don't we still need to check if max_tokens is not None?

@jmercat (Collaborator, Author) commented:

Oh, OK, that's not the best way to handle it, I think. I'll revert for now, but we should probably not pass an argument as None if we don't want it to be None.
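For reference, a minimal sketch of the None-guard being discussed; the names and the default value here mirror the removed line but are otherwise hypothetical, not the repo's actual code:

```python
# Large default to avoid truncating output from reasoning models
# (value taken from the line removed in the diff above).
DEFAULT_MAX_NEW_TOKENS = 32768

def resolve_max_new_tokens(max_tokens=None):
    """Return max_tokens, falling back to the default when None is passed.

    Keeping the guard at the point of use means an argparse option with no
    default (which yields None) can't silently propagate None downstream.
    """
    return DEFAULT_MAX_NEW_TOKENS if max_tokens is None else max_tokens
```

The alternative raised in the comment is to never pass None in the first place, e.g. by giving the argparse option an explicit default.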

@jmercat jmercat mentioned this pull request Jun 3, 2025
@neginraoof neginraoof merged commit cc75611 into main Jun 5, 2025
2 checks passed