Add debug mode support to 6 benchmarks (AIME24, AIME25, AIW, AMC23, HMMT, MATH500) #135

Open
wants to merge 1 commit into
base: main

@dkimds dkimds commented Jul 1, 2025

Summary

Fixes #134 by adding debug mode support to benchmarks that were missing it.

Changes

  • ✅ AIME24: Added debug slicing [:2]
  • ✅ AIME25: Added debug slicing [:2]
  • ✅ AIW: Added debug slicing [:2]
  • ✅ AMC23: Added debug slicing [:2]
  • ✅ HMMT: Added debug slicing [:2]
  • ✅ MATH500: Added debug slicing [:2]
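
The slicing listed above behaves differently depending on the container that holds the examples. Below is a minimal sketch, assuming the examples may be either a plain Python list or a Hugging Face `datasets.Dataset`-like object exposing `select`. The helper name `limit_examples` and its `limit` parameter are hypothetical and not part of this PR.

```python
from typing import Any


def limit_examples(examples: Any, debug: bool, limit: int = 2) -> Any:
    """Return at most `limit` examples when debug is on (hypothetical helper)."""
    if not debug:
        return examples
    n = min(limit, len(examples))
    # Dataset-like objects: slicing a Hugging Face Dataset returns a dict of
    # columns, so select rows by index to keep the same container type.
    if hasattr(examples, "select"):
        return examples.select(range(n))
    # Plain sequences: a slice never raises, even when the list is short.
    return examples[:n]


questions = [{"id": i} for i in range(30)]  # placeholder data
print(len(limit_examples(questions, debug=True)))   # 2
print(len(limit_examples(questions, debug=False)))  # 30
```

If every benchmark here loads its questions into a plain list, the bare slice is sufficient and the `select` branch is unnecessary.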

Testing

# Before: these would fail or ignore the --debug flag
python -m eval.eval --model hf --tasks AIME24 --debug --model_args "pretrained=microsoft/DialoGPT-medium"

# After: All work with 5 examples max
python -m eval.eval --model hf --tasks AIME24,AIME25,AIW,AMC23,HMMT,MATH500 --debug --model_args "pretrained=microsoft/DialoGPT-medium"

Implementation Pattern

Following the established pattern from MTBench and other working benchmarks:

if self.debug:
    examples = examples[:5]
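
For context, here is a rough sketch of where such a guard might sit inside a benchmark class. `ToyBenchmark`, `load_questions`, `generate_responses`, and `debug_limit` are illustrative names assumed for this example; they are not taken from the project's code, and no model is actually called.

```python
from typing import Any, Dict, List


class ToyBenchmark:
    """Hypothetical stand-in for a benchmark class; not the project's base class."""

    def __init__(self, debug: bool = False, debug_limit: int = 5):
        self.debug = debug
        self.debug_limit = debug_limit  # assumed knob; the PR hard-codes the slice

    def load_questions(self) -> List[Dict[str, Any]]:
        # Placeholder data standing in for a real dataset of 30 problems.
        return [{"id": i, "problem": f"problem {i}"} for i in range(30)]

    def generate_responses(self) -> List[Dict[str, Any]]:
        examples = self.load_questions()
        if self.debug:
            # Truncate before any model call so debug runs stay cheap.
            examples = examples[: self.debug_limit]
        # A real benchmark would query the model here; this sketch echoes the ids.
        return [{"id": ex["id"], "output": None} for ex in examples]


print(len(ToyBenchmark(debug=True).generate_responses()))   # 5
print(len(ToyBenchmark(debug=False).generate_responses()))  # 30
```

Placing the slice immediately after loading, before any generation, is what keeps the debug run short; slicing the results afterwards would not save any compute.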

Impact

✅ Consistent debug behavior across all benchmarks
✅ Faster development iteration (5 examples vs full dataset)
✅ Reduced compute costs during testing
✅ No breaking changes to existing functionality
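
To give a rough sense of the savings, the following sketch computes the fraction of each dataset that a debug run would touch. The dataset sizes (30 problems for AIME24, 500 for MATH500) are assumptions stated here for illustration, and both limits that appear in this PR description (2 and 5) are shown.

```python
def debug_fraction(dataset_size: int, debug_limit: int) -> float:
    """Fraction of the dataset evaluated when debug mode caps the example count."""
    return min(debug_limit, dataset_size) / dataset_size


# Assumed dataset sizes, for illustration only.
sizes = {"AIME24": 30, "MATH500": 500}

for name, size in sizes.items():
    for limit in (2, 5):
        share = debug_fraction(size, limit)
        print(f"{name}: limit={limit} -> {share:.1%} of examples")
```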

AIME24, AIME25, AIW, AMC23, HMMT, MATH500

dkimds commented Jul 4, 2025

Hi @neginraoof, I noticed you recently reviewed a merged PR. Would you be able to take a look at my PR as well when you have some time? I’d really appreciate your feedback. Thanks!

@dkimds dkimds changed the title from "Add debug mode to 5 benchmarks" to "Add debug mode support to 6 benchmarks (AIME24, AIME25, AIW, AMC23, HMMT, MATH500)" on Jul 8, 2025
Development

Successfully merging this pull request may close these issues.

Debug mode fails for AIME24 and 5 other benchmarks