
Unable to Reproduce MILU Benchmark Results from Paper #1

Open · abhinand5 opened this issue Nov 12, 2024 · 3 comments
abhinand5 commented Nov 12, 2024

Problem

I'm encountering discrepancies while trying to reproduce the benchmark results reported in the paper (Table 3, page 6) for Indic languages. Using the sarvamai/sarvam-1 model from Hugging Face, I'm observing a consistent gap of roughly 3 percentage points below the reported scores.

Note: I've run additional tests with meta-llama/Llama-3.2-1B-Instruct and was able to successfully reproduce the scores reported in the paper (±0.1%) for this model. This suggests the discrepancy might be specific to the sarvam-1 model rather than a systematic issue with the benchmark implementation.
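For reference, this is roughly how I'm running the evaluation — a sketch only, assuming the MILU tasks are run through lm-evaluation-harness's Python API (the results table below is in that format); the exact model arguments and batch size in my actual runs may differ:

```python
# Sketch of the evaluation call (assumed setup; exact arguments in my runs may differ).
import lm_eval

results = lm_eval.simple_evaluate(
    model="hf",  # Hugging Face transformers backend
    model_args="pretrained=sarvamai/sarvam-1,dtype=bfloat16",
    tasks=["milu"],  # MILU group task (per-language subtasks reported below)
    num_fewshot=5,
    batch_size=8,
)
print(results["results"])
```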

Environment

PyTorch: 2.4.0
CUDA: 12.4.1
OS: Ubuntu 22.04
Transformers: 4.46.2
GPU: A40 48GB

Results from paper

[Image: Table 3 (page 6) from the paper with the reported MILU scores]

Current Results

| Tasks | Version | Filter | n-shot | Metric | Value | Stderr |
|---|---|---|---|---|---|---|
| milu | 0 | none | | acc | 0.2941 | ± 0.0016 |
| - milu_Bengali | 0 | none | 5 | acc | 0.2879 | ± 0.0054 |
| - milu_English | 0 | none | 5 | acc | 0.3275 | ± 0.0040 |
| - milu_Gujarati | 0 | none | 5 | acc | 0.2943 | ± 0.0062 |
| - milu_Hindi | 0 | none | 5 | acc | 0.2946 | ± 0.0037 |
| - milu_Kannada | 0 | none | 5 | acc | 0.2908 | ± 0.0055 |
| - milu_Malayalam | 0 | none | 5 | acc | 0.2739 | ± 0.0065 |
| - milu_Marathi | 0 | none | 5 | acc | 0.2868 | ± 0.0052 |
| - milu_Odia | 0 | none | 5 | acc | 0.2919 | ± 0.0064 |
| - milu_Punjabi | 0 | none | 5 | acc | 0.2837 | ± 0.0068 |
| - milu_Tamil | 0 | none | 5 | acc | 0.2757 | ± 0.0053 |
| - milu_Telugu | 0 | none | 5 | acc | 0.2839 | ± 0.0051 |

| Groups | Version | Filter | n-shot | Metric | Value | Stderr |
|---|---|---|---|---|---|---|
| milu | 0 | none | | acc | 0.2941 | ± 0.0016 |

What I've Tried

  • Multiple runs with different batch sizes
  • Different random seeds
  • Modified generation parameters

None of these attempts have significantly affected the results or closed the performance gap.

Additional Context

  • Model used: sarvamai/sarvam-1
  • The gap of roughly 3 percentage points seems too large to be explained by typical run-to-run variation from seeds or batch sizes
  • Could the difference in few-shot examples be causing this? (I ran the benchmarks more than 10 times, by the way, and the scores were similar every time)

Query

Could you help identify what might be causing this significant discrepancy between my results and those reported in the paper?

Sshubam (Collaborator) commented Nov 16, 2024

We can confirm that the Sarvam-1 model was updated on November 8th, following our evaluation.

Attaching the reference for the same.
[Screenshot: Hugging Face commit history for sarvamai/sarvam-1 showing the November 8 update]

abhinand5 (Author) commented

Alright, that might explain the difference. But a ~3% drop after an update? Which version (commit ID) of the model did you use? I'll rerun it and let you know.

Sshubam (Collaborator) commented Nov 16, 2024

As per the Hugging Face commit history of the model, I think this commit ID points to the older version of the model: af02628dcadf4f5ad1c0f71b23cd2e28af466b81
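If it helps, you can pin that revision when loading the model — a minimal sketch using the `revision` argument in transformers (model ID and commit hash as above):

```python
# Sketch: load the pre-update weights by pinning the revision to the older commit.
from transformers import AutoModelForCausalLM, AutoTokenizer

revision = "af02628dcadf4f5ad1c0f71b23cd2e28af466b81"  # older sarvam-1 commit
tokenizer = AutoTokenizer.from_pretrained("sarvamai/sarvam-1", revision=revision)
model = AutoModelForCausalLM.from_pretrained("sarvamai/sarvam-1", revision=revision)
```

With lm-evaluation-harness, the same pin should work by adding `revision=af02628dcadf4f5ad1c0f71b23cd2e28af466b81` to the model_args string.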

Sshubam self-assigned this on Nov 16, 2024
Sshubam added the question (Further information is requested) label on Nov 16, 2024