
Unable to Reproduce MILU Benchmark Results from Paper #1

Open · abhinand5 opened this issue Nov 12, 2024 · 3 comments
abhinand5 commented Nov 12, 2024

Problem

I'm encountering discrepancies while trying to reproduce the benchmark results reported in the paper (Table 3, page 6) for Indic languages. Using the sarvamai/sarvam-1 model from Hugging Face, I'm observing a consistent gap of roughly 3 percentage points below the reported scores.

Note: I've run additional tests with meta-llama/Llama-3.2-1B-Instruct and was able to successfully reproduce the scores reported in the paper (±0.1%) for this model. This suggests the discrepancy might be specific to the sarvam-1 model rather than a systematic issue with the benchmark implementation.
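For reference, this is roughly how I'm running the evaluation — a sketch only, assuming the MILU tasks are run through lm-evaluation-harness's Python API (the results table below is in that format); the exact model arguments and batch size in my actual runs may differ:

```python
# Sketch of the evaluation call (assumed setup; exact arguments in my runs may differ).
import lm_eval

results = lm_eval.simple_evaluate(
    model="hf",  # Hugging Face transformers backend
    model_args="pretrained=sarvamai/sarvam-1,dtype=bfloat16",
    tasks=["milu"],  # MILU group task (per-language subtasks reported below)
    num_fewshot=5,
    batch_size=8,
)
print(results["results"])
```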

Environment

PyTorch: 2.4.0
CUDA: 12.4.1
OS: Ubuntu 22.04
Transformers: 4.46.2
GPU: A40 48GB

Results from paper

[Image: Table 3 (page 6) from the paper with the reported MILU scores]

Current Results

| Tasks | Version | Filter | n-shot | Metric | Value | Stderr |
|---|---|---|---|---|---|---|
| milu | 0 | none | | acc | 0.2941 | ± 0.0016 |
| - milu_Bengali | 0 | none | 5 | acc | 0.2879 | ± 0.0054 |
| - milu_English | 0 | none | 5 | acc | 0.3275 | ± 0.0040 |
| - milu_Gujarati | 0 | none | 5 | acc | 0.2943 | ± 0.0062 |
| - milu_Hindi | 0 | none | 5 | acc | 0.2946 | ± 0.0037 |
| - milu_Kannada | 0 | none | 5 | acc | 0.2908 | ± 0.0055 |
| - milu_Malayalam | 0 | none | 5 | acc | 0.2739 | ± 0.0065 |
| - milu_Marathi | 0 | none | 5 | acc | 0.2868 | ± 0.0052 |
| - milu_Odia | 0 | none | 5 | acc | 0.2919 | ± 0.0064 |
| - milu_Punjabi | 0 | none | 5 | acc | 0.2837 | ± 0.0068 |
| - milu_Tamil | 0 | none | 5 | acc | 0.2757 | ± 0.0053 |
| - milu_Telugu | 0 | none | 5 | acc | 0.2839 | ± 0.0051 |

| Groups | Version | Filter | n-shot | Metric | Value | Stderr |
|---|---|---|---|---|---|---|
| milu | 0 | none | | acc | 0.2941 | ± 0.0016 |

What I've Tried

  • Multiple runs with different batch sizes
  • Different random seeds
  • Modified generation parameters

None of these attempts have significantly affected the results or closed the performance gap.

Additional Context

  • Model used: sarvamai/sarvam-1
  • The gap of roughly 3 percentage points seems too large to be explained by typical run-to-run variation from seeds or batch sizes
  • Could the difference in few-shot examples be causing this? (I ran the benchmarks more than 10 times, by the way, and the scores were similar every time)

Query

Could you help identify what might be causing this significant discrepancy between my results and those reported in the paper?

Sshubam (Collaborator) commented Nov 16, 2024

We can confirm that the Sarvam-1 model was updated on November 8th, following our evaluation.

Attaching the reference for the same.
[Screenshot: Hugging Face commit history for sarvamai/sarvam-1 showing the November 8 update]

abhinand5 (Author) commented

Alright, that might explain the difference. But a ~3% drop after an update? Which version (commit ID) of the model did you use? I'll rerun it and let you know.

Sshubam (Collaborator) commented Nov 16, 2024

As per the Hugging Face commit history of the model, I think this commit ID points to the older version of the model: af02628dcadf4f5ad1c0f71b23cd2e28af466b81
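If it helps, you can pin that revision when loading the model — a minimal sketch using the `revision` argument in transformers (model ID and commit hash as above):

```python
# Sketch: load the pre-update weights by pinning the revision to the older commit.
from transformers import AutoModelForCausalLM, AutoTokenizer

revision = "af02628dcadf4f5ad1c0f71b23cd2e28af466b81"  # older sarvam-1 commit
tokenizer = AutoTokenizer.from_pretrained("sarvamai/sarvam-1", revision=revision)
model = AutoModelForCausalLM.from_pretrained("sarvamai/sarvam-1", revision=revision)
```

With lm-evaluation-harness, the same pin should work by adding `revision=af02628dcadf4f5ad1c0f71b23cd2e28af466b81` to the model_args string.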

Sshubam self-assigned this on Nov 16, 2024
Sshubam added the question (Further information is requested) label on Nov 16, 2024