Alright that might explain the difference. But a ~3% drop after an update? Which version (commit id) of the model did you use? I'll rerun it and let you know.
As per the Hugging Face commit history of the model, I think this commit ID has the older version of the model: af02628dcadf4f5ad1c0f71b23cd2e28af466b81
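For anyone rerunning this, here is a minimal sketch of pinning the checkpoint to that commit through the `revision` argument of `from_pretrained` in Transformers. The helper names `pinned_kwargs` and `load_pinned` are hypothetical, and the strict 40-character SHA check is my own assumption to avoid silently falling back to a moving branch such as `main`; it is not something the library requires.

```python
import string

MODEL_ID = "sarvamai/sarvam-1"
# Commit ID quoted in this thread as the (presumably) older model version.
OLDER_REVISION = "af02628dcadf4f5ad1c0f71b23cd2e28af466b81"


def pinned_kwargs(revision: str) -> dict:
    """Return from_pretrained kwargs for a full commit SHA (hypothetical helper).

    Rejects branch names and short hashes so the run is tied to one exact commit.
    """
    is_hex = all(c in string.hexdigits for c in revision)
    if len(revision) != 40 or not is_hex:
        raise ValueError(f"expected a full 40-character commit SHA, got {revision!r}")
    return {"revision": revision}


def load_pinned(model_id: str = MODEL_ID, revision: str = OLDER_REVISION):
    """Load tokenizer and model at a fixed Hub commit (downloads weights)."""
    # Imported lazily so the helpers above can be used without transformers installed.
    from transformers import AutoModelForCausalLM, AutoTokenizer

    kwargs = pinned_kwargs(revision)
    tokenizer = AutoTokenizer.from_pretrained(model_id, **kwargs)
    model = AutoModelForCausalLM.from_pretrained(model_id, **kwargs)
    return tokenizer, model
```

Calling `load_pinned()` and then running the same evaluation on both this revision and the current default should show whether the gap comes from a model update rather than from the benchmark code.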
Problem
I'm encountering discrepancies while trying to reproduce the benchmark results reported in the paper (page 6, table 3) for Indic languages. Using the sarvamai/sarvam-1 model from Hugging Face, I'm observing a consistent gap of about -3% in performance metrics.

Note: I've run additional tests with meta-llama/Llama-3.2-1B-Instruct and was able to successfully reproduce the scores reported in the paper (±0.1%) for this model. This suggests the discrepancy might be specific to the sarvam-1 model rather than a systematic issue with the benchmark implementation.
Environment
PyTorch: 2.4.0
CUDA: 12.4.1
OS: Ubuntu 22.04
Transformers: 4.46.2
GPU: A40 48GB
Results from paper
Current Results
What I've Tried
None of these attempts have significantly affected the results or closed the performance gap.
Additional Context
The -3% gap appears too large to be attributed to typical variations in seed or batch size.
Query
Could you help identify what might be causing this significant discrepancy between my results and those reported in the paper?