[EVAL] Clarification on Reproducing DeepSeek R1 Results with do_sampling=True #631
Comments
Hi! You can set generation parameters using a config file, see here at the bottom of the page.
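For anyone landing here later, a minimal sketch of what such a config file might look like for the vllm backend. It is based on the example model configs shipped with lighteval around that time; the model name is illustrative and field names may differ between versions, so check the docs for your release:

```yaml
# Hypothetical lighteval vllm model config; adjust field names to your lighteval version.
model:
  base_params:
    model_args: "pretrained=deepseek-ai/DeepSeek-R1-Distill-Qwen-7B,dtype=bfloat16"
  generation:
    temperature: 0.6   # any temperature > 0 enables sampling
    top_p: 0.95
    seed: 42
```

The config file is then passed in place of the model args on the command line, e.g. something like `lighteval vllm path/to/config.yaml "<task>"` (again, check the docs for the exact invocation in your version).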
@NathanHB Thank you so much for your reply! I still have one question: if I just want to evaluate the model locally with vllm as the backend, how can I use sampling?
The same way: if you set a temperature, it will activate sampling.
Thank you very much! Just one more question: can sampling be performed when using Accelerate?
Yes! But we advise using vllm, especially for bigger models.
Thank you so much! Now I am seeing this warning:
[2025-03-24 09:46:45,253] [ WARNING]: context_size + max_new_tokens=11004 which is greater than self.max_length=8192. Truncating context to 0 tokens. (vllm_model.py:266)
`self.max_length` is modified using
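In case it helps anyone hitting the same truncation warning, a hedged sketch of raising the context window for the vllm backend via the `max_model_length` model argument (the same argument used in the open-r1 README); the model name is illustrative and exact field names may differ between lighteval versions:

```yaml
# Hypothetical sketch: raise the vllm context window so long reasoning prompts are not truncated.
model:
  base_params:
    model_args: "pretrained=deepseek-ai/DeepSeek-R1-Distill-Qwen-7B,dtype=bfloat16,max_model_length=32768"
```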
Thank you so much! If I want to get the AIME 2024 cons@64 results, should I write this in the yaml file?
Yes!
Though @plaguss will know more than me about this :)
@NathanHB Thank you for your feedback! I really appreciate you taking the time to investigate this. Regarding pass@K, I've implemented it as shown here, but the runtime appears unchanged. @plaguss Would you have any insights on how we might better capture/interpret the results?
This works well for me: #618 (comment). I've been able to reproduce the DeepScaleR pass@1 (n=16) results.
@plaguss @NathanHB What do y'all think about supporting the reporting of multiple metrics? E.g. when sampling 16 times you should be able to report pass@1, pass@16, and consensus@16 from the same data. I'd also love more control over sampling so it's easier to implement custom "reranking" setups like Best-of-N.
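Not lighteval's API, but for reference, a small library-agnostic sketch of how pass@1, pass@k, and a majority-vote consensus@k can all be derived from the same n samples per problem (pass@k uses the standard unbiased estimator from the Codex paper; the answer strings and the extraction step are assumed to exist already):

```python
from collections import Counter
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator: n samples drawn, c of them correct."""
    if n - c < k:
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

def consensus_at_k(answers: list[str], reference: str, k: int) -> float:
    """Majority vote over the first k sampled answers, scored against the reference."""
    majority, _ = Counter(answers[:k]).most_common(1)[0]
    return float(majority == reference)

# Hypothetical example: 16 extracted answers for one problem, 5 of them correct.
answers = ["42"] * 5 + ["41"] * 11
reference = "42"
n, c = len(answers), sum(a == reference for a in answers)
print(pass_at_k(n, c, 1))                       # pass@1  = 0.3125
print(pass_at_k(n, c, 16))                      # pass@16 = 1.0
print(consensus_at_k(answers, reference, 16))   # cons@16 = 0.0 (majority answer is "41")
```

Averaging these per-problem scores over the benchmark gives the reported numbers, so once the 16 (or 64) generations are stored, all three metrics can be computed from the same run.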
@rawsh |
Hi @NathanHB, I'm testing the AIME pass@1 (n>1) score with a script like: https://github.com/huggingface/open-r1?tab=readme-ov-file#aime-2024. Could you provide some insight into integrating them all together -- the configs for vllm, open-r1, and lighteval? Thank you.
Evaluation short description
In the technical report of DeepSeek R1, detailed reasoning results are provided for datasets such as Math-500, GPQA, and AIME. It’s exciting to see that lighteval supports all of them—thank you for this fantastic work!
However, while attempting to reproduce these results, I encountered some issues. The report specifies using do_sampling=True, but I couldn’t find clear instructions on how to configure this setting. To address this, I modified the initialization function of generation_config in the transformers package. Does this approach make sense?
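For reference, the standard transformers mechanism for this (without patching the package) is to pass a GenerationConfig to generate; a minimal sketch, with an illustrative model name and hyperparameter values:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer, GenerationConfig

model_name = "deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B"  # illustrative choice
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

# do_sample=True switches generation from greedy decoding to sampling.
gen_config = GenerationConfig(do_sample=True, temperature=0.6, top_p=0.95, max_new_tokens=512)

inputs = tokenizer("Solve: 1 + 1 =", return_tensors="pt")
outputs = model.generate(**inputs, generation_config=gen_config)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```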
Additionally, I noticed that generation_config is initialized six times—three times before loading weights and three times after. Could you clarify the purpose of these multiple initializations?
I’d really appreciate any insights. Thanks again for your work!
@clefourrier @NathanHB @hynky1999
(Not sure who to tag, so I randomly @ some contributors. Please feel free to ignore if this is not relevant to you!)
Evaluation metadata