
[EVAL] Clarification on Reproducing DeepSeek R1 Results with do_sampling=True #631

Louym opened this issue Mar 20, 2025 · 14 comments

Louym commented Mar 20, 2025

Evaluation short description

In the technical report of DeepSeek R1, detailed reasoning results are provided for datasets such as Math-500, GPQA, and AIME. It’s exciting to see that lighteval supports all of them—thank you for this fantastic work!

Image

However, while attempting to reproduce these results, I encountered some issues. The report specifies using do_sampling=True, but I couldn’t find clear instructions on how to configure this setting. To address this, I modified the initialization function of generation_config in the transformers package. Does this approach make sense?

Additionally, I noticed that generation_config is initialized six times—three times before loading weights and three times after. Could you clarify the purpose of these multiple initializations?

I’d really appreciate any insights. Thanks again for your work!
@clefourrier @NathanHB @hynky1999
(Not sure who to tag, so I tagged a few contributors at random. Please feel free to ignore if this is not relevant to you!)

Evaluation metadata

Provide all available

Louym added the new-task label on Mar 20, 2025

NathanHB (Member) commented:

Hi! You can set generation parameters using a config file; see here, at the bottom of the page.
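
For anyone landing here later: a minimal sketch of such a config file, assuming the model-config layout used for the vllm backend around this time (field names such as `generation` and its sub-keys may differ between lighteval versions, so check the docs page linked above):

```yaml
# Hypothetical sketch of a lighteval model config; verify field names against your version's docs.
model:
  base_params:
    model_args: "pretrained=deepseek-ai/DeepSeek-R1-Distill-Qwen-7B,dtype=bfloat16"
  generation:
    temperature: 0.6      # any non-zero temperature enables sampling
    top_p: 0.95
    max_new_tokens: 32768 # reasoning traces are long
```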

Louym (Author) commented Mar 24, 2025

@NathanHB Thank you so much for your reply! I still have one question. If I just want to evaluate the model locally with vllm as the backend, how can I use sampling?

NathanHB (Member) commented:

The same way: if you set a temperature, it will activate sampling.
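
That matches vLLM's own semantics: temperature 0 means greedy decoding, and any positive temperature turns sampling on. A quick illustration against the vLLM API directly (the model name is only an example):

```python
from vllm import LLM, SamplingParams

llm = LLM(model="deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B")

greedy = SamplingParams(temperature=0.0, max_tokens=256)  # deterministic decoding
sampled = SamplingParams(temperature=0.6, top_p=0.95, n=16, max_tokens=256)  # 16 sampled generations per prompt

outputs = llm.generate(["1 + 1 = ?"], sampled)
print(len(outputs[0].outputs))  # 16 candidate completions for the first prompt
```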

Louym (Author) commented Mar 24, 2025

Thank you very much! Just one more question: can sampling be performed when using Accelerate?

NathanHB (Member) commented Mar 24, 2025

Yes! But we advise using vllm, especially for bigger models.

Louym (Author) commented Mar 24, 2025

Thank you so much!
When using vllm, I ran into another issue, since the reasoning benchmarks require many tokens.
I'm wondering where to modify self.max_length.

[2025-03-24 09:46:45,253] [ WARNING]: context_size + max_new_tokens=11004 which is greater than self.max_length=8192. Truncating context to 0 tokens. (vllm_model.py:266)

NathanHB (Member) commented:

self.max_length is set via the max_model_length model argument.
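
For reference, a sketch of where that goes when launching from the CLI with the vllm backend (the task spec is only an example and may be named differently in your version):

```bash
lighteval vllm \
  "pretrained=deepseek-ai/DeepSeek-R1-Distill-Qwen-7B,dtype=bfloat16,max_model_length=32768" \
  "lighteval|aime24|0|0"
```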

Louym (Author) commented Mar 24, 2025

Thank you so much! If I want to get the results of AIME 2024 cons@64, should I write this in the YAML file?
metric_options: # Optional metric arguments
  exact_match@1: 16
  num_samples: 16

NathanHB (Member) commented:

Yes!
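
For later readers, a hedged sketch of how that mapping is usually nested (the metric name here is only illustrative; it has to match the metric the AIME task actually defines, and cons@64 implies 64 samples rather than 16):

```yaml
metric_options:      # Optional metric arguments
  exact_match@1:     # metric name as defined by the task (illustrative)
    num_samples: 64  # 64 generations per problem for a cons@64-style score
```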

NathanHB (Member) commented:

Though @plaguss will know more than me about this :)

Louym (Author) commented Mar 24, 2025

@NathanHB Thank you for your feedback! I really appreciate you taking the time to investigate this.

Regarding pass@K, I've implemented it as shown here, but the runtime appears unchanged. @plaguss Would you have any insights on how we might better capture/interpret the results?

> Thank you so much! If I want to get the results of AIME 2024 cons@64, should I write this in the YAML file? metric_options: # Optional metric arguments exact_match@1: 16 num_samples: 16

rawsh (Contributor) commented Mar 26, 2025

This works well for me: #618 (comment)

I've been able to reproduce the DeepScaleR pass@1 n=16 results:

Image

@plaguss @NathanHB Wonder what y'all think about support for reporting multiple metrics? E.g. when sampling 16 times you should be able to report pass@1, pass@16, and consensus@16 from the same data. Would also love more control over sampling so it's easier to implement custom "reranking" setups like Best-of-N.
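
In the meantime, all three can be computed offline from the same n samples once you have per-sample correctness and extracted answers; a small standalone sketch of the standard estimators (not lighteval API):

```python
from collections import Counter
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator (Chen et al., 2021) given n samples, c of them correct."""
    if n - c < k:
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

def consensus_at_n(answers: list[str], reference: str) -> float:
    """cons@n / maj@n: 1.0 if the most frequent extracted answer matches the reference."""
    most_common_answer, _ = Counter(answers).most_common(1)[0]
    return 1.0 if most_common_answer == reference else 0.0

# Example: one problem, 16 samples, 5 of which are correct.
print(pass_at_k(n=16, c=5, k=1))   # pass@1  -> 0.3125
print(pass_at_k(n=16, c=5, k=16))  # pass@16 -> 1.0 (at least one sample is correct)
print(consensus_at_n(["72"] * 9 + ["70"] * 4 + ["71"] * 3, reference="72"))  # cons@16 -> 1.0
```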

ZepinLi commented Mar 31, 2025

@rawsh
I'm also working on reproducing the DeepScaleR pass@1 n=16 results. Could you please share the detailed pipeline for using lighteval to do this? It seems that lighteval uses n=1 by default.

ZepinLi commented Mar 31, 2025

Hi @NathanHB, I'm testing the AIME pass@1 (n>1) score with a script like this: https://github.com/huggingface/open-r1?tab=readme-ov-file#aime-2024. Could you provide some insights on integrating them all together -- the configs of vllm, open-r1, and lighteval?

Thank you.
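
(For later readers: the pipeline in that README boils down to a single lighteval vllm call with the sampling parameters baked into the model args. A sketch modeled on it follows; check the open-r1 repo for the current flags, task name, and custom-tasks path, since these change between releases.)

```bash
MODEL=deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B
# generation_parameters syntax as used in the open-r1 README at the time; verify against your lighteval version
MODEL_ARGS="pretrained=$MODEL,dtype=bfloat16,max_model_length=32768,generation_parameters={max_new_tokens:32768,temperature:0.6,top_p:0.95}"

lighteval vllm "$MODEL_ARGS" "custom|aime24|0|0" \
    --custom-tasks src/open_r1/evaluate.py \
    --use-chat-template \
    --output-dir data/evals/$MODEL
```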
