
[EVAL] Clarification on Reproducing DeepSeek R1 Results with do_sampling=True #631

Louym opened this issue Mar 20, 2025 · 14 comments

Louym commented Mar 20, 2025

Evaluation short description

In the technical report of DeepSeek R1, detailed reasoning results are provided for datasets such as Math-500, GPQA, and AIME. It’s exciting to see that lighteval supports all of them—thank you for this fantastic work!

Image

However, while attempting to reproduce these results, I encountered some issues. The report specifies using do_sampling=True, but I couldn’t find clear instructions on how to configure this setting. To address this, I modified the initialization function of generation_config in the transformers package. Does this approach make sense?

Additionally, I noticed that generation_config is initialized six times—three times before loading weights and three times after. Could you clarify the purpose of these multiple initializations?

I’d really appreciate any insights. Thanks again for your work!
@clefourrier @NathanHB @hynky1999
(Not sure who to tag, so I tagged a few contributors at random. Please feel free to ignore if this is not relevant to you!)

Evaluation metadata

Provide all available

Louym added the new-task label on Mar 20, 2025

NathanHB (Member) commented:

Hi! You can set generation parameters using a config file; see here, at the bottom of the page.
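
For anyone landing here later: a minimal sketch of such a config file, assuming the model-config layout used for the vllm backend around this time (field names such as `generation` and its sub-keys may differ between lighteval versions, so check the docs page linked above):

```yaml
# Hypothetical sketch of a lighteval model config; verify field names against your version's docs.
model:
  base_params:
    model_args: "pretrained=deepseek-ai/DeepSeek-R1-Distill-Qwen-7B,dtype=bfloat16"
  generation:
    temperature: 0.6      # any non-zero temperature enables sampling
    top_p: 0.95
    max_new_tokens: 32768 # reasoning traces are long
```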

Louym (Author) commented Mar 24, 2025

@NathanHB Thank you so much for your reply! I still have one question. If I just want to evaluate the model locally with vllm as the backend, how can I use sampling?

NathanHB (Member) commented:

The same way: if you set a temperature, it will activate sampling.
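
That matches vLLM's own semantics: temperature 0 means greedy decoding, and any positive temperature turns sampling on. A quick illustration against the vLLM API directly (the model name is only an example):

```python
from vllm import LLM, SamplingParams

llm = LLM(model="deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B")

greedy = SamplingParams(temperature=0.0, max_tokens=256)  # deterministic decoding
sampled = SamplingParams(temperature=0.6, top_p=0.95, n=16, max_tokens=256)  # 16 sampled generations per prompt

outputs = llm.generate(["1 + 1 = ?"], sampled)
print(len(outputs[0].outputs))  # 16 candidate completions for the first prompt
```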

Louym (Author) commented Mar 24, 2025

Thank you very much! Just one more question: can sampling be performed when using Accelerate?

NathanHB (Member) commented Mar 24, 2025

Yes! But we advise using vllm, especially for bigger models.

Louym (Author) commented Mar 24, 2025

Thank you so much!
When using vllm, I ran into another issue, since the reasoning benchmarks require many tokens.
I'm wondering where to modify self.max_length.

[2025-03-24 09:46:45,253] [ WARNING]: context_size + max_new_tokens=11004 which is greater than self.max_length=8192. Truncating context to 0 tokens. (vllm_model.py:266)

NathanHB (Member) commented:

self.max_length is set via the max_model_length model argument.
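
For reference, a sketch of where that goes when launching from the CLI with the vllm backend (the task spec is only an example and may be named differently in your version):

```bash
lighteval vllm \
  "pretrained=deepseek-ai/DeepSeek-R1-Distill-Qwen-7B,dtype=bfloat16,max_model_length=32768" \
  "lighteval|aime24|0|0"
```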

Louym (Author) commented Mar 24, 2025

Thank you so much! If I want to get the results of AIME 2024 cons@64, should I write this in the YAML file?
metric_options: # Optional metric arguments
  exact_match@1: 16
  num_samples: 16

NathanHB (Member) commented:

Yes!
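
For later readers, a hedged sketch of how that mapping is usually nested (the metric name here is only illustrative; it has to match the metric the AIME task actually defines, and cons@64 implies 64 samples rather than 16):

```yaml
metric_options:      # Optional metric arguments
  exact_match@1:     # metric name as defined by the task (illustrative)
    num_samples: 64  # 64 generations per problem for a cons@64-style score
```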

NathanHB (Member) commented:

Though @plaguss will know more than me about this :)

Louym (Author) commented Mar 24, 2025

@NathanHB Thank you for your feedback! I really appreciate you taking the time to investigate this.

Regarding pass@K, I've implemented it as shown here, but the runtime appears unchanged. @plaguss Would you have any insights on how we might better capture/interpret the results?

> Thank you so much! If I want to get the results of AIME 2024 cons@64, should I write this in the YAML file? metric_options: # Optional metric arguments exact_match@1: 16 num_samples: 16

rawsh (Contributor) commented Mar 26, 2025

This works well for me: #618 (comment)

I've been able to reproduce the DeepScaleR pass@1 n=16 results:

Image

@plaguss @NathanHB Wonder what y'all think about support for reporting multiple metrics? E.g. when sampling 16 times you should be able to report pass@1, pass@16, and consensus@16 from the same data. Would also love more control over sampling so it's easier to implement custom "reranking" setups like Best-of-N.
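
In the meantime, all three can be computed offline from the same n samples once you have per-sample correctness and extracted answers; a small standalone sketch of the standard estimators (not lighteval API):

```python
from collections import Counter
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator (Chen et al., 2021) given n samples, c of them correct."""
    if n - c < k:
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

def consensus_at_n(answers: list[str], reference: str) -> float:
    """cons@n / maj@n: 1.0 if the most frequent extracted answer matches the reference."""
    most_common_answer, _ = Counter(answers).most_common(1)[0]
    return 1.0 if most_common_answer == reference else 0.0

# Example: one problem, 16 samples, 5 of which are correct.
print(pass_at_k(n=16, c=5, k=1))   # pass@1  -> 0.3125
print(pass_at_k(n=16, c=5, k=16))  # pass@16 -> 1.0 (at least one sample is correct)
print(consensus_at_n(["72"] * 9 + ["70"] * 4 + ["71"] * 3, reference="72"))  # cons@16 -> 1.0
```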

ZepinLi commented Mar 31, 2025

@rawsh
I'm also working on reproducing the DeepScaleR pass@1 n=16 results. Could you please share the detailed pipeline for using lighteval to do this? It seems that lighteval uses n=1 by default.

ZepinLi commented Mar 31, 2025

Hi @NathanHB, I'm testing the AIME pass@1 (n>1) score with a script like this: https://github.com/huggingface/open-r1?tab=readme-ov-file#aime-2024. Could you provide some insights on integrating them all together -- the configs of vllm, open-r1, and lighteval?

Thank you.
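
(For later readers: the pipeline in that README boils down to a single lighteval vllm call with the sampling parameters baked into the model args. A sketch modeled on it follows; check the open-r1 repo for the current flags, task name, and custom-tasks path, since these change between releases.)

```bash
MODEL=deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B
# generation_parameters syntax as used in the open-r1 README at the time; verify against your lighteval version
MODEL_ARGS="pretrained=$MODEL,dtype=bfloat16,max_model_length=32768,generation_parameters={max_new_tokens:32768,temperature:0.6,top_p:0.95}"

lighteval vllm "$MODEL_ARGS" "custom|aime24|0|0" \
    --custom-tasks src/open_r1/evaluate.py \
    --use-chat-template \
    --output-dir data/evals/$MODEL
```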
