Hi, thanks for your great work. However, I am confused about the data-collection code.
As you mentioned:
- Download policy data (positive samples) for training 1st policy model (Llama3-8b-Instruct): [Hugging Face]
- Download PRM data (positive and negative samples) for training 1st reward model (Mistral-7B: MetaMATH): [Hugging Face]
How can I generate these two datasets with your code? Are they produced by scripts such as self_train/generation/generate_both_samples_MATH.py or evaluate.py?
What are the key parameters to change in order to generate these two datasets?