First, thank you for your excellent work! In the paper, you mentioned using a small amount of data for supervised fine-tuning (SFT) warm-up before the RL phase. Could you kindly clarify:
Data Scale: What is the specific data size used for the SFT warm-up phase?
Proportion to RL Data: How large is the warm-up dataset as a percentage of the total data used in the RL training stage?
Impact Analysis: How does a higher or lower SFT-to-RL data ratio affect the final RL evaluation metrics? Did you observe any notable patterns or thresholds in your experiments?
This clarification would greatly help readers understand the relationship between warm-up strategies and RL optimization.