reasoning by GPT-4o + Rule-based Reward + GRPO = reasoning by GPT-4o + SFT
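This is not real RL; it is supervised learning. It is just like image classification, where the reward is the indicator $\mathbb{1}\{\hat{y} = y^*\}$: 1 when the predicted label $\hat{y}$ matches the ground-truth label $y^*$, and 0 otherwise. You can still use an RL optimizer (such as PPO or GRPO) to train the supervised model, but that does not make it real RL.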
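To make the point concrete, here is a minimal sketch of what I mean by a rule-based reward fed into a GRPO-style update. It is my own illustration, not code from this repository: the function names `rule_based_reward` and `group_relative_advantages` are hypothetical, the exact-match rule is an assumption about how the reward is computed, and the advantage is the usual group-standardized reward (here with the population standard deviation; implementations may differ).

```python
from statistics import mean, pstdev


def rule_based_reward(prediction: str, label: str) -> float:
    """Indicator reward: 1.0 if the prediction matches the label, else 0.0.

    Hypothetical exact-match rule; it needs the ground-truth label,
    exactly like a supervised loss does.
    """
    return 1.0 if prediction.strip() == label.strip() else 0.0


def group_relative_advantages(rewards, eps=1e-8):
    """GRPO-style advantages: standardize rewards within one sampled group."""
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std; an assumption of this sketch
    return [(r - mu) / (sigma + eps) for r in rewards]


# One prompt, four sampled answers, one ground-truth label.
samples = ["cat", "dog", "cat", "bird"]
label = "cat"

rewards = [rule_based_reward(s, label) for s in samples]
advantages = group_relative_advantages(rewards)

# Correct samples get a positive advantage, incorrect ones a negative one.
for s, r, a in zip(samples, rewards, advantages):
    print(f"{s!r}: reward={r}, advantage={a:.3f}")
```

The only learning signal in this sketch is whether a sample agrees with the label: samples that match are pushed up and the rest are pushed down, which is the sense in which I call the setup supervised.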