I've been studying SmolLM3's dual-mode training approach and have a technical question about the choice of Anchored Preference Optimization (APO) over Group Relative Policy Optimization (GRPO) for handling reasoning capabilities.
Based on my understanding of both approaches:
- APO (like DPO) works well for general instruction following and can handle reasoning tasks given appropriate preference data, which you generated using Qwen models
- GRPO was designed specifically for mathematical reasoning (with outcome or process supervision) and removes the need for a separate value model, giving it efficiency advantages over PPO-style RLHF (rough sketch of both objectives below)
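To make sure I'm comparing the right objectives, here is my rough sketch of both (my own notation, not taken from the SmolLM3 code; for APO I'm writing the DPO-style contrastive loss it builds on, since I'm not certain which APO variant was used):

$$
\mathcal{L}_{\text{DPO}}(\theta) = -\,\mathbb{E}_{(x,\, y_w,\, y_l)}\left[\log \sigma\!\left(\beta \log \frac{\pi_\theta(y_w \mid x)}{\pi_{\text{ref}}(y_w \mid x)} - \beta \log \frac{\pi_\theta(y_l \mid x)}{\pi_{\text{ref}}(y_l \mid x)}\right)\right]
$$

whereas GRPO samples a group of $G$ completions per prompt, scores each one with a reward $r_i$, and plugs the group-normalized advantage

$$
\hat{A}_i = \frac{r_i - \operatorname{mean}(r_1, \ldots, r_G)}{\operatorname{std}(r_1, \ldots, r_G)}
$$

into a PPO-style clipped policy-gradient objective, with no learned value model. The practical difference I care about is that the first only needs offline preference pairs, while the second needs online rollouts plus a reward signal during training.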
I'm hypothesizing that APO was chosen because:
- It provided a unified alignment approach for both reasoning and non-reasoning modes
- It worked well with your synthetic preference data generation pipeline
- You're treating reasoning as a specialized mode of instruction following rather than a fundamentally different task
- The efficiency gains GRPO offers over PPO might not have outweighed the extra implementation complexity of online rollouts and reward signals, compared with an offline method like APO, for your specific training setup (see the sketch after this list)
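To make the pipeline-fit and complexity point concrete, here is a minimal sketch of the two routes as I understand the TRL API (my own toy example, not the SmolLM3 recipe; the model name, hyperparameters, and reward function are placeholders):

```python
# My own toy sketch (not the SmolLM3 recipe) of how the two setups differ in TRL.
# The model name, hyperparameters, and reward function are placeholders.
from datasets import Dataset
from transformers import AutoModelForCausalLM, AutoTokenizer
from trl import DPOConfig, DPOTrainer, GRPOConfig, GRPOTrainer

MODEL = "HuggingFaceTB/SmolLM2-135M-Instruct"  # placeholder small model
model = AutoModelForCausalLM.from_pretrained(MODEL)
tokenizer = AutoTokenizer.from_pretrained(MODEL)

# APO route: offline preference pairs, same data format and trainer as DPO.
pref_data = Dataset.from_dict({
    "prompt":   ["What is 2 + 2?"],
    "chosen":   ["2 + 2 = 4."],
    "rejected": ["2 + 2 = 5."],
})
apo_trainer = DPOTrainer(
    model=model,
    args=DPOConfig(output_dir="apo-demo", loss_type="apo_zero", beta=0.1),
    processing_class=tokenizer,
    train_dataset=pref_data,
)

# GRPO route: online rollouts scored by a programmatic, verifier-style reward.
def exactness_reward(completions, **kwargs):
    # Toy verifier: reward completions that contain the correct answer.
    return [1.0 if "4" in completion else 0.0 for completion in completions]

prompt_data = Dataset.from_dict({"prompt": ["What is 2 + 2?"]})
grpo_trainer = GRPOTrainer(
    model=MODEL,
    reward_funcs=exactness_reward,
    args=GRPOConfig(output_dir="grpo-demo", num_generations=4),
    train_dataset=prompt_data,
)

# apo_trainer.train() / grpo_trainer.train() would start the respective runs.
```

If I have this roughly right, the APO route reuses exactly the preference-pair format your Qwen-generated data already has, while the GRPO route additionally needs generation during training plus a reward or verifier for each task.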
Could you clarify if I'm on the right track with this understanding? I'm particularly interested in whether you considered GRPO for the reasoning optimization and what factors ultimately led to choosing APO for both modes.
Thank you for sharing these details about SmolLM3's training recipe - the dual-mode approach and training pipeline are fascinating!