I've been studying SmolLM3's dual-mode training approach and have a technical question about the choice of Anchored Preference Optimization (APO) over Group Relative Policy Optimization (GRPO) for handling reasoning capabilities.
Based on my understanding of both approaches:
- APO (like DPO) works well for general instruction following and can handle reasoning tasks given appropriate preference data, which you generated using Qwen models
- GRPO was designed specifically for mathematical reasoning (with outcome or process supervision) and removes the need for a separate value model, giving it efficiency advantages over PPO-style RLHF (rough sketch of both objectives below)
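To make sure I'm comparing the right objectives, here is my rough sketch of both (my own notation, not taken from the SmolLM3 code; for APO I'm writing the DPO-style contrastive loss it builds on, since I'm not certain which APO variant was used):

$$
\mathcal{L}_{\text{DPO}}(\theta) = -\,\mathbb{E}_{(x,\, y_w,\, y_l)}\left[\log \sigma\!\left(\beta \log \frac{\pi_\theta(y_w \mid x)}{\pi_{\text{ref}}(y_w \mid x)} - \beta \log \frac{\pi_\theta(y_l \mid x)}{\pi_{\text{ref}}(y_l \mid x)}\right)\right]
$$

whereas GRPO samples a group of $G$ completions per prompt, scores each one with a reward $r_i$, and plugs the group-normalized advantage

$$
\hat{A}_i = \frac{r_i - \operatorname{mean}(r_1, \ldots, r_G)}{\operatorname{std}(r_1, \ldots, r_G)}
$$

into a PPO-style clipped policy-gradient objective, with no learned value model. The practical difference I care about is that the first only needs offline preference pairs, while the second needs online rollouts plus a reward signal during training.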
I'm hypothesizing that APO was chosen because:
- It provided a unified alignment approach for both reasoning and non-reasoning modes
- It worked well with your synthetic preference data generation pipeline
- You're treating reasoning as a specialized mode of instruction following rather than a fundamentally different task
- The efficiency gains GRPO offers over PPO might not have outweighed the extra implementation complexity of online rollouts and reward signals, compared with an offline method like APO, for your specific training setup (see the sketch after this list)
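To make the pipeline-fit and complexity point concrete, here is a minimal sketch of the two routes as I understand the TRL API (my own toy example, not the SmolLM3 recipe; the model name, hyperparameters, and reward function are placeholders):

```python
# My own toy sketch (not the SmolLM3 recipe) of how the two setups differ in TRL.
# The model name, hyperparameters, and reward function are placeholders.
from datasets import Dataset
from transformers import AutoModelForCausalLM, AutoTokenizer
from trl import DPOConfig, DPOTrainer, GRPOConfig, GRPOTrainer

MODEL = "HuggingFaceTB/SmolLM2-135M-Instruct"  # placeholder small model
model = AutoModelForCausalLM.from_pretrained(MODEL)
tokenizer = AutoTokenizer.from_pretrained(MODEL)

# APO route: offline preference pairs, same data format and trainer as DPO.
pref_data = Dataset.from_dict({
    "prompt":   ["What is 2 + 2?"],
    "chosen":   ["2 + 2 = 4."],
    "rejected": ["2 + 2 = 5."],
})
apo_trainer = DPOTrainer(
    model=model,
    args=DPOConfig(output_dir="apo-demo", loss_type="apo_zero", beta=0.1),
    processing_class=tokenizer,
    train_dataset=pref_data,
)

# GRPO route: online rollouts scored by a programmatic, verifier-style reward.
def exactness_reward(completions, **kwargs):
    # Toy verifier: reward completions that contain the correct answer.
    return [1.0 if "4" in completion else 0.0 for completion in completions]

prompt_data = Dataset.from_dict({"prompt": ["What is 2 + 2?"]})
grpo_trainer = GRPOTrainer(
    model=MODEL,
    reward_funcs=exactness_reward,
    args=GRPOConfig(output_dir="grpo-demo", num_generations=4),
    train_dataset=prompt_data,
)

# apo_trainer.train() / grpo_trainer.train() would start the respective runs.
```

If I have this roughly right, the APO route reuses exactly the preference-pair format your Qwen-generated data already has, while the GRPO route additionally needs generation during training plus a reward or verifier for each task.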
Could you clarify if I'm on the right track with this understanding? I'm particularly interested in whether you considered GRPO for the reasoning optimization and what factors ultimately led to choosing APO for both modes.
Thank you for sharing these details about SmolLM3's training recipe - the dual-mode approach and training pipeline are fascinating!