
Question about APO vs. GRPO choice for SmolLM3's reasoning capabilities #2959

@JenWei0312

Description

I've been studying SmolLM3's dual-mode training approach and have a technical question about the choice of Anchored Preference Optimization (APO) over Group Relative Policy Optimization (GRPO) for handling reasoning capabilities.

Based on my understanding of both approaches:

  1. APO (like DPO) works well for general instruction following and can handle reasoning tasks given appropriate preference data, which you generated using Qwen models
  2. GRPO was specifically designed for mathematical reasoning with verifiable/process-level rewards and eliminates the need for a value model, potentially offering computational efficiency advantages (I've sketched my reading of both objectives just below)
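
To make the contrast concrete, here is a minimal PyTorch sketch of how I currently read the two objectives. The function names `apo_zero_loss` and `grpo_advantages` are my own illustrative choices, not the actual SmolLM3/TRL implementation, and the exact APO variant and hyperparameters you used may well differ:

```python
import torch

def apo_zero_loss(logp_chosen, logp_rejected,
                  ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """APO-zero-style preference loss (my reading of the paper; the exact
    variant used for SmolLM3 may differ). Unlike DPO, the chosen and
    rejected log-ratios are anchored separately rather than only through
    their margin."""
    h_w = beta * (logp_chosen - ref_logp_chosen)      # chosen log-ratio vs. reference
    h_l = beta * (logp_rejected - ref_logp_rejected)  # rejected log-ratio vs. reference
    # push the chosen ratio up and the rejected ratio down independently
    return (1 - torch.sigmoid(h_w) + torch.sigmoid(h_l)).mean()

def grpo_advantages(rewards):
    """GRPO's group-relative advantage: rewards for G samples of the SAME
    prompt are normalized within the group, which replaces a learned
    value/critic model."""
    return (rewards - rewards.mean()) / (rewards.std() + 1e-8)

# Toy usage: sequence-level log-probs for one preference pair, and a group
# of 4 sampled completions scored by a rule-based reward (e.g. exact match).
logp_c, logp_r = torch.tensor([-12.3]), torch.tensor([-15.1])
ref_c, ref_r = torch.tensor([-12.0]), torch.tensor([-14.8])
print(apo_zero_loss(logp_c, logp_r, ref_c, ref_r))

group_rewards = torch.tensor([1.0, 0.0, 0.0, 1.0])
print(grpo_advantages(group_rewards))  # which samples get reinforced vs. suppressed
```

If this framing is roughly right, APO only needs offline preference pairs (which fits a synthetic generation pipeline), while GRPO needs online sampling plus a verifiable reward per completion, which seems like the main practical trade-off.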

I'm hypothesizing that APO was chosen because:

  • It provided a unified alignment approach for both reasoning and non-reasoning modes
  • It worked well with your synthetic preference data generation pipeline
  • You're treating reasoning as a specialized mode of instruction following rather than a fundamentally different task
  • The computational benefits of GRPO might not have outweighed the implementation complexity for your specific training setup

Could you clarify whether I'm on the right track with this understanding? I'm particularly interested in whether you considered GRPO for the reasoning optimization and what factors ultimately led to choosing APO for both modes.

Thank you for sharing these details about SmolLM3's training recipe - the dual-mode approach and training pipeline are fascinating!
