
@alyosha-swamy

Summary

This PR adds support for the AFMoE (Arcee Foundational Mixture of Experts) model architecture for the upcoming Trinity-Mini and Trinity-Nano releases. AFMoE is a decoder-only transformer built around a sparse Mixture of Experts (MoE) design that combines token-choice routing with shared experts, together with several architectural choices aimed at efficient inference and improved performance.

Model Description

AFMoE features the following key architectural components:

  • Mixture of Experts with Shared Experts: Combines routed experts (activated per token via learned routing) with always-active shared experts for stable base computation (see the first sketch after this list)

  • Token-Choice Routing: Uses sigmoid- or softmax-based routing with normalization and scaling for expert selection

  • Q/K Normalization and Gating: Applies RMSNorm to the query and key projections and uses sigmoid gating on attention outputs for improved training stability (see the second sketch after this list)

  • Hybrid Attention Patterns: Alternates between sliding window attention and full attention across layers for efficiency with long contexts

  • Dual Normalization: Uses pre- and post-normalization around both attention and MLP blocks for training stability

  • Configurable Dense Layers: Allows initial layers to use dense MLPs before transitioning to sparse MoE layers (num_dense_layers)
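
To make the MoE design concrete, here is a minimal sketch of token-choice routing combined with an always-active shared expert. All names (TokenChoiceMoESketch, num_experts_per_tok, the sigmoid routing choice, the SiLU MLP) are illustrative assumptions rather than the actual AfmoeMoE/AfmoeTokenChoiceRouter code, and load balancing, capacity handling, and routed-score scaling are omitted:

```python
import torch
import torch.nn as nn


class TokenChoiceMoESketch(nn.Module):
    """Token-choice routing plus an always-active shared expert (illustration only)."""

    def __init__(self, hidden_size, intermediate_size, num_experts, num_experts_per_tok):
        super().__init__()
        self.num_experts_per_tok = num_experts_per_tok
        # The router scores every expert for every token.
        self.router = nn.Linear(hidden_size, num_experts, bias=False)
        self.experts = nn.ModuleList(
            [self._mlp(hidden_size, intermediate_size) for _ in range(num_experts)]
        )
        # The shared expert runs on every token and provides a stable base computation.
        self.shared_expert = self._mlp(hidden_size, intermediate_size)

    @staticmethod
    def _mlp(hidden_size, intermediate_size):
        return nn.Sequential(
            nn.Linear(hidden_size, intermediate_size, bias=False),
            nn.SiLU(),
            nn.Linear(intermediate_size, hidden_size, bias=False),
        )

    def forward(self, hidden_states):
        batch, seq_len, hidden = hidden_states.shape
        flat = hidden_states.reshape(-1, hidden)

        # Token-choice routing: each token selects its top-k experts.
        scores = torch.sigmoid(self.router(flat))  # sigmoid routing; softmax is the other option
        topk_scores, topk_idx = scores.topk(self.num_experts_per_tok, dim=-1)
        topk_scores = topk_scores / topk_scores.sum(dim=-1, keepdim=True)  # normalize selected weights

        routed = torch.zeros_like(flat)
        for expert_id, expert in enumerate(self.experts):
            rows, slots = (topk_idx == expert_id).nonzero(as_tuple=True)
            if rows.numel() == 0:
                continue
            routed[rows] += topk_scores[rows, slots].unsqueeze(-1) * expert(flat[rows])

        # The routed output and the shared-expert output are summed.
        return (routed + self.shared_expert(flat)).reshape(batch, seq_len, hidden)
```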

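Along the same lines, the Q/K normalization and attention-output gating can be sketched as below. This is an illustration under stated assumptions: it uses PyTorch's built-in nn.RMSNorm (PyTorch ≥ 2.4) and scaled_dot_product_attention instead of the PR's AfmoeRMSNorm, omits RoPE, KV caching, grouped-query attention, and the sliding-window mask, and guesses that the gate is applied before the output projection:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class GatedQKNormAttentionSketch(nn.Module):
    """RMSNorm on Q/K projections plus a sigmoid gate on the attention output (illustration only)."""

    def __init__(self, hidden_size, num_heads, eps=1e-6):
        super().__init__()
        self.num_heads = num_heads
        self.head_dim = hidden_size // num_heads
        self.q_proj = nn.Linear(hidden_size, hidden_size, bias=False)
        self.k_proj = nn.Linear(hidden_size, hidden_size, bias=False)
        self.v_proj = nn.Linear(hidden_size, hidden_size, bias=False)
        self.gate_proj = nn.Linear(hidden_size, hidden_size, bias=False)
        self.o_proj = nn.Linear(hidden_size, hidden_size, bias=False)
        # Per-head RMSNorm on queries and keys for training stability.
        self.q_norm = nn.RMSNorm(self.head_dim, eps=eps)
        self.k_norm = nn.RMSNorm(self.head_dim, eps=eps)

    def forward(self, hidden_states, attention_mask=None):
        b, t, _ = hidden_states.shape
        shape = (b, t, self.num_heads, self.head_dim)
        q = self.q_norm(self.q_proj(hidden_states).view(shape)).transpose(1, 2)
        k = self.k_norm(self.k_proj(hidden_states).view(shape)).transpose(1, 2)
        v = self.v_proj(hidden_states).view(shape).transpose(1, 2)

        # Causal attention when no explicit mask is given; RoPE and sliding windows omitted.
        attn = F.scaled_dot_product_attention(
            q, k, v, attn_mask=attention_mask, is_causal=attention_mask is None
        )
        attn = attn.transpose(1, 2).reshape(b, t, -1)

        # Sigmoid gating on the attention output before the final projection.
        gate = torch.sigmoid(self.gate_proj(hidden_states))
        return self.o_proj(gate * attn)
```
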
Implementation Details

  • Modular implementation built on the transformers modular architecture:

    • Efficient AfmoeRMSNorm for layer normalization

    • AfmoeRotaryEmbedding for positional encoding

    • AfmoeAttention class implementing Q/K normalization and output gating

    • AfmoeTokenChoiceRouter for expert selection

    • AfmoeMoE class implementing the shared + routed experts architecture

    • AfmoeDecoderLayer integrating the attention and MoE blocks with dual normalization (see the sketch after this list)

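For orientation, the dual (pre + post) normalization layout around the attention and MLP/MoE blocks can be sketched roughly as follows; the module names and the use of PyTorch's nn.RMSNorm are illustrative assumptions, not the actual AfmoeDecoderLayer:

```python
import torch.nn as nn


class DualNormDecoderLayerSketch(nn.Module):
    """Pre- and post-normalization around both the attention and MLP/MoE blocks (illustration only)."""

    def __init__(self, hidden_size, attention, mlp, eps=1e-6):
        super().__init__()
        self.attention = attention  # e.g. an attention module with Q/K norm and output gating
        self.mlp = mlp              # dense MLP in the first num_dense_layers, MoE afterwards
        self.pre_attn_norm = nn.RMSNorm(hidden_size, eps=eps)
        self.post_attn_norm = nn.RMSNorm(hidden_size, eps=eps)
        self.pre_mlp_norm = nn.RMSNorm(hidden_size, eps=eps)
        self.post_mlp_norm = nn.RMSNorm(hidden_size, eps=eps)

    def forward(self, hidden_states, attention_mask=None):
        # Attention block: normalize before and after, then add the residual.
        residual = hidden_states
        hidden_states = self.post_attn_norm(self.attention(self.pre_attn_norm(hidden_states), attention_mask))
        hidden_states = residual + hidden_states

        # MLP / MoE block with the same pre/post normalization pattern.
        residual = hidden_states
        hidden_states = self.post_mlp_norm(self.mlp(self.pre_mlp_norm(hidden_states)))
        return residual + hidden_states
```

As a purely illustrative composition, DualNormDecoderLayerSketch(hidden_size, GatedQKNormAttentionSketch(hidden_size, num_heads), TokenChoiceMoESketch(hidden_size, intermediate_size, num_experts, top_k)) wires the earlier sketches into a single layer.
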
Testing

  • Added a comprehensive test suite following standard transformers test patterns
  • Tests for core functionality:
    • Model initialization and weight loading
    • Forward and backward passes
    • Attention mechanism (sliding window + full attention patterns)
    • MoE routing and expert selection
    • RoPE embeddings
    • KV cache compatibility
  • Integration tests with example checkpoints
  • Verified compatibility with existing transformers infrastructure
  • Model loading and inference verified with arcee-ai/Trinity-Mini

Documentation

  • Comprehensive model documentation in docs/source/en/model_doc/afmoe.md
  • Detailed architecture descriptions and usage examples
  • All configuration parameters documented with clear descriptions
  • Example code for both Pipeline and AutoModel usage patterns (an illustrative AutoModel example is sketched below)
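
For context, a minimal AutoModel-style snippet in the spirit of the documented usage patterns might look like the following; the exact example in afmoe.md may differ, and the prompt and generation settings here are placeholders:

```python
# Illustrative only: this mirrors the standard transformers AutoModel pattern;
# see docs/source/en/model_doc/afmoe.md for the authoritative example.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "arcee-ai/Trinity-Mini"  # checkpoint named in this PR description
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,
    device_map="auto",
)

inputs = tokenizer("Mixture-of-experts models are efficient because", return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=64)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```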

@alyosha-swamy force-pushed the add_afmoe_model branch 4 times, most recently from 6b08d17 to e3ad5e9 on November 12, 2025 at 19:23
@ArthurZucker (Collaborator) left a comment:

nice work!

@github-actions (Contributor) commented:

[For maintainers] Suggested jobs to run (before merge)

run-slow: afmoe, auto
