
Support gpt-oss mxfp4 format qat #3547

@shelterwff-byte

Description


🔖 Feature description

Support the MXFP4 (microscaling) format for QAT and post-training quantization via torchao / NVIDIA Model Optimizer.

Currently, Axolotl users attempting to use 4-bit floating-point formats may run into hardware-specific constraints (e.g., NVFP4, which is exclusive to Blackwell, sm100). This feature request proposes adding support for MXFP4 (E2M1), a hardware-agnostic OCP standard that is usable on NVIDIA Hopper (H100/H800) and can be emulated efficiently on Ampere.

Implementing MXFP4 QAT will allow:

  1. Higher training stability compared to INT4/FP4.
  2. Better post-training weight compression for LLMs like gpt-oss.
  3. Alignment with NVIDIA's model-optimizer and torchao roadmaps.

✔️ Solution

Integrate torchao.quantization.quantize_ with MXFP4-specific configs, or use NVIDIA's modelopt (Model Optimizer) workflow within Axolotl's quantization CLI.

Key components:

  • Add mxfp4 as a valid option for quantization.weight_dtype in the YAML config.
  • Implement the MXFP4 fake-quantization logic in axolotl.utils.quantization during the QAT phase.
  • Ensure compatibility with torchao's MX format implementations (specifically mx_fp4); a minimal dispatch sketch follows below.
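
To make the proposed wiring concrete, here is a minimal sketch of how axolotl.utils.quantization could dispatch a new `mxfp4` value of `quantization.weight_dtype` to torchao. Only `torchao.quantization.quantize_` is taken as given; the `MXLinearConfig` import path and its `elem_dtype`/`block_size` keyword names are assumptions about torchao's prototype MX-format API, which has moved between releases.

```python
import torch
from torchao.quantization import quantize_


def apply_mxfp4_qat(model: torch.nn.Module, block_size: int = 32) -> torch.nn.Module:
    """Prepare a model for MXFP4 fake-quantized training via torchao.

    ``MXLinearConfig`` and its keyword names are assumptions about torchao's
    prototype MX-format surface; adjust to the installed torchao version.
    """
    try:
        # Assumed location of torchao's MX training config (prototype namespace).
        from torchao.prototype.mx_formats import MXLinearConfig
    except ImportError as err:
        raise RuntimeError(
            "mxfp4 QAT requires a torchao build with MX format support"
        ) from err

    # fp4 (E2M1) elements with a shared scale per block of 32 values, per the
    # OCP MXFP4 definition; the keyword names here are assumptions.
    config = MXLinearConfig(elem_dtype="fp4_e2m1", block_size=block_size)
    quantize_(model, config)  # swaps eligible nn.Linear layers in place
    return model
```

A `weight_dtype: mxfp4` value in the YAML config would then simply route to a helper like this during the QAT setup phase.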

References:

❓ Alternatives

Currently, users are forced to use int4_weight_only or fp8: the former lacks the dynamic range of MXFP4, and the latter does not provide the same 4-bit memory savings.

📝 Additional Context

As LLMs like gpt-oss (120B+) grow, 4-bit quantization becomes critical for inference. MXFP4 provides a sweet spot between 8-bit accuracy and 4-bit efficiency by using shared scales across groups of elements (e.g., block size 16 or 32).
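
For intuition on the shared-scale scheme, here is a self-contained sketch of blockwise MXFP4 fake quantization: each block of 32 elements shares a power-of-two (E8M0) scale, and each element is rounded to the nearest representable E2M1 value. This is a simplified illustration of the format, not the exact OCP rounding rules or torchao's kernels, and `fake_quant_mxfp4` is a hypothetical helper name.

```python
import torch

# Representable E2M1 magnitudes (4-bit float: sign + {0, 0.5, 1, 1.5, 2, 3, 4, 6}).
_E2M1_GRID = torch.tensor([0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0])


def fake_quant_mxfp4(x: torch.Tensor, block_size: int = 32) -> torch.Tensor:
    """Quantize-dequantize a 1-D tensor with per-block shared power-of-two scales."""
    assert x.numel() % block_size == 0, "pad to a multiple of the block size"
    blocks = x.reshape(-1, block_size)
    grid = _E2M1_GRID.to(device=x.device, dtype=x.dtype)

    # Shared E8M0 scale: a power of two chosen so the block's largest magnitude
    # lands near the top of the E2M1 range (max representable magnitude 6.0 ~ 2**2).
    amax = blocks.abs().amax(dim=1, keepdim=True).clamp_min(1e-12)
    shared_exp = torch.floor(torch.log2(amax)) - 2.0
    scale = torch.pow(2.0, shared_exp)

    # Round each scaled element to the nearest representable E2M1 magnitude.
    scaled = blocks / scale
    mag = scaled.abs().clamp(max=6.0)
    idx = (mag.unsqueeze(-1) - grid).abs().argmin(dim=-1)
    quantized = torch.sign(scaled) * grid[idx]

    # Dequantize back to the original dtype/shape.
    return (quantized * scale).reshape_as(x)
```

Choosing the scale as 2**(floor(log2(amax)) - 2) maps the block's largest magnitude near E2M1's maximum of 6.0, which is the same intuition behind the shared block exponents in the OCP spec.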

axolotl-ai-cloud/axolotl#3333

