🔖 Feature description
Support MXFP4 (Microscaling) Format for QAT and Post-Training Quantization via torchao/Model-Optimizer.
Currently, Axolotl users attempting to use 4-bit floating-point formats can hit hardware-specific constraints (e.g., errors when requesting `nvfp4`, which is exclusive to Blackwell, sm100). This feature request proposes adding support for MXFP4 (E2M1), a hardware-agnostic OCP Microscaling standard that works on NVIDIA Hopper (H100/H800) and can be emulated efficiently on Ampere.
Implementing MXFP4 QAT will allow:
- Higher training stability compared to INT4/FP4.
- Better post-training weight compression for LLMs like `gpt-oss`.
- Alignment with NVIDIA's `model-optimizer` and `torchao` roadmaps.
✔️ Solution
Integrate `torchao.quantization.quantize_` with MXFP4-specific configs, or use NVIDIA's `modelopt` (Model Optimizer) workflow, within Axolotl's quantization CLI.
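For illustration, the user-facing surface could be as small as one new enum value. The snippet below mirrors the `quantization.weight_dtype` option proposed in the list that follows; treat the exact schema as an assumption, not Axolotl's current API:

```yaml
quantization:
  weight_dtype: mxfp4   # proposed new value (this issue)
  block_size: 32        # assumption: OCP MX shared-scale block size
```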
Key components:
- Add `mxfp4` as a valid option for `quantization.weight_dtype` in the YAML config (sketched above).
- Implement the MXFP4 fake-quantization logic in `axolotl.utils.quantization` during the QAT phase (a self-contained sketch follows this list).
- Ensure compatibility with `torchao`'s MX format implementations (specifically `mx_fp4`).
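For the fake-quantization step, here is a minimal self-contained sketch. It assumes the OCP MX layout (block size 32, one shared power-of-two scale per block, E2M1 element grid) and a straight-through estimator for gradients; the function names are illustrative, and torchao's actual MX kernels differ in details such as rounding tie-breaks and scale clamping.

```python
import torch

# All non-negative magnitudes representable in FP4 E2M1
# (1 sign bit, 2 exponent bits, 1 mantissa bit).
E2M1_GRID = torch.tensor([0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0])

def mxfp4_fake_quant(w: torch.Tensor, block_size: int = 32) -> torch.Tensor:
    """Simulate MXFP4: per-block power-of-two (E8M0-style) scale + E2M1 elements.

    Assumes w.numel() is divisible by block_size; real code would pad.
    """
    blocks = w.reshape(-1, block_size)

    # Shared scale per block: pick 2^e so the block max lands near
    # E2M1's largest magnitude, 6.0 = 1.5 * 2^2.
    amax = blocks.abs().amax(dim=-1, keepdim=True).clamp(min=2.0**-126)
    scale = torch.exp2(torch.floor(torch.log2(amax)) - 2)

    # Snap each scaled element to the nearest representable E2M1 magnitude,
    # then restore the sign and the block scale. (Ties here break toward the
    # smaller grid value, not round-to-nearest-even as real hardware would.)
    scaled = (blocks / scale).clamp(-6.0, 6.0)
    grid = E2M1_GRID.to(device=w.device, dtype=w.dtype)
    idx = (scaled.abs().unsqueeze(-1) - grid).abs().argmin(dim=-1)
    deq = grid[idx] * scaled.sign() * scale
    return deq.reshape(w.shape)

def mxfp4_fake_quant_ste(w: torch.Tensor, block_size: int = 32) -> torch.Tensor:
    """QAT forward pass: quantized values forward, identity gradient backward."""
    return w + (mxfp4_fake_quant(w, block_size) - w).detach()
```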
❓ Alternatives
Currently, users are forced to use `int4_weight_only` or `fp8`: the former lacks MXFP4's floating-point dynamic range, and the latter doesn't provide the same 4-bit memory savings.
📝 Additional Context
As LLMs like `gpt-oss` (120B+) grow, 4-bit quantization becomes critical for inference. MXFP4 hits a sweet spot between 8-bit accuracy and 4-bit efficiency by sharing a single power-of-two scale across each block of elements (the OCP MX spec fixes the block size at 32, with an E8M0 scale).
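The storage math behind that sweet spot: at the OCP block size of 32, each block stores 4 bits per element plus one shared 8-bit E8M0 scale.

```python
# MXFP4 effective storage cost at the OCP block size of 32:
bits_per_element = (32 * 4 + 8) / 32          # 4.25 bits/element
compression_vs_bf16 = 16 / bits_per_element   # ~3.76x smaller than bf16 weights
```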