Describe the bug
I quantized an MoE model to FP8-dynamic following https://docs.vllm.ai/en/latest/features/quantization/fp8.html, but vLLM 0.7.3 cannot load it:
```
ValueError: For FP8 Fused MoE layers, only per-tensor scales for weights and activations are supported. Found num_bits=8 type='float' symmetric=True group_size=None strategy='channel' block_structure=None dynamic=False actorder=None observer='minmax' observer_kwargs={}, num_bits=8 type='float' symmetric=True group_size=None strategy='token' block_structure=None dynamic=True actorder=None observer=None observer_kwargs={}
```
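For reference, the linked guide produces a checkpoint like this via llm-compressor's `FP8_DYNAMIC` preset, which uses per-channel weight scales and per-token dynamic activation scales — exactly the `strategy='channel'` / `strategy='token'` pair in the error. A minimal sketch of that quantization step, assuming a recent llm-compressor; the model ID is a placeholder and import paths may differ across versions:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer
from llmcompressor.transformers import oneshot
from llmcompressor.modifiers.quantization import QuantizationModifier

MODEL_ID = "mistralai/Mixtral-8x7B-Instruct-v0.1"  # placeholder MoE model

model = AutoModelForCausalLM.from_pretrained(MODEL_ID, torch_dtype="auto")
tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)

# FP8_DYNAMIC: per-channel weight scales + per-token dynamic activation scales.
# For MoE models, router/gate modules are often added to `ignore` as well.
recipe = QuantizationModifier(
    targets="Linear", scheme="FP8_DYNAMIC", ignore=["lm_head"])

# Dynamic activation scales need no calibration data.
oneshot(model=model, recipe=recipe)

SAVE_DIR = MODEL_ID.split("/")[-1] + "-FP8-Dynamic"
model.save_pretrained(SAVE_DIR)
tokenizer.save_pretrained(SAVE_DIR)
```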
Hi @shuxiaobo, vLLM only supports a subset of all possible compression configurations, particularly for MoE layers. The latest version (0.8.4) should have better support, but I'm not sure it covers this particular use case. If not, you can switch to strategy="tensor" instead of "channel" / "token", or open a feature request at https://github.com/vllm-project/vllm.
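If per-tensor scales are acceptable for your accuracy target, one way to follow the suggestion above is llm-compressor's `FP8` preset (static per-tensor scales for weights and activations, i.e. strategy="tensor"). Unlike `FP8_DYNAMIC`, static activation scales require calibration data. A sketch under the same assumptions as above; the calibration dataset and sample counts are placeholders:

```python
from llmcompressor.transformers import oneshot
from llmcompressor.modifiers.quantization import QuantizationModifier

# "FP8" preset: static per-tensor scales for weights and activations,
# matching the per-tensor requirement in the fused-MoE error above.
recipe = QuantizationModifier(
    targets="Linear", scheme="FP8", ignore=["lm_head"])

oneshot(
    model=model,              # same model object as in the sketch above
    dataset="open_platypus",  # small calibration set for activation scales
    recipe=recipe,
    max_seq_length=2048,
    num_calibration_samples=512,
)
```

Re-saving with `save_pretrained` as above should then yield a checkpoint that vLLM 0.7.3 can load through its fused-MoE FP8 path.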