Describe the bug
I quantized an MoE model to FP8-dynamic following https://docs.vllm.ai/en/latest/features/quantization/fp8.html, but vLLM 0.7.3 cannot load it:
```
ValueError: For FP8 Fused MoE layers, only per-tensor scales for weights and activations are supported. Found num_bits=8 type='float' symmetric=True group_size=None strategy='channel' block_structure=None dynamic=False actorder=None observer='minmax' observer_kwargs={}, num_bits=8 type='float' symmetric=True group_size=None strategy='token' block_structure=None dynamic=True actorder=None observer=None observer_kwargs={}
```
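For reference, the linked guide produces a checkpoint like this via llm-compressor's `FP8_DYNAMIC` preset, which uses per-channel weight scales and per-token dynamic activation scales — exactly the `strategy='channel'` / `strategy='token'` pair in the error. A minimal sketch of that quantization step, assuming a recent llm-compressor; the model ID is a placeholder and import paths may differ across versions:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer
from llmcompressor.transformers import oneshot
from llmcompressor.modifiers.quantization import QuantizationModifier

MODEL_ID = "mistralai/Mixtral-8x7B-Instruct-v0.1"  # placeholder MoE model

model = AutoModelForCausalLM.from_pretrained(MODEL_ID, torch_dtype="auto")
tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)

# FP8_DYNAMIC: per-channel weight scales + per-token dynamic activation scales.
# For MoE models, router/gate modules are often added to `ignore` as well.
recipe = QuantizationModifier(
    targets="Linear", scheme="FP8_DYNAMIC", ignore=["lm_head"])

# Dynamic activation scales need no calibration data.
oneshot(model=model, recipe=recipe)

SAVE_DIR = MODEL_ID.split("/")[-1] + "-FP8-Dynamic"
model.save_pretrained(SAVE_DIR)
tokenizer.save_pretrained(SAVE_DIR)
```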
Hi @shuxiaobo, vLLM only supports a subset of all possible compression configurations, particularly for MoE layers. The latest version (0.8.4) should have better support, but I'm not sure it covers this particular use case. If not, you can switch to strategy="tensor" instead of "channel" / "token", or open a feature request at https://github.com/vllm-project/vllm.
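If per-tensor scales are acceptable for your accuracy target, one way to follow the suggestion above is llm-compressor's `FP8` preset (static per-tensor scales for weights and activations, i.e. strategy="tensor"). Unlike `FP8_DYNAMIC`, static activation scales require calibration data. A sketch under the same assumptions as above; the calibration dataset and sample counts are placeholders:

```python
from llmcompressor.transformers import oneshot
from llmcompressor.modifiers.quantization import QuantizationModifier

# "FP8" preset: static per-tensor scales for weights and activations,
# matching the per-tensor requirement in the fused-MoE error above.
recipe = QuantizationModifier(
    targets="Linear", scheme="FP8", ignore=["lm_head"])

oneshot(
    model=model,              # same model object as in the sketch above
    dataset="open_platypus",  # small calibration set for activation scales
    recipe=recipe,
    max_seq_length=2048,
    num_calibration_samples=512,
)
```

Re-saving with `save_pretrained` as above should then yield a checkpoint that vLLM 0.7.3 can load through its fused-MoE FP8 path.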