[BUG] The dynamically quantized MoE model failed to deploy in vLLM. #1455

Open · liweiqing1997 opened this issue Mar 13, 2025 · 5 comments
Labels: bug (Something isn't working)

Comments

@liweiqing1997

Describe the bug

The bug is a KeyError raised when deploying a quantized DeepSeek-V2-Lite-Chat model with the vLLM framework. The error occurs during weight loading, where the key names in the model's named_parameters() dictionary (params_dict) do not match the key names in the quantized weight file. Specifically:

Key Mismatch:

- The key registered in self.named_parameters() is 'model.layers.21.mlp.experts.w2_qweight'.
- The processed key coming from the quantized weight file is 'model.layers.21.mlp.experts.w2_weight'.

Error Cause:

When loading the weights, the code calls param = params_dict[name], but the name derived from the weight file does not exist in params_dict, resulting in a KeyError.
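
To illustrate the mismatch, here is a small diagnostic sketch (the checkpoint shard path is a placeholder) that lists the expert tensor names actually stored in the quantized checkpoint for the layer above, so they can be compared against the w2_qweight names vLLM registers:

```python
# List the expert-related tensor names stored in the quantized checkpoint
# for layer 21 (the shard path below is a placeholder).
from safetensors import safe_open

ckpt_path = "/path/to/quantized/model.safetensors"  # placeholder path

with safe_open(ckpt_path, framework="pt", device="cpu") as f:
    for key in sorted(f.keys()):
        if "layers.21.mlp.experts" in key:
            print(key)
```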

vllm version:

- vllm-main, commit hash debd6bb
- or https://github.com/ZZBoom/vllm/commits/main/, commit hash fc7c714854f422a7e000bcc9fa31d4f61796a7b6

How can this issue be resolved?

Error stack trace:

```
  File "/mnt/lwq/lwq/quant/vllm/vllm-main-debd6bb/vllm/executor/mp_distributed_executor.py", line 185, in _run_workers
    driver_worker_output = run_method(self.driver_worker, sent_method,
  File "/mnt/lwq/lwq/quant/vllm/vllm-main-debd6bb/vllm/utils.py", line 2238, in run_method
    return func(*args, **kwargs)
  File "/mnt/lwq/lwq/quant/vllm/vllm-main-debd6bb/vllm/worker/worker.py", line 183, in load_model
    self.model_runner.load_model()
  File "/mnt/lwq/lwq/quant/vllm/vllm-main-debd6bb/vllm/worker/model_runner.py", line 1113, in load_model
    self.model = get_model(vllm_config=self.vllm_config)
  File "/mnt/lwq/lwq/quant/vllm/vllm-main-debd6bb/vllm/model_executor/model_loader/__init__.py", line 14, in get_model
    return loader.load_model(vllm_config=vllm_config)
  File "/mnt/lwq/lwq/quant/vllm/vllm-main-debd6bb/vllm/model_executor/model_loader/loader.py", line 426, in load_model
    loaded_weights = model.load_weights(
  File "/mnt/lwq/lwq/quant/vllm/vllm-main-debd6bb/vllm/model_executor/models/deepseek_v2.py", line 790, in load_weights
    param = params_dict[name]
KeyError: 'model.layers.10.mlp.experts.w2_weight'
```
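
The trace above comes from simply loading the quantized model with vLLM; a minimal reproduction sketch (model path is a placeholder) looks roughly like this:

```python
# Minimal vLLM offline load of the quantized checkpoint (path is a placeholder);
# constructing the LLM is enough to reach the weight-loading code that raises the KeyError.
from vllm import LLM

llm = LLM(
    model="/path/to/quantized/DeepSeek-V2-Lite-Chat-gptq",
    trust_remote_code=True,
)
```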

My dynamic quantization settings use the example configuration from the GPTQModel homepage:

```python
dynamic = {
    # .*\. matches the layers_node prefix
    # layer index starts at 0

    # positive match: layer 19, gate module
    r"+:.*\.18\..*gate.*": {"bits": 4, "group_size": 32},

    # positive match: layer 20, gate module (prefix defaults to positive if missing)
    r".*\.19\..*gate.*": {"bits": 8, "group_size": 64},

    # negative match: skip layer 21, gate module
    r"-:.*\.20\..*gate.*": {},

    # negative match: skip all down modules for all layers
    r"-:.*down.*": {},
}
```
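
For context, this dict is passed to GPTQModel roughly as follows (a sketch against the gptqmodel 2.x API; model path, output path, and calibration samples are placeholders):

```python
# Sketch of how the dynamic overrides are applied during quantization
# (paths and calibration samples are placeholders).
from gptqmodel import GPTQModel, QuantizeConfig

calibration_dataset = ["placeholder calibration text 1", "placeholder calibration text 2"]

quant_config = QuantizeConfig(
    bits=8,           # default bits for modules not matched by a dynamic rule
    group_size=64,    # default group size
    dynamic=dynamic,  # per-module overrides defined above
)

model = GPTQModel.load("/mnt/models/deepseek-ai/DeepSeek-V2-Lite-Chat/", quant_config)
model.quantize(calibration_dataset)
model.save("/path/to/quantized/output")  # placeholder output path
```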

The config after quantization is:

```json
{
  "_name_or_path": "/mnt/models/deepseek-ai/DeepSeek-V2-Lite-Chat/",
  "architectures": [
    "DeepseekV2ForCausalLM"
  ],
  "attention_bias": false,
  "attention_dropout": 0.0,
  "auto_map": {
    "AutoConfig": "configuration_deepseek.DeepseekV2Config",
    "AutoModel": "modeling_deepseek.DeepseekV2Model",
    "AutoModelForCausalLM": "modeling_deepseek.DeepseekV2ForCausalLM"
  },
  "aux_loss_alpha": 0.001,
  "bos_token_id": 100000,
  "eos_token_id": 100001,
  "ep_size": 1,
  "first_k_dense_replace": 1,
  "hidden_act": "silu",
  "hidden_size": 2048,
  "initializer_range": 0.02,
  "intermediate_size": 10944,
  "kv_lora_rank": 512,
  "max_position_embeddings": 163840,
  "model_type": "deepseek_v2",
  "moe_intermediate_size": 1408,
  "moe_layer_freq": 1,
  "n_group": 1,
  "n_routed_experts": 64,
  "n_shared_experts": 2,
  "norm_topk_prob": false,
  "num_attention_heads": 16,
  "num_experts_per_tok": 6,
  "num_hidden_layers": 27,
  "num_key_value_heads": 16,
  "pretraining_tp": 1,
  "q_lora_rank": null,
  "qk_nope_head_dim": 128,
  "qk_rope_head_dim": 64,
  "quantization_config": {
    "bits": 8,
    "checkpoint_format": "gptq",
    "desc_act": false,
    "dynamic": {
      "+:.*\\.18\\..*gate.*": {
        "bits": 4,
        "group_size": 32
      },
      "-:.*\\.20\\..*gate.*": {},
      "-:.*down.*": {},
      ".*\\.19\\..*gate.*": {
        "bits": 8,
        "group_size": 64
      }
    },
    "group_size": 64,
    "lm_head": false,
    "meta": {
      "damp_auto_increment": 0.0025,
      "damp_percent": 0.01,
      "mse": 0.0,
      "quantizer": [
        "gptqmodel:2.0.0-dev"
      ],
      "static_groups": false,
      "true_sequential": true,
      "uri": "https://github.com/modelcloud/gptqmodel"
    },
    "pack_dtype": "int32",
    "quant_method": "gptq",
    "sym": true
  },
  "rms_norm_eps": 1e-06,
  "rope_scaling": {
    "beta_fast": 32,
    "beta_slow": 1,
    "factor": 40,
    "mscale": 0.707,
    "mscale_all_dim": 0.707,
    "original_max_position_embeddings": 4096,
    "type": "yarn"
  },
  "rope_theta": 10000,
  "routed_scaling_factor": 1.0,
  "scoring_func": "softmax",
  "seq_aux": true,
  "tie_word_embeddings": false,
  "topk_group": 1,
  "topk_method": "greedy",
  "torch_dtype": "bfloat16",
  "transformers_version": "4.48.3",
  "use_cache": true,
  "v_head_dim": 128,
  "vocab_size": 102400
}
```

In addition, loading the quantized model with GPTQModel.load and running inference works fine (screenshots were attached).
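
A minimal version of that check (quantized model path is a placeholder):

```python
# Sanity check with GPTQModel itself (path is a placeholder); this loads the
# quantized checkpoint and generates without the KeyError seen in vLLM.
from gptqmodel import GPTQModel

model = GPTQModel.load("/path/to/quantized/output")
result = model.generate("Tell me about large language models.")[0]
print(model.tokenizer.decode(result))
```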


GPU Info
H20

(nvidia-smi output attached as a screenshot)

Software Info

Operating System/Version + Python Version

Output of pip show gptqmodel torch transformers accelerate triton:

```
# pip show gptqmodel torch transformers accelerate triton
Name: gptqmodel
Version: 2.0.0.dev0
Summary: A LLM quantization package with user-friendly apis. Based on GPTQ algorithm.
Home-page: https://github.com/ModelCloud/GPTQModel
Author: ModelCloud
Author-email: [email protected]
License: Apache 2.0
Location: /usr/local/lib/python3.12/dist-packages
Requires: accelerate, colorlog, datasets, device-smi, hf_transfer, huggingface_hub, lm-eval, numpy, packaging, pillow, protobuf, safetensors, threadpoolctl, tokenicer, torch, transformers
Required-by: 
---
Name: torch
Version: 2.5.1
Summary: Tensors and Dynamic neural networks in Python with strong GPU acceleration
Home-page: https://pytorch.org/
Author: PyTorch Team
Author-email: [email protected]
License: BSD-3-Clause
Location: /usr/local/lib/python3.12/dist-packages
Requires: filelock, fsspec, jinja2, networkx, nvidia-cublas-cu12, nvidia-cuda-cupti-cu12, nvidia-cuda-nvrtc-cu12, nvidia-cuda-runtime-cu12, nvidia-cudnn-cu12, nvidia-cufft-cu12, nvidia-curand-cu12, nvidia-cusolver-cu12, nvidia-cusparse-cu12, nvidia-nccl-cu12, nvidia-nvjitlink-cu12, nvidia-nvtx-cu12, setuptools, sympy, triton, typing-extensions
Required-by: accelerate, auto_gptq, bitsandbytes, compressed-tensors, flash_attn, flashinfer-python, gptqmodel, lm_eval, optimum, outlines, peft, runai-model-streamer, timm, torchaudio, torchvision, vllm, xformers, xgrammar
---
Name: transformers
Version: 4.48.3
Summary: State-of-the-art Machine Learning for JAX, PyTorch and TensorFlow
Home-page: https://github.com/huggingface/transformers
Author: The Hugging Face team (past and future) with the help of all our contributors (https://github.com/huggingface/transformers/graphs/contributors)
Author-email: [email protected]
License: Apache 2.0 License
Location: /usr/local/lib/python3.12/dist-packages
Requires: filelock, huggingface-hub, numpy, packaging, pyyaml, regex, requests, safetensors, tokenizers, tqdm
Required-by: auto_gptq, compressed-tensors, gptqmodel, lm_eval, optimum, peft, tokenicer, vllm, xgrammar
---
Name: accelerate
Version: 1.3.0
Summary: Accelerate
Home-page: https://github.com/huggingface/accelerate
Author: The HuggingFace team
Author-email: [email protected]
License: Apache
Location: /usr/local/lib/python3.12/dist-packages
Requires: huggingface-hub, numpy, packaging, psutil, pyyaml, safetensors, torch
Required-by: auto_gptq, gptqmodel, lm_eval, peft
---
Name: triton
Version: 3.1.0
Summary: A language and compiler for custom Deep Learning operations
Home-page: https://github.com/triton-lang/triton/
Author: Philippe Tillet
Author-email: [email protected]
License: 
Location: /usr/local/lib/python3.12/dist-packages
Requires: filelock
Required-by: torch
```

liweiqing1997 added the bug label on Mar 13, 2025
@Qubitium
Collaborator

@liweiqing1997 Can you upload the model to HF for us to test? ModelScope is ok too.

@liweiqing1997
Author

> @liweiqing1997 Can you upload the model to HF for us to test? ModelScope is ok too.

You can use this model: https://modelscope.cn/models/deepseek-ai/DeepSeek-V2-Lite-Chat/files. We used this same model, but only kept 27 of its layers.

@Qubitium
Collaborator

@liweiqing1997 Can you upload the quantized model so we can skip the slow quant stage and directly run it?

@liweiqing1997
Author

> @liweiqing1997 Can you upload the quantized model so we can skip the slow quant stage and directly run it?

I'm very sorry, but our model involves private data, so it isn't convenient to share.

Do you have the resources to quantize a small MoE model, such as DeepSeek-V2-Lite-Chat? If it's not convenient for you, I'll think of other solutions. Thank you very much.

@Qubitium
Collaborator

@liweiqing1997 Totally understand. We will try to quantize the model and fix this by next week. Based on your stack trace, the bug is most likely in how vLLM renames the model parameters.
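
For reference, the renaming in question is the expert-parameter mapping in vLLM's fused MoE layer; a quick way to inspect it (a sketch assuming the helper keeps its current signature in vLLM main) is:

```python
# Inspect how vLLM remaps per-expert checkpoint names onto the fused
# w13_*/w2_* parameters (assumes FusedMoE.make_expert_params_mapping exists as in vLLM main).
from vllm.model_executor.layers.fused_moe import FusedMoE

mapping = FusedMoE.make_expert_params_mapping(
    ckpt_gate_proj_name="gate_proj",
    ckpt_down_proj_name="down_proj",
    ckpt_up_proj_name="up_proj",
    num_experts=64,
)

# Each entry pairs a fused parameter-name prefix with a per-expert
# checkpoint weight-name pattern, plus the expert id and shard id.
for entry in mapping[:6]:
    print(entry)
```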
