[BUG] The dynamically quantized MoE model failed to deploy in vLLM. #1455

Open · liweiqing1997 opened this issue Mar 13, 2025 · 5 comments
Labels: bug (Something isn't working)

Comments

@liweiqing1997

Describe the bug

The bug is a KeyError raised when deploying a quantized DeepSeek-V2-Lite-Chat model with the vLLM framework. The error occurs during weight loading, where the key names in the model's named_parameters() dictionary (params_dict) do not match the key names in the quantized weight file. Specifically:

Key Mismatch:

- The key registered in self.named_parameters() is 'model.layers.21.mlp.experts.w2_qweight'.
- The processed key coming from the quantized weight file is 'model.layers.21.mlp.experts.w2_weight'.

Error Cause:

When loading the weights, the code calls param = params_dict[name], but the name derived from the weight file does not exist in params_dict, resulting in a KeyError.
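
To illustrate the mismatch, here is a small diagnostic sketch (the checkpoint shard path is a placeholder) that lists the expert tensor names actually stored in the quantized checkpoint for the layer above, so they can be compared against the w2_qweight names vLLM registers:

```python
# List the expert-related tensor names stored in the quantized checkpoint
# for layer 21 (the shard path below is a placeholder).
from safetensors import safe_open

ckpt_path = "/path/to/quantized/model.safetensors"  # placeholder path

with safe_open(ckpt_path, framework="pt", device="cpu") as f:
    for key in sorted(f.keys()):
        if "layers.21.mlp.experts" in key:
            print(key)
```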

vllm version:

- vllm-main, commit hash debd6bb
- or https://github.com/ZZBoom/vllm/commits/main/, commit hash fc7c714854f422a7e000bcc9fa31d4f61796a7b6

How can this issue be resolved?

Error stack trace:

```
  File "/mnt/lwq/lwq/quant/vllm/vllm-main-debd6bb/vllm/executor/mp_distributed_executor.py", line 185, in _run_workers
    driver_worker_output = run_method(self.driver_worker, sent_method,
  File "/mnt/lwq/lwq/quant/vllm/vllm-main-debd6bb/vllm/utils.py", line 2238, in run_method
    return func(*args, **kwargs)
  File "/mnt/lwq/lwq/quant/vllm/vllm-main-debd6bb/vllm/worker/worker.py", line 183, in load_model
    self.model_runner.load_model()
  File "/mnt/lwq/lwq/quant/vllm/vllm-main-debd6bb/vllm/worker/model_runner.py", line 1113, in load_model
    self.model = get_model(vllm_config=self.vllm_config)
  File "/mnt/lwq/lwq/quant/vllm/vllm-main-debd6bb/vllm/model_executor/model_loader/__init__.py", line 14, in get_model
    return loader.load_model(vllm_config=vllm_config)
  File "/mnt/lwq/lwq/quant/vllm/vllm-main-debd6bb/vllm/model_executor/model_loader/loader.py", line 426, in load_model
    loaded_weights = model.load_weights(
  File "/mnt/lwq/lwq/quant/vllm/vllm-main-debd6bb/vllm/model_executor/models/deepseek_v2.py", line 790, in load_weights
    param = params_dict[name]
KeyError: 'model.layers.10.mlp.experts.w2_weight'
```
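
The trace above comes from simply loading the quantized model with vLLM; a minimal reproduction sketch (model path is a placeholder) looks roughly like this:

```python
# Minimal vLLM offline load of the quantized checkpoint (path is a placeholder);
# constructing the LLM is enough to reach the weight-loading code that raises the KeyError.
from vllm import LLM

llm = LLM(
    model="/path/to/quantized/DeepSeek-V2-Lite-Chat-gptq",
    trust_remote_code=True,
)
```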

My dynamic quantization settings use the example configuration from the GPTQModel homepage:

```python
dynamic = {
    # .*\. matches the layers_node prefix
    # layer index starts at 0

    # positive match: layer 19, gate module
    r"+:.*\.18\..*gate.*": {"bits": 4, "group_size": 32},

    # positive match: layer 20, gate module (prefix defaults to positive if missing)
    r".*\.19\..*gate.*": {"bits": 8, "group_size": 64},

    # negative match: skip layer 21, gate module
    r"-:.*\.20\..*gate.*": {},

    # negative match: skip all down modules for all layers
    r"-:.*down.*": {},
}
```
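
For context, this dict is passed to GPTQModel roughly as follows (a sketch against the gptqmodel 2.x API; model path, output path, and calibration samples are placeholders):

```python
# Sketch of how the dynamic overrides are applied during quantization
# (paths and calibration samples are placeholders).
from gptqmodel import GPTQModel, QuantizeConfig

calibration_dataset = ["placeholder calibration text 1", "placeholder calibration text 2"]

quant_config = QuantizeConfig(
    bits=8,           # default bits for modules not matched by a dynamic rule
    group_size=64,    # default group size
    dynamic=dynamic,  # per-module overrides defined above
)

model = GPTQModel.load("/mnt/models/deepseek-ai/DeepSeek-V2-Lite-Chat/", quant_config)
model.quantize(calibration_dataset)
model.save("/path/to/quantized/output")  # placeholder output path
```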

The config after quantization is:

```json
{
  "_name_or_path": "/mnt/models/deepseek-ai/DeepSeek-V2-Lite-Chat/",
  "architectures": [
    "DeepseekV2ForCausalLM"
  ],
  "attention_bias": false,
  "attention_dropout": 0.0,
  "auto_map": {
    "AutoConfig": "configuration_deepseek.DeepseekV2Config",
    "AutoModel": "modeling_deepseek.DeepseekV2Model",
    "AutoModelForCausalLM": "modeling_deepseek.DeepseekV2ForCausalLM"
  },
  "aux_loss_alpha": 0.001,
  "bos_token_id": 100000,
  "eos_token_id": 100001,
  "ep_size": 1,
  "first_k_dense_replace": 1,
  "hidden_act": "silu",
  "hidden_size": 2048,
  "initializer_range": 0.02,
  "intermediate_size": 10944,
  "kv_lora_rank": 512,
  "max_position_embeddings": 163840,
  "model_type": "deepseek_v2",
  "moe_intermediate_size": 1408,
  "moe_layer_freq": 1,
  "n_group": 1,
  "n_routed_experts": 64,
  "n_shared_experts": 2,
  "norm_topk_prob": false,
  "num_attention_heads": 16,
  "num_experts_per_tok": 6,
  "num_hidden_layers": 27,
  "num_key_value_heads": 16,
  "pretraining_tp": 1,
  "q_lora_rank": null,
  "qk_nope_head_dim": 128,
  "qk_rope_head_dim": 64,
  "quantization_config": {
    "bits": 8,
    "checkpoint_format": "gptq",
    "desc_act": false,
    "dynamic": {
      "+:.*\\.18\\..*gate.*": {
        "bits": 4,
        "group_size": 32
      },
      "-:.*\\.20\\..*gate.*": {},
      "-:.*down.*": {},
      ".*\\.19\\..*gate.*": {
        "bits": 8,
        "group_size": 64
      }
    },
    "group_size": 64,
    "lm_head": false,
    "meta": {
      "damp_auto_increment": 0.0025,
      "damp_percent": 0.01,
      "mse": 0.0,
      "quantizer": [
        "gptqmodel:2.0.0-dev"
      ],
      "static_groups": false,
      "true_sequential": true,
      "uri": "https://github.com/modelcloud/gptqmodel"
    },
    "pack_dtype": "int32",
    "quant_method": "gptq",
    "sym": true
  },
  "rms_norm_eps": 1e-06,
  "rope_scaling": {
    "beta_fast": 32,
    "beta_slow": 1,
    "factor": 40,
    "mscale": 0.707,
    "mscale_all_dim": 0.707,
    "original_max_position_embeddings": 4096,
    "type": "yarn"
  },
  "rope_theta": 10000,
  "routed_scaling_factor": 1.0,
  "scoring_func": "softmax",
  "seq_aux": true,
  "tie_word_embeddings": false,
  "topk_group": 1,
  "topk_method": "greedy",
  "torch_dtype": "bfloat16",
  "transformers_version": "4.48.3",
  "use_cache": true,
  "v_head_dim": 128,
  "vocab_size": 102400
}
```

In addition, loading the quantized model with GPTQModel.load and running inference works fine (screenshots were attached).
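
A minimal version of that check (quantized model path is a placeholder):

```python
# Sanity check with GPTQModel itself (path is a placeholder); this loads the
# quantized checkpoint and generates without the KeyError seen in vLLM.
from gptqmodel import GPTQModel

model = GPTQModel.load("/path/to/quantized/output")
result = model.generate("Tell me about large language models.")[0]
print(model.tokenizer.decode(result))
```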


GPU Info
H20

(nvidia-smi output attached as a screenshot)

Software Info

Operating System/Version + Python Version

Output of pip show gptqmodel torch transformers accelerate triton:

```
# pip show gptqmodel torch transformers accelerate triton
Name: gptqmodel
Version: 2.0.0.dev0
Summary: A LLM quantization package with user-friendly apis. Based on GPTQ algorithm.
Home-page: https://github.com/ModelCloud/GPTQModel
Author: ModelCloud
Author-email: [email protected]
License: Apache 2.0
Location: /usr/local/lib/python3.12/dist-packages
Requires: accelerate, colorlog, datasets, device-smi, hf_transfer, huggingface_hub, lm-eval, numpy, packaging, pillow, protobuf, safetensors, threadpoolctl, tokenicer, torch, transformers
Required-by: 
---
Name: torch
Version: 2.5.1
Summary: Tensors and Dynamic neural networks in Python with strong GPU acceleration
Home-page: https://pytorch.org/
Author: PyTorch Team
Author-email: [email protected]
License: BSD-3-Clause
Location: /usr/local/lib/python3.12/dist-packages
Requires: filelock, fsspec, jinja2, networkx, nvidia-cublas-cu12, nvidia-cuda-cupti-cu12, nvidia-cuda-nvrtc-cu12, nvidia-cuda-runtime-cu12, nvidia-cudnn-cu12, nvidia-cufft-cu12, nvidia-curand-cu12, nvidia-cusolver-cu12, nvidia-cusparse-cu12, nvidia-nccl-cu12, nvidia-nvjitlink-cu12, nvidia-nvtx-cu12, setuptools, sympy, triton, typing-extensions
Required-by: accelerate, auto_gptq, bitsandbytes, compressed-tensors, flash_attn, flashinfer-python, gptqmodel, lm_eval, optimum, outlines, peft, runai-model-streamer, timm, torchaudio, torchvision, vllm, xformers, xgrammar
---
Name: transformers
Version: 4.48.3
Summary: State-of-the-art Machine Learning for JAX, PyTorch and TensorFlow
Home-page: https://github.com/huggingface/transformers
Author: The Hugging Face team (past and future) with the help of all our contributors (https://github.com/huggingface/transformers/graphs/contributors)
Author-email: [email protected]
License: Apache 2.0 License
Location: /usr/local/lib/python3.12/dist-packages
Requires: filelock, huggingface-hub, numpy, packaging, pyyaml, regex, requests, safetensors, tokenizers, tqdm
Required-by: auto_gptq, compressed-tensors, gptqmodel, lm_eval, optimum, peft, tokenicer, vllm, xgrammar
---
Name: accelerate
Version: 1.3.0
Summary: Accelerate
Home-page: https://github.com/huggingface/accelerate
Author: The HuggingFace team
Author-email: [email protected]
License: Apache
Location: /usr/local/lib/python3.12/dist-packages
Requires: huggingface-hub, numpy, packaging, psutil, pyyaml, safetensors, torch
Required-by: auto_gptq, gptqmodel, lm_eval, peft
---
Name: triton
Version: 3.1.0
Summary: A language and compiler for custom Deep Learning operations
Home-page: https://github.com/triton-lang/triton/
Author: Philippe Tillet
Author-email: [email protected]
License: 
Location: /usr/local/lib/python3.12/dist-packages
Requires: filelock
Required-by: torch
```

liweiqing1997 added the bug label on Mar 13, 2025
@Qubitium
Collaborator

@liweiqing1997 Can you upload the model to HF for us to test? ModelScope is ok too.

@liweiqing1997
Author

> @liweiqing1997 Can you upload the model to HF for us to test? ModelScope is ok too.

You can use this model: https://modelscope.cn/models/deepseek-ai/DeepSeek-V2-Lite-Chat/files. We used this same model, but only kept 27 of its layers.

@Qubitium
Collaborator

@liweiqing1997 Can you upload the quantized model so we can skip the slow quant stage and directly run it?

@liweiqing1997
Author

> @liweiqing1997 Can you upload the quantized model so we can skip the slow quant stage and directly run it?

I'm very sorry, but our model involves private data, so it isn't convenient to share.

Do you have the resources to quantize a small MoE model, such as DeepSeek-V2-Lite-Chat? If it's not convenient for you, I'll think of other solutions. Thank you very much.

@Qubitium
Collaborator

@liweiqing1997 Totally understand. We will try to quantize the model and fix this by next week. Based on your stack trace, the bug is most likely in how vLLM renames the model parameters.
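
For reference, the renaming in question is the expert-parameter mapping in vLLM's fused MoE layer; a quick way to inspect it (a sketch assuming the helper keeps its current signature in vLLM main) is:

```python
# Inspect how vLLM remaps per-expert checkpoint names onto the fused
# w13_*/w2_* parameters (assumes FusedMoE.make_expert_params_mapping exists as in vLLM main).
from vllm.model_executor.layers.fused_moe import FusedMoE

mapping = FusedMoE.make_expert_params_mapping(
    ckpt_gate_proj_name="gate_proj",
    ckpt_down_proj_name="down_proj",
    ckpt_up_proj_name="up_proj",
    num_experts=64,
)

# Each entry pairs a fused parameter-name prefix with a per-expert
# checkpoint weight-name pattern, plus the expert id and shard id.
for entry in mapping[:6]:
    print(entry)
```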
