Description
Describe the bug
I quantized Qwen2.5-VL to W4A16 with the GPTQ method and compared the per-token time cost during the decode stage.
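Roughly, the quantization was done along the following lines. This is only a simplified sketch in the style of the llm-compressor GPTQ examples: the calibration dataset, output path, and the multimodal-specific calibration setup (data collator, tracing) are placeholders/assumptions, and exact imports may differ across llm-compressor versions.

```python
# Simplified W4A16 GPTQ sketch (llm-compressor style); dataset, paths, and
# calibration details are assumptions, not the exact script that was run.
from transformers import AutoProcessor, Qwen2_5_VLForConditionalGeneration

from llmcompressor import oneshot
from llmcompressor.modifiers.quantization import GPTQModifier

MODEL_ID = "Qwen/Qwen2.5-VL-3B-Instruct"

model = Qwen2_5_VLForConditionalGeneration.from_pretrained(MODEL_ID, torch_dtype="auto")
processor = AutoProcessor.from_pretrained(MODEL_ID)

# Quantize only the language-model Linear layers to 4-bit weights / 16-bit
# activations; keep the vision tower and lm_head in full precision.
recipe = GPTQModifier(
    targets="Linear",
    scheme="W4A16",
    ignore=["lm_head", "re:visual.*"],
)

oneshot(
    model=model,
    dataset="flickr30k",          # assumed calibration dataset
    recipe=recipe,
    max_seq_length=2048,
    num_calibration_samples=512,
)

model.save_pretrained("Qwen2.5-VL-3B-Instruct-W4A16", save_compressed=True)
processor.save_pretrained("Qwen2.5-VL-3B-Instruct-W4A16")
```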
The first image below shows the time taken to decode a single token with the unquantized Qwen2.5-VL-3B-Instruct. The second image shows the time taken to decode a single token with the W4A16-quantized model. Surprisingly, decoding appears to take even longer after quantization than before.
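The comparison can be reproduced with something like the sketch below, using vLLM's offline API. The model path and prompt are placeholders, and prefill time is amortized into the per-token average, so this is only an approximation of the decode latency shown in the images.

```python
# Rough per-token latency measurement with vLLM; swap the model path between
# the unquantized and W4A16 checkpoints to compare.
import time

from vllm import LLM, SamplingParams

llm = LLM(model="Qwen2.5-VL-3B-Instruct-W4A16", max_model_len=4096)

prompt = "Describe the weather in one hundred words."
params = SamplingParams(temperature=0.0, max_tokens=256)

# Warm up once so compilation / graph capture is not included in the timing.
llm.generate([prompt], params)

start = time.perf_counter()
outputs = llm.generate([prompt], params)
elapsed = time.perf_counter() - start

num_tokens = len(outputs[0].outputs[0].token_ids)
print(f"~{elapsed / num_tokens * 1000:.2f} ms per generated token (prefill amortized)")
```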
Related issue: #1591
Expected behavior
Since the decode phase is typically memory-bound, I would expect quantization to improve performance by reducing the amount of weight data read per token. However, the latency of the quantized model is no better than that of the unquantized one. Why is that the case? Shouldn't it be faster with quantization?
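As a back-of-the-envelope roofline estimate of what I expected (the parameter count and H20 bandwidth figures below are rough assumptions, not measured values):

```python
# Lower-bound decode latency if every weight is read once per generated token.
PARAMS = 3.1e9        # assumed parameter count of the Qwen2.5-VL-3B text stack
BANDWIDTH = 4.0e12    # assumed H20 HBM bandwidth in bytes/s (~4 TB/s)

def per_token_ms(bytes_per_param: float) -> float:
    return PARAMS * bytes_per_param / BANDWIDTH * 1e3

print(f"BF16  weights: ~{per_token_ms(2.0):.2f} ms/token")   # roughly 1.6 ms
print(f"W4A16 weights: ~{per_token_ms(0.5):.2f} ms/token")   # roughly 0.4 ms
```

Under this memory-bandwidth-only model, 4-bit weights should cut the per-token decode lower bound to about a quarter of the BF16 figure, which is why the similar measured latency is surprising.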
Environment
- OS: Ubuntu 22.04
- Python version: 3.10.16
- LLM Compressor version: v0.6.0
- torch version: 2.6.0+cu124
- vLLM version: 0.8.5
- CUDA version: 12.8
- transformers version: 4.52.4
- GPU: NVIDIA H20