Description
Describe the bug
I quantized Qwen2.5-VL to W4A16 with the GPTQ method and compared the per-token time cost during the decode stage.
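Roughly, the quantization was done along the following lines. This is only a simplified sketch in the style of the llm-compressor GPTQ examples: the calibration dataset, output path, and the multimodal-specific calibration setup (data collator, tracing) are placeholders/assumptions, and exact imports may differ across llm-compressor versions.

```python
# Simplified W4A16 GPTQ sketch (llm-compressor style); dataset, paths, and
# calibration details are assumptions, not the exact script that was run.
from transformers import AutoProcessor, Qwen2_5_VLForConditionalGeneration

from llmcompressor import oneshot
from llmcompressor.modifiers.quantization import GPTQModifier

MODEL_ID = "Qwen/Qwen2.5-VL-3B-Instruct"

model = Qwen2_5_VLForConditionalGeneration.from_pretrained(MODEL_ID, torch_dtype="auto")
processor = AutoProcessor.from_pretrained(MODEL_ID)

# Quantize only the language-model Linear layers to 4-bit weights / 16-bit
# activations; keep the vision tower and lm_head in full precision.
recipe = GPTQModifier(
    targets="Linear",
    scheme="W4A16",
    ignore=["lm_head", "re:visual.*"],
)

oneshot(
    model=model,
    dataset="flickr30k",          # assumed calibration dataset
    recipe=recipe,
    max_seq_length=2048,
    num_calibration_samples=512,
)

model.save_pretrained("Qwen2.5-VL-3B-Instruct-W4A16", save_compressed=True)
processor.save_pretrained("Qwen2.5-VL-3B-Instruct-W4A16")
```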
The first image below shows the time taken to decode a single token with the unquantized Qwen2.5-VL-3B-Instruct. The second image shows the time taken to decode a single token with the W4A16-quantized model. Surprisingly, decoding appears to take even longer after quantization than before.
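The comparison can be reproduced with something like the sketch below, using vLLM's offline API. The model path and prompt are placeholders, and prefill time is amortized into the per-token average, so this is only an approximation of the decode latency shown in the images.

```python
# Rough per-token latency measurement with vLLM; swap the model path between
# the unquantized and W4A16 checkpoints to compare.
import time

from vllm import LLM, SamplingParams

llm = LLM(model="Qwen2.5-VL-3B-Instruct-W4A16", max_model_len=4096)

prompt = "Describe the weather in one hundred words."
params = SamplingParams(temperature=0.0, max_tokens=256)

# Warm up once so compilation / graph capture is not included in the timing.
llm.generate([prompt], params)

start = time.perf_counter()
outputs = llm.generate([prompt], params)
elapsed = time.perf_counter() - start

num_tokens = len(outputs[0].outputs[0].token_ids)
print(f"~{elapsed / num_tokens * 1000:.2f} ms per generated token (prefill amortized)")
```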
Related issue: #1591
Expected behavior
Since the decode phase is typically memory-bound, I would expect quantization to improve performance by reducing the amount of weight data read per token. However, the latency of the quantized model is no better than that of the unquantized one. Why is that the case? Shouldn't it be faster with quantization?
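As a back-of-the-envelope roofline estimate of what I expected (the parameter count and H20 bandwidth figures below are rough assumptions, not measured values):

```python
# Lower-bound decode latency if every weight is read once per generated token.
PARAMS = 3.1e9        # assumed parameter count of the Qwen2.5-VL-3B text stack
BANDWIDTH = 4.0e12    # assumed H20 HBM bandwidth in bytes/s (~4 TB/s)

def per_token_ms(bytes_per_param: float) -> float:
    return PARAMS * bytes_per_param / BANDWIDTH * 1e3

print(f"BF16  weights: ~{per_token_ms(2.0):.2f} ms/token")   # roughly 1.6 ms
print(f"W4A16 weights: ~{per_token_ms(0.5):.2f} ms/token")   # roughly 0.4 ms
```

Under this memory-bandwidth-only model, 4-bit weights should cut the per-token decode lower bound to about a quarter of the BF16 figure, which is why the similar measured latency is surprising.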
Environment
- OS: Ubuntu 22.04
- Python version: 3.10.16
- LLM Compressor version: v0.6.0
- torch version: 2.6.0+cu124
- vLLM version: 0.8.5
- CUDA version: 12.8
- transformers version: 4.52.4
- GPU: NVIDIA H20