
The decode stage of Qwen2.5-VL does not perform as well as expected after GPTQ W4A16 quantization #1602

Open
@KarlDe1

Description


Describe the bug
I quantized Qwen2.5-VL to W4A16 with the GPTQ method and compared the per-token latency of the decode stage before and after quantization.
The first image below shows the time taken to decode a single token with the unquantized Qwen2.5-VL-3B-Instruct. The second image shows the time taken to decode a single token after W4A16 quantization. Surprisingly, decoding appears to take even longer after quantization than before.

[Image: per-token decode latency, unquantized Qwen2.5-VL-3B-Instruct]

[Image: per-token decode latency, W4A16-quantized Qwen2.5-VL-3B-Instruct]

Related issue: #1591
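
For reference, a W4A16 GPTQ recipe for Qwen2.5-VL with LLM Compressor looks roughly like the sketch below. This is a simplified illustration, not the exact script behind the measurements above: the model and output paths are placeholders, calibration is shown with a text-only dataset for brevity (the upstream multimodal example calibrates on image-text pairs with a processor-specific collator), and the ignore list follows that example.

```python
from transformers import AutoProcessor, Qwen2_5_VLForConditionalGeneration

from llmcompressor import oneshot
from llmcompressor.modifiers.quantization import GPTQModifier

MODEL_ID = "Qwen/Qwen2.5-VL-3B-Instruct"   # baseline checkpoint
SAVE_DIR = "Qwen2.5-VL-3B-Instruct-W4A16"  # illustrative output path

model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
    MODEL_ID, torch_dtype="auto", device_map="auto"
)
processor = AutoProcessor.from_pretrained(MODEL_ID)

# 4-bit weights / 16-bit activations on the language-model Linear layers only;
# the vision tower and lm_head stay in the original precision.
recipe = GPTQModifier(
    targets="Linear",
    scheme="W4A16",
    ignore=["lm_head", "re:visual.*"],
)

# Text-only calibration shown for brevity; the multimodal example in
# llm-compressor uses image-text pairs plus a custom data collator.
oneshot(
    model=model,
    dataset="open_platypus",
    recipe=recipe,
    max_seq_length=2048,
    num_calibration_samples=512,
)

model.save_pretrained(SAVE_DIR, save_compressed=True)
processor.save_pretrained(SAVE_DIR)
```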

Expected behavior
Since the decode phase is typically memory-bound, I would expect weight-only quantization to improve performance by reducing the amount of weight data that must be read from memory for each token. However, the latency of the quantized model appears similar to, if not worse than, that of the unquantized one. Why is that the case? Shouldn't decoding be faster with W4A16?
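
One way to isolate per-token decode latency with vLLM is to time two greedy generations of different lengths against the same prompt and divide the difference by the extra tokens, which cancels out the prefill cost. The sketch below is only for illustration (the model path, prompt, and token counts are placeholders, not necessarily the benchmark behind the screenshots above):

```python
import time

from vllm import LLM, SamplingParams

# Placeholder path; point this at either the baseline or the W4A16 checkpoint.
llm = LLM(model="Qwen2.5-VL-3B-Instruct-W4A16", max_model_len=4096)

prompt = "Explain why the decode stage of an LLM is usually memory-bound."


def timed_generate(max_tokens: int) -> float:
    """Run one greedy generation and return wall-clock seconds."""
    params = SamplingParams(temperature=0.0, max_tokens=max_tokens, ignore_eos=True)
    start = time.perf_counter()
    llm.generate([prompt], params)
    return time.perf_counter() - start


timed_generate(8)             # warm-up (CUDA graph capture, caches)

t_short = timed_generate(1)   # prefill + 1 decode step
t_long = timed_generate(257)  # prefill + 257 decode steps
decode_ms = (t_long - t_short) / 256 * 1e3
print(f"approx. decode latency: {decode_ms:.2f} ms/token")
```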

Environment

  1. OS: Ubuntu 22.04
  2. Python version: 3.10.16
  3. LLM Compressor version: v0.6.0
  4. torch version: 2.6.0+cu124
  5. vLLM version: 0.8.5
  6. CUDA version: 12.8
  7. transformers version: 4.52.4
  8. GPU: NVIDIA H20
