Why is the inference speed of a quantized model with int8 slower? #4129

pfldy2850 · 2023-08-10T05:54:07Z

We are experimenting with the 12.8-billion-parameter GPT-NeoX model using DeepSpeed Inference (init_inference) on an A100 GPU.

With float16, the average per-token generation latency was 24 ms; with int8, it was 54 ms.
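For context, here is a minimal sketch of how an average per-token latency like the one above could be measured. It is not the exact benchmark used here: `mean_per_token_latency_ms`, `generate_fn`, and `fake_generate` are hypothetical names for illustration, and `fake_generate` only simulates a model by sleeping.

```python
import time
from typing import Callable, Sequence


def mean_per_token_latency_ms(
    generate_fn: Callable[[int], Sequence[int]],
    num_new_tokens: int,
    warmup_runs: int = 1,
    timed_runs: int = 3,
) -> float:
    """Return the average wall-clock time per generated token, in milliseconds.

    generate_fn is a hypothetical callable that takes the number of new
    tokens to generate and returns the generated token ids.
    """
    if num_new_tokens <= 0 or timed_runs <= 0:
        raise ValueError("num_new_tokens and timed_runs must be positive")

    # Warm-up runs are excluded from timing so one-time setup costs
    # (kernel compilation, cache allocation) do not skew the average.
    for _ in range(warmup_runs):
        generate_fn(num_new_tokens)

    total_seconds = 0.0
    total_tokens = 0
    for _ in range(timed_runs):
        start = time.perf_counter()
        tokens = generate_fn(num_new_tokens)
        total_seconds += time.perf_counter() - start
        total_tokens += len(tokens)

    if total_tokens == 0:
        raise ValueError("generate_fn returned no tokens")
    return 1000.0 * total_seconds / total_tokens


def fake_generate(num_new_tokens: int) -> list:
    """Stand-in for a real model: sleeps about 1 ms per token (assumption)."""
    time.sleep(0.001 * num_new_tokens)
    return list(range(num_new_tokens))


if __name__ == "__main__":
    latency = mean_per_token_latency_ms(fake_generate, num_new_tokens=20)
    print(f"average per-token latency: {latency:.2f} ms")
```

With a real GPU model, `generate_fn` would need to wait for the device to finish (for example by calling `torch.cuda.synchronize()`) before returning, otherwise the timer may stop before the kernels complete. Running the same helper once for the float16 engine and once for the int8 engine would give directly comparable numbers.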

Why is the inference speed of a quantized model with int8 slower?

My understanding is that reducing the bit width should make operations more efficient. Is that not the case in practice?

Replies: 0 comments
