Quantization Memory Requirements #1228
Comments
Hi @sneha5gsm, let me take a look at the raw memory requirements and see whether there is an equation, similar to the kv-cache one, that approximates them. For question 2:
Thank you for taking the time to look into the first query, and for the answers to the second one! Follow-up question for 2:
What if we have 2 GPUs, so that while one GPU is offloading, the other is ready for computing? Thanks
Yes, you can, if you are not concerned about error propagation from one layer to the next. GPTQ assumes the model is pretrained, which means its activations are relatively stable: a layer's prediction is already very close to the ground-truth output. Because of that, we don't have to estimate the Hessian with full backpropagation; instead, we look at the layer's input activations to approximate the Hessian that GPTQ uses. So, given a calibration dataset, you can quantize the layers in parallel as long as you have their input activations.
One reason we do it sequentially is that the output of one layer is the input of the next, so quantizing sequentially compensates for the errors introduced as layers are quantized. Another reason is that doing it in parallel is very expensive: x, the input activation, in Llama has a shape of roughly [batch, seq, d_model]; in fp16 that is roughly 8 MB (1024 × 4096 × 2 bytes) for seq of 1024, d_model of 4096, and batch of 1, just for the activations.
Overlapping offloading and compute to maximize utilization is something we can optimize. If you'd like, we can guide you on how to optimize this and contribute it to the repo.
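For concreteness, here is a minimal sketch of the two points above: the per-layer activation memory arithmetic and the idea of approximating a GPTQ-style Hessian from the layer-input activations alone, without backpropagation. The shapes and the normalization are illustrative assumptions, not the library's actual implementation.

```python
import torch

# Illustrative Llama-style shapes (assumptions, not taken from the repo).
batch, seq_len, d_model = 1, 1024, 4096

# Input activations captured for one Linear layer during calibration
# (e.g. via a forward hook), stored in fp16 = 2 bytes per value.
x = torch.randn(batch, seq_len, d_model, dtype=torch.float16)

# Memory held just for this one activation tensor:
act_bytes = x.numel() * x.element_size()
print(f"activation memory: {act_bytes / 2**20:.1f} MiB")  # 1 * 1024 * 4096 * 2 B = 8 MiB

# GPTQ-style Hessian approximation from the layer input, no backprop needed:
# H ~ 2 * X X^T accumulated over calibration tokens (one column per token).
# The exact running-average / normalization details vary by implementation.
X = x.reshape(-1, d_model).to(torch.float32).t()  # [d_model, batch * seq_len]
H = 2.0 * (X @ X.t()) / X.shape[1]                # [d_model, d_model]

# This per-layer Hessian is the kind of scratch memory that a
# reserve_for_hessians-style budget has to account for:
hessian_bytes = H.numel() * H.element_size()
print(f"hessian memory: {hessian_bytes / 2**20:.1f} MiB")  # 4096^2 * 4 B = 64 MiB
```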
@horheynm Also, regarding the offloading/compute optimization you mentioned:
How would I go about contributing to that?
Hello!
I was trying the various quantization recipes for quantizing a 70B Llama-3-based model to FP8, INT8, and INT4 (A16) precisions, as described in the quantization docs by vLLM.
I understand that calculate_offload_device_map creates a custom device map by reserving memory for GPTQ (reserve_for_hessians), but I would still like to understand the memory requirements so that I can utilize GPU memory efficiently, see where all the GPU memory is consumed, and ensure that there are no bugs.
Thank you!
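As a rough back-of-the-envelope starting point (an estimate added here for illustration, not an answer from the maintainers), the weight footprint alone scales linearly with bit width; activations, kv-cache, quantization scales, and the Hessian scratch space that reserve_for_hessians budgets for all come on top of this:

```python
# Weights-only memory for a 70B-parameter model at the precisions above.
# Ignores activations, kv-cache, scales/zero-points, and the per-layer
# Hessian scratch space reserved for GPTQ.
PARAMS = 70e9

for name, bits in [("FP16/BF16 baseline", 16), ("FP8 / INT8 (W8)", 8), ("INT4 (W4A16)", 4)]:
    gib = PARAMS * bits / 8 / 2**30
    print(f"{name:>20}: ~{gib:.0f} GiB")

# Output: FP16/BF16 baseline ~130 GiB, FP8/INT8 ~65 GiB, INT4 ~33 GiB
```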