Load the model to CPU but quantize using the GPU #1383

Open
sgsdxzy opened this issue Apr 25, 2025 · 1 comment
Labels: enhancement (New feature or request)

Comments


sgsdxzy commented Apr 25, 2025

Some models are too large to fully load onto the GPU, but quantization (GPTQ and AWQ in particular) is too slow on the CPU. Would it be possible to load the model onto the CPU and quantize it layer by layer, moving each layer to the GPU before it is quantized and offloading it back to the CPU once its quantization finishes?
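Conceptually, something like the following (a rough sketch only; `quantize_layer_` is a hypothetical stand-in for whatever per-layer quantization routine is used, and the `model.model.layers` layout assumes a typical decoder-style transformer):

```python
import torch

def quantize_with_layer_offload(model, quantize_layer_, device="cuda"):
    # Model weights stay on the CPU; only the layer currently being
    # quantized is resident on the GPU.
    for layer in model.model.layers:
        layer.to(device)            # move one decoder layer to the GPU
        quantize_layer_(layer)      # quantize it in place on the GPU
        layer.to("cpu")             # offload it back to the CPU
        torch.cuda.empty_cache()    # release the freed GPU memory
```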
Most other quantization frameworks support this, for example in GPTQModel:

Added QuantizeConfig.device to clearly define which device is used for quantization: default = auto. Non-quantized models are always loaded on cpu by-default and each layer is moved to QuantizeConfig.device during quantization to minimize vram usage.
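For reference, a rough usage sketch of that option (based on the GPTQModel README; exact class and method names, and the `device` argument, may differ between GPTQModel versions):

```python
from gptqmodel import GPTQModel, QuantizeConfig

# Weights load on the CPU; each layer is moved to `device` only while it is
# being quantized, per the release note quoted above.
quant_config = QuantizeConfig(bits=4, group_size=128, device="cuda:0")

model = GPTQModel.load("meta-llama/Llama-3.1-8B", quant_config)

# Toy calibration data; a real run would use a few hundred representative samples.
calibration_dataset = ["Quantization calibration sample text."] * 256

model.quantize(calibration_dataset)
model.save("Llama-3.1-8B-gptq-w4g128")
```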
sgsdxzy added the enhancement (New feature or request) label on Apr 25, 2025
kylesayrs (Collaborator) commented Apr 29, 2025

This is already partially supported by GPTQ (and soon will be by AWQ)! You can control how large a "layer" is via GPTQModifier(sequential_targets=...).
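A minimal sketch of what that looks like in a recipe (import paths, the `sequential_targets` value, and the other arguments are assumptions taken from typical llm-compressor examples and may vary by model and version):

```python
from llmcompressor import oneshot
from llmcompressor.modifiers.quantization import GPTQModifier

recipe = GPTQModifier(
    targets="Linear",
    scheme="W4A16",
    ignore=["lm_head"],
    # sequential_targets controls how large a "layer" is: each matching
    # module is calibrated and quantized one at a time, so smaller units
    # need less memory on the quantization device at any given moment.
    sequential_targets=["LlamaDecoderLayer"],
)

oneshot(
    model="meta-llama/Meta-Llama-3-8B-Instruct",
    dataset="open_platypus",
    recipe=recipe,
    max_seq_length=2048,
    num_calibration_samples=512,
)
```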

This will be implemented once #1263 lands

kylesayrs self-assigned this on Apr 29, 2025