Load the model to CPU but quantize using the GPU #1383

Closed
@sgsdxzy

Description

Some models are too large to load fully onto the GPU, but quantization (GPTQ and AWQ in particular) is too slow on the CPU. Would it be possible to load the model onto the CPU and quantize it layer by layer, moving each layer to the GPU just before it is quantized and offloading it back to the CPU once its quantization finishes?
Most other quantization frameworks support this; for example, GPTQModel's release notes state:

Added QuantizeConfig.device to clearly define which device is used for quantization: default = auto. Non-quantized models are always loaded on cpu by-default and each layer is moved to QuantizeConfig.device during quantization to minimize vram usage.
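For illustration, here is a minimal sketch in plain PyTorch of the flow being requested (this is not this project's or GPTQModel's actual API). It assumes a LLaMA-style `model.model.layers` list and a hypothetical per-layer routine `quantize_layer` standing in for a GPTQ/AWQ solve:

```python
import torch

def quantize_model_layerwise(model, calibration_batches, quantize_layer, device="cuda"):
    """Keep the full model on CPU; move one layer at a time to the GPU for quantization.

    quantize_layer is a hypothetical callable: (layer, calibration_batches, device) -> None.
    """
    model.to("cpu")                        # whole model stays in host RAM
    for layer in model.model.layers:       # assumes a LLaMA-style decoder layer list
        layer.to(device)                   # move only this layer onto the GPU
        quantize_layer(layer, calibration_batches, device=device)
        layer.to("cpu")                    # offload it again to free VRAM
        torch.cuda.empty_cache()
    return model
```

Peak VRAM usage is then bounded by a single layer plus the calibration activations, rather than the whole model.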
