Description
Some models are too large to load fully onto the GPU, but quantization (GPTQ and AWQ in particular) is too slow on the CPU. Is it possible to load the model on the CPU and quantize it layer by layer: move each layer to the GPU, quantize it there, then offload it back to the CPU once that layer's quantization finishes?
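As a rough illustration of the requested workflow, here is a minimal PyTorch sketch. The `model.model.layers` path assumes a LLaMA-style decoder stack, and `quantize_layer_fn` is a hypothetical callback standing in for the per-layer GPTQ/AWQ step:

```python
import torch

def quantize_model_layer_by_layer(model, quantize_layer_fn, device="cuda"):
    # The full model stays on CPU; only one decoder layer at a time occupies the GPU.
    for layer in model.model.layers:   # LLaMA-style layer stack (assumption)
        layer.to(device)               # move this layer to the GPU
        quantize_layer_fn(layer)       # run GPTQ/AWQ on the GPU-resident layer
        layer.to("cpu")                # offload it again once quantized
        torch.cuda.empty_cache()       # release the freed VRAM
    return model
```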
Most other quantization frameworks already support this; for example, from GPTQModel's release notes:
Added QuantizeConfig.device to clearly define which device is used for quantization (default: auto). Non-quantized models are always loaded on CPU by default, and each layer is moved to QuantizeConfig.device during quantization to minimize VRAM usage.
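For reference, a hedged sketch of how that looks with GPTQModel (call names and arguments are approximate and may differ between versions; the model id and calibration texts are placeholders):

```python
from gptqmodel import GPTQModel, QuantizeConfig

# Placeholder calibration data; real use would load a proper dataset.
calibration_dataset = [
    "The quick brown fox jumps over the lazy dog.",
    "Quantization reduces model size at a small accuracy cost.",
]

# device controls where each layer is placed while it is being quantized;
# the unquantized model itself is loaded on CPU.
quant_config = QuantizeConfig(bits=4, group_size=128, device="cuda:0")

model = GPTQModel.load("meta-llama/Llama-2-7b-hf", quant_config)
model.quantize(calibration_dataset)
model.save("Llama-2-7b-hf-gptq-4bit")
```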