Description
Some models are too large to load fully onto the GPU, but quantization (GPTQ and AWQ in particular) is too slow on the CPU. Is it possible to load the model on the CPU and quantize it layer by layer: move each layer to the GPU, quantize it there, then offload it back to the CPU once that layer's quantization finishes?
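As a rough illustration of the requested workflow, here is a minimal PyTorch sketch. The `model.model.layers` path assumes a LLaMA-style decoder stack, and `quantize_layer_fn` is a hypothetical callback standing in for the per-layer GPTQ/AWQ step:

```python
import torch

def quantize_model_layer_by_layer(model, quantize_layer_fn, device="cuda"):
    # The full model stays on CPU; only one decoder layer at a time occupies the GPU.
    for layer in model.model.layers:   # LLaMA-style layer stack (assumption)
        layer.to(device)               # move this layer to the GPU
        quantize_layer_fn(layer)       # run GPTQ/AWQ on the GPU-resident layer
        layer.to("cpu")                # offload it again once quantized
        torch.cuda.empty_cache()       # release the freed VRAM
    return model
```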
Most other quantization frameworks already support this; for example, from GPTQModel's release notes:
Added QuantizeConfig.device to clearly define which device is used for quantization (default: auto). Non-quantized models are always loaded on CPU by default, and each layer is moved to QuantizeConfig.device during quantization to minimize VRAM usage.
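For reference, a hedged sketch of how that looks with GPTQModel (call names and arguments are approximate and may differ between versions; the model id and calibration texts are placeholders):

```python
from gptqmodel import GPTQModel, QuantizeConfig

# Placeholder calibration data; real use would load a proper dataset.
calibration_dataset = [
    "The quick brown fox jumps over the lazy dog.",
    "Quantization reduces model size at a small accuracy cost.",
]

# device controls where each layer is placed while it is being quantized;
# the unquantized model itself is loaded on CPU.
quant_config = QuantizeConfig(bits=4, group_size=128, device="cuda:0")

model = GPTQModel.load("meta-llama/Llama-2-7b-hf", quant_config)
model.quantize(calibration_dataset)
model.save("Llama-2-7b-hf-gptq-4bit")
```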