Some models are too large to fully load into the GPU, but quantization (GPTQ and AWQ in particular) is too slow on the CPU. Is it possible to load the model onto the CPU and quantize it layer by layer, moving each layer to the GPU just before it is quantized and offloading it back to the CPU once its quantization finishes?
Most other quantization frameworks support this; for example, from the GPTQModel release notes:
Added QuantizeConfig.device to clearly define which device is used for quantization: default = auto. Non-quantized models are always loaded on cpu by default and each layer is moved to QuantizeConfig.device during quantization to minimize VRAM usage.
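To illustrate the requested behavior, here is a minimal PyTorch sketch of layer-by-layer quantization with CPU offload. It is not the GPTQModel or any other framework's actual implementation; `quantize_layer` is a hypothetical stand-in for whatever per-layer GPTQ/AWQ routine is available, and `model.model.layers` assumes a typical Hugging Face causal-LM layout.

```python
import torch

def quantize_with_offload(model, quantize_layer, device="cuda"):
    # Keep the full model on CPU; only the layer currently being
    # quantized is resident on the GPU.
    model.to("cpu")
    for layer in model.model.layers:   # decoder blocks (assumed layout)
        layer.to(device)               # move one layer to the GPU
        quantize_layer(layer)          # hypothetical per-layer GPTQ/AWQ step
        layer.to("cpu")                # offload immediately after quantizing
        torch.cuda.empty_cache()       # release the freed VRAM
    return model
```

Peak VRAM usage is then bounded by a single layer (plus its quantization workspace) rather than the whole model, which is what makes quantizing very large models feasible on a single GPU.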