Add support for gptqmodel quantization. This is a replacement for
auto-gptq.
For now, both packages are supported, but since auto-gptq is no longer
being developed, it will be deprecated and removed at some point in the
future.
---------
Signed-off-by: jiqing-feng <[email protected]>
Co-authored-by: LRL-ModelCloud <[email protected]>
Co-authored-by: Qubitium-ModelCloud <[email protected]>
Co-authored-by: ZX-ModelCloud <[email protected]>
Co-authored-by: LRL <[email protected]>
Co-authored-by: Steven Liu <[email protected]>
You can learn more about GPTQ-based `[2, 3, 4, 8]`-bit quantization at [GPTQModel](https://github.com/ModelCloud/GPTQModel) and in the Transformers [GPTQ](https://huggingface.co/docs/transformers/quantization/gptq) docs. For post-quantization training, PEFT works with either the [GPTQModel](https://github.com/ModelCloud/GPTQModel) or [AutoGPTQ](https://github.com/autogptq/autogptq) library, but we recommend GPTQModel because AutoGPTQ will be deprecated in a future release.
```bash
# gptqmodel install
pip install gptqmodel --no-build-isolation
```

```py
from transformers import AutoModelForCausalLM, AutoTokenizer, GPTQConfig

# Example model; any causal language model on the Hub works the same way.
model_id = "facebook/opt-125m"
tokenizer = AutoTokenizer.from_pretrained(model_id)

# 4-bit GPTQ quantization with a standard calibration dataset and group size.
gptq_config = GPTQConfig(bits=4, group_size=128, dataset="wikitext2", tokenizer=tokenizer)
quantized_model = AutoModelForCausalLM.from_pretrained(
    model_id, device_map="auto", quantization_config=gptq_config
)
```

Once quantized, you can post-train GPTQ models with PEFT APIs.
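
For example, here is a minimal sketch of attaching a LoRA adapter to the quantized model from the snippet above; the rank, alpha, dropout, and target modules are illustrative choices, not required values.

```py
from peft import LoraConfig, get_peft_model

# Illustrative LoRA hyperparameters; tune them for your model and task.
peft_config = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],  # attention projections in OPT-style models
    task_type="CAUSAL_LM",
)

model = get_peft_model(quantized_model, peft_config)
model.print_trainable_parameters()  # only the adapter weights are trainable
```

The trained adapter can then be saved with `model.save_pretrained()` and later loaded on top of the same quantized base model.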
## AQLM quantization
Additive Quantization of Language Models ([AQLM](https://arxiv.org/abs/2401.06118)) is a compression method for large language models. It quantizes multiple weights together, taking advantage of the interdependencies between them, and represents groups of 8-16 weights as a sum of multiple vector codes. This allows models to be compressed to as low as 2 bits with comparatively small accuracy loss.
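
As a sketch, loading an AQLM checkpoint and attaching an adapter follows the same pattern as GPTQ. This assumes the `aqlm` package is installed, and the model ID below is a hypothetical placeholder; substitute a real AQLM-quantized checkpoint from the Hub.

```py
from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

# Placeholder ID: replace with an actual AQLM-quantized checkpoint from the Hub.
quantized_model = AutoModelForCausalLM.from_pretrained(
    "org/model-aqlm-2bit",  # hypothetical example
    torch_dtype="auto",
    device_map="auto",
)

peft_config = LoraConfig(task_type="CAUSAL_LM")
model = get_peft_model(quantized_model, peft_config)
```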