"Big-little" Llama
LQER runs a high-rank low-precision GEMM and a group of low-rank high-precision GEMMs in parallel to push the limits of lossless LLM PTQ.
The DeepWok Lab is an ML research group led by Dr. Aaron Zhao, whose members are mainly from Imperial College London and the University of Cambridge.
- 🎉🎉🎉 Our work has been accepted to ICML 2024.
LQER is a post-training-quantization method that
- shapes the singular value distribution of approximated quantization error;
- enjoys a static compute pattern and unified memory/compute number formats;
- eliminates the need for grid search, knowledge distillation, or other forms of iterative optimization;
- achieves near-lossless W4A8 LLM PTQ while using 1.36$\times$ fewer hardware resources than SoTA methods.
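The README does not spell out the decomposition, so the following is only a minimal NumPy sketch of the idea suggested by the method's name (low-rank quantization error reconstruction): quantize the weight, take a truncated SVD of the quantization error, and add the low-rank term back as a second, parallel matrix product. The symmetric 4-bit quantizer is a simple stand-in rather than the number format used in the paper, and `fake_quantize_int4`, `lqer_factors`, and the optional `act_scale` weighting are hypothetical names and assumptions introduced for illustration.

```python
import numpy as np

def fake_quantize_int4(w):
    """Symmetric per-tensor 4-bit fake quantization (stand-in quantizer, an assumption)."""
    scale = np.abs(w).max() / 7.0
    return np.clip(np.round(w / scale), -8, 7) * scale

def lqer_factors(w, w_q, rank, act_scale=None):
    """Return (a_k, b_k) so that w_q + a_k @ b_k approximates w.

    act_scale is an optional per-input-channel weight (assumed here to come
    from calibration data); it rescales the rows of the error before the SVD.
    """
    err = w - w_q
    if act_scale is None:
        act_scale = np.ones(w.shape[0])
    # Truncated SVD of the (optionally row-scaled) quantization error.
    u, s, vt = np.linalg.svd(act_scale[:, None] * err, full_matrices=False)
    a_k = (u[:, :rank] * s[:rank]) / act_scale[:, None]  # undo the row scaling
    b_k = vt[:rank]
    return a_k, b_k

rng = np.random.default_rng(0)
w = rng.standard_normal((64, 32))   # weight, shape (in_features, out_features)
x = rng.standard_normal((8, 64))    # a batch of activations

w_q = fake_quantize_int4(w)
a_k, b_k = lqer_factors(w, w_q, rank=8)

# One high-rank low-precision product plus a low-rank high-precision correction.
y_lqer = x @ w_q + (x @ a_k) @ b_k

err_plain = np.linalg.norm(w - w_q)
err_lqer = np.linalg.norm(w - (w_q + a_k @ b_k))
print(f"weight error without correction: {err_plain:.4f}")
print(f"weight error with rank-8 correction: {err_lqer:.4f}")
```

Because the correction is computed once after quantization, the sketch needs no search or iterative optimization; the rank is the only knob.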
Anaconda is recommended. Run the following commands to create a conda environment named lqer.
conda env create -f environment.yml
conda run -n lqer python -m pip install -r requirements.txt
Note that this lqer env is only for running LQER experiments. The baseline methods included in the paper, such as AWQ, GPTQ, and LLM.int4(), require a separate environment setup. Please follow the Hugging Face Transformers quantization guide to replicate the baseline results.
Entry point is at experiments/pipeline/pipeline.py. This pipeline.py
performs data calibration, approximation, software-emulated quantization, perplexity evaluation, and downstream task evaluation for
`pipeline.py` requires one argument, `CONFIG`, which should be a TOML file that specifies the experiment settings:
cd experiments/pipeline
conda run -n lqer python pipeline.py CONFIG
Please refer to the toml templates in ./experiments/configs/template for the configuration file format.
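When launching several runs, it can be convenient to assemble the command above from Python. The sketch below only mirrors the documented invocation; `build_pipeline_cmd` and `run_pipeline` are hypothetical helpers and not part of the repository.

```python
import subprocess
from pathlib import Path

PIPELINE_DIR = Path("experiments/pipeline")  # directory named in this README

def build_pipeline_cmd(config, env="lqer"):
    """Return the argv list for `conda run -n lqer python pipeline.py CONFIG`."""
    return ["conda", "run", "-n", env, "python", "pipeline.py", str(config)]

def run_pipeline(config):
    """Run one experiment from the pipeline directory; raises on a non-zero exit code."""
    subprocess.run(build_pipeline_cmd(config), cwd=PIPELINE_DIR, check=True)

# Build (but do not execute) the command for one config template.
cmd = build_pipeline_cmd("../configs/template/llama-7b.toml")
print(" ".join(cmd))
```

Calling `run_pipeline(...)` would execute the command and therefore assumes the `lqer` conda environment already exists.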
Scripts for replicating paper results:
Script | Note |
---|---|
experiments/pipeline/sweep_lqer_act.sh | W4A8 |
experiments/pipeline/sweep_lqer_act_int.sh | W4A8 |
experiments/pipeline/sweep_lqer_svd.sh | Baseline W4A8 |
experiments/pipeline/sweep_baseline_no_lqer.sh | Baseline W4A8 MXINT w/o |
Under the hood, these scripts call `pipeline.py` and override key-value pairs in the passed config template to generate the corresponding experiment setups.
Each script requires two arguments, `CONFIG` and `TAG`. `CONFIG` should be a config template TOML file, and `TAG` is a string used to name the output directory.
cd experiments/pipeline
./sweep_xxx.sh CONFIG TAG
For example, to replicate the W4A8
cd experiments/pipeline
./sweep_lqer_act.sh ../configs/template/llama-7b.toml my-llama-7b-tag
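The override step performed by the sweep scripts can be pictured with a small Python sketch. It illustrates the general technique only, not the scripts' actual implementation: the nested dictionary stands in for a parsed config template (for instance one read with `tomllib` on Python 3.11+), and `apply_overrides` along with the keys `model.name` and `quant.rank` are hypothetical.

```python
import copy

def apply_overrides(template, overrides):
    """Return a copy of `template` with dotted-key overrides applied."""
    config = copy.deepcopy(template)  # leave the template untouched
    for dotted_key, value in overrides.items():
        *parents, leaf = dotted_key.split(".")
        node = config
        for name in parents:
            node = node.setdefault(name, {})  # create missing tables on the way down
        node[leaf] = value
    return config

# Hypothetical template contents; the real keys are defined by the toml templates.
template = {"model": {"name": "llama-7b"}, "quant": {"rank": 32}}

# One experiment setup per swept value.
sweep = [apply_overrides(template, {"quant.rank": r}) for r in (8, 16, 32)]
for cfg in sweep:
    print(cfg)
```

Copying the template before each override keeps the runs independent, so one sweep entry cannot leak settings into the next.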
If you find this work helpful, please consider citing:
@article{zhang2024lqer,
title={LQER: Low-Rank Quantization Error Reconstruction for LLMs},
author={Zhang, Cheng and Cheng, Jianyi and Constantinides, George A and Zhao, Yiren},
journal={arXiv preprint arXiv:2402.02446},
year={2024}
}