SQFT: Low-cost Model Adaptation in Low-precision Sparse Foundation Models

Official implementation of SQFT: Low-cost Model Adaptation in Low-precision Sparse Foundation Models.

This repository contains the code for SQFT, an end-to-end solution for low-precision sparse parameter-efficient fine-tuning of LLMs. SQFT allows for effective model manipulation in resource-constrained environments. Specifically, the highlights of SQFT include:

  • SparsePEFT, an efficient and effective strategy for fine-tuning sparse models. Sparse adapters ensure that the base model's sparsity is preserved when the adapter is merged.
  • Quantization-aware scenarios (sparsity + quantization): QA-SparsePEFT, built on SparsePEFT, lets PEFT fine-tuning produce a single INT4, sparse model adapted to the target domain.
  • Neural Low-rank Adapter Search (NLS) strategies integrated into all pipelines and solutions.

Please refer to our paper for more details.

News

  • [2024.09.24] The SQFT paper has been released (https://aclanthology.org/2024.findings-emnlp.749) and accepted at EMNLP 2024 Findings. 📚
  • [2024.09.24] Released the code for SQFT. 🎉

Released Foundation Models 🤗

We have released several foundation models (sparse or sparse-and-quantized) for SQFT:

| Source Model | Sparsity | Sparse Model | Sparse-and-Quantized Model |
|---|---|---|---|
| Mistral-7B-v0.3 | 50% | IntelLabs/sqft-mistral-7b-v0.3-50-base | IntelLabs/sqft-mistral-7b-v0.3-50-base-gptq |
| Phi-3-mini-4k-instruct | 50% | IntelLabs/sqft-phi-3-mini-4k-50-base | IntelLabs/sqft-phi-3-mini-4k-50-base-gptq |
| Meta-Llama-3-8B | 50% | IntelLabs/sqft-llama-3-8b-50-base* | IntelLabs/sqft-llama-3-8b-50-base-gptq* |

* Llama-3 models are currently under internal review and will be released soon.

Setup

Follow the steps below to use SQFT.

pip install virtualenv
virtualenv sqft-env
source sqft-env/bin/activate

# install pytorch
pip install torch==2.5.1

# install dependencies
bash install.sh

Quick start

Training and Evaluation

We use Llama-3-8B + GSM8K as an example to show how to instantiate SQFT's pipelines. For other settings and models, see the legacy version.

Sparsification

Before fine-tuning, SQFT employs Wanda, a simple but effective pruning approach, to sparsify the language model, which then serves as the frozen base model for adapter training. Clone the Wanda repo and apply our patch:

git clone https://github.com/locuslab/wanda.git && cd wanda && git checkout 8e8fc87 && git apply ../patches/wanda-8e8fc87.patch && cd ..

Below is an example command for sparsifying Llama-3-8B with Wanda to 50% unstructured sparsity.

python wanda/main.py \
    --model meta-llama/Meta-Llama-3-8B \
    --prune_method wanda \
    --sparsity_ratio 0.5 \
    --sparsity_type unstructured \
    --save wanda_out \
    --save_model <path to sparse base model>
  • --model: The Hugging Face Hub identifier or local path of the model.
  • --sparsity_ratio: Specifies the percentage of weights to be pruned.
  • --save_model: Specifies the directory where the sparsified language model will be stored.

For further details, refer to Wanda. You can skip this step and adopt our released sparse models (see the Sparse Model column of the table above). Note that this step can use any other weight sparsification algorithm and sparsity pattern, such as the current state-of-the-art method MaskLLM (2:4 sparsity). Feel free to try other sparsification approaches for the base model before training.
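
As a quick sanity check, the snippet below measures the actual zero fraction of the pruned projection layers of a sparse base model. This is a minimal sketch using plain transformers; the model path is a placeholder for your Wanda output or one of the released sparse models:

import torch
from transformers import AutoModelForCausalLM

# e.g. "IntelLabs/sqft-phi-3-mini-4k-50-base" or your local Wanda output.
model = AutoModelForCausalLM.from_pretrained("<path to sparse base model>", torch_dtype=torch.float16)

zeros, total = 0, 0
for name, module in model.named_modules():
    # Wanda prunes the transformer blocks' linear projections,
    # not the embeddings or the LM head.
    if isinstance(module, torch.nn.Linear) and "proj" in name:
        zeros += (module.weight == 0).sum().item()
        total += module.weight.numel()
print(f"projection-layer sparsity: {zeros / total:.2%}")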

Quantization

If low precision is not required, you can skip this step and jump directly to SQFT + SparsePEFT.

Quantize the base model using GPTQ:

python utils/quantization.py --base_model_path <path to sparse base model> --output_dir <path to quantized sparse base model>

You can also skip the quantization step and adopt our released quantized models (see the Sparse-and-Quantized Model column of the table above).
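
The resulting GPTQ checkpoint (or a released one) can usually be loaded directly through transformers, provided a GPTQ backend (e.g. auto-gptq or gptqmodel, with optimum) is installed; a minimal sketch using a model ID from the table above:

from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "IntelLabs/sqft-phi-3-mini-4k-50-base-gptq"  # INT4, 50% sparse
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto")

inputs = tokenizer("Question: What is 12 * 7?\nAnswer:", return_tensors="pt").to(model.device)
print(tokenizer.decode(model.generate(**inputs, max_new_tokens=16)[0]))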

🛠️ SQFT

  • Fine-tuning
python run_sqft.py \
    --dataset_name gsm8k \
    --model_name_or_path <path to quantized sparse base model> \
    --do_train \
    --per_device_train_batch_size 16 \
    --gradient_accumulation_steps 1 \
    --num_train_epochs 3 \
    --learning_rate 1e-4 \
    --warmup_steps 100 \
    --lr_scheduler_type cosine \
    --optim adamw_torch \
    --fp16 \
    --output_dir <path to super-adapter> \
    --logging_steps 20 \
    --save_strategy epoch \
    --save_total_limit 2 \
    --lora_r 32 \
    --target_modules q_proj k_proj v_proj up_proj down_proj \
    --lora_alpha 64 \
    --lora_dropout 0.1 \
    --nls \
    --nls_target_modules q_proj k_proj v_proj \
    --search_space 32 24 16

Some explanations about parameters related to the NLS strategy:

  • --nls: whether to apply Neural Low-rank Adapter Search (NLS). When this flag is set, training uses NLS, a technique designed to optimize the low-rank adaptation of the model.
  • --nls_target_modules: specifies which modules within the model will have the elastic LoRA adapter applied. The modules listed here will be the target of the NLS process.
  • --search_space: defines the low-rank search space for NLS training, as a list of integers representing the candidate ranks considered during the search (illustrated in the sketch below).
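
Conceptually, NLS trains the LoRA matrices at the maximum rank and evaluates lower-rank sub-adapters by slicing them. Below is a minimal sketch of that slicing under the configuration above; the (alpha / rank) scaling convention here is an assumption for illustration, not necessarily the repository's exact rule:

import torch

in_dim, out_dim = 4096, 4096
max_rank, alpha = 32, 64                   # matches --lora_r 32 --lora_alpha 64
A = torch.randn(max_rank, in_dim) * 0.01   # LoRA down-projection, trained at max rank
B = torch.randn(out_dim, max_rank) * 0.01  # LoRA up-projection

def sub_adapter_delta(rank: int) -> torch.Tensor:
    # A sub-adapter keeps only the first `rank` rows of A and columns of B,
    # i.e. one point in the NLS search space.
    return (alpha / rank) * (B[:, :rank] @ A[:rank, :])

for rank in (32, 24, 16):                  # --search_space 32 24 16
    print(f"rank {rank}: delta norm = {sub_adapter_delta(rank).norm().item():.3f}")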

After the super-adapter training completes, extract the heuristic sub-adapter with the command below. More powerful sub-adapters can also be obtained through advanced search algorithms.

python utils/extract_sub_adapter.py \
  --adapter_model <path to super-adapter> \
  --elastic_adapter_config_file <path to super-adapter>/elastic_adapter_config.json \
  --adapter_version heuristic \
  --output_dir <path to sub-adapter>
  • Evaluation
lm_eval --model hf \
    --model_args pretrained=<path to quantized sparse base model>,peft=<path to sub-adapter>,add_bos_token=True \
    --tasks gsm8k \
    --batch_size auto:4 \
    --output_path result.json

We also provide a LoRA version (without the NLS strategy), so NLS can be enabled or disabled as needed.

LoRA
python run_sqft.py \
    --dataset_name gsm8k \
    --model_name_or_path <path to quantized sparse base model> \
    --do_train \
    --per_device_train_batch_size 16 \
    --gradient_accumulation_steps 1 \
    --num_train_epochs 3 \
    --learning_rate 1e-4 \
    --warmup_steps 100 \
    --lr_scheduler_type cosine \
    --optim adamw_torch \
    --fp16 \
    --output_dir <path to trained adapter> \
    --logging_steps 20 \
    --save_strategy epoch \
    --save_total_limit 2 \
    --lora_r 32 \
    --target_modules q_proj k_proj v_proj up_proj down_proj \
    --lora_alpha 64 \
    --lora_dropout 0.1 

lm_eval --model hf \
    --model_args pretrained=<path to quantized sparse base model>,peft=<path to trained adapter>,add_bos_token=True \
    --tasks gsm8k \
    --batch_size auto:4 \
    --output_path result.json

🛠️ SQFT + SparsePEFT

  • Fine-tuning
python run_sqft.py \
    --dataset_name gsm8k \
    --model_name_or_path <path to sparse base model> \
    --do_train \
    --per_device_train_batch_size 16 \
    --gradient_accumulation_steps 1 \
    --num_train_epochs 3 \
    --learning_rate 3e-4 \
    --warmup_steps 100 \
    --lr_scheduler_type cosine \
    --optim adamw_torch \
    --fp16 \
    --output_dir <path to super-adapter> \
    --logging_steps 20 \
    --save_strategy epoch \
    --save_total_limit 2 \
    --lora_r 32 \
    --target_modules q_proj k_proj v_proj up_proj down_proj \
    --lora_alpha 64 \
    --lora_dropout 0.1 \
    --nls \
    --nls_target_modules q_proj k_proj v_proj \
    --search_space 32 24 16 \
    --sparse_adapter  # enable SparsePEFT

Extract the heuristic sub-adapter:

python utils/extract_sub_adapter.py --adapter_model <path to super-adapter> --elastic_adapter_config_file <path to super-adapter>/elastic_adapter_config.json --adapter_version heuristic --output_dir <path to sub-adapter>

Merge the adapter into the base model and check the sparsity of the merged model:

python utils/merge.py --base_model_path <path to sparse base model> --adapter_model_path <path to sub-adapter> --output_path <path to merged model>
python utils/check_sparsity.py --model_path <path to merged model>
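
check_sparsity.py should report (approximately) the base model's original sparsity. The reason is that SparsePEFT masks the adapter update with the base weights' sparsity pattern before adding it. A minimal sketch of that idea with toy tensors (the masking rule shown is our reading of the method, not a copy of utils/merge.py):

import torch

# Toy sparse frozen base weight W and a rank-r LoRA pair (B, A).
out_dim, in_dim, rank, alpha = 64, 64, 8, 16
W = torch.randn(out_dim, in_dim)
W[torch.rand_like(W) < 0.5] = 0.0          # ~50% unstructured sparsity
A = torch.randn(rank, in_dim) * 0.01
B = torch.randn(out_dim, rank) * 0.01

# Mask the adapter update with the base model's sparsity pattern so the
# merged weight keeps exactly the same zeros as W.
mask = (W != 0).float()
W_merged = W + mask * (alpha / rank) * (B @ A)

assert torch.equal(W_merged == 0, W == 0)  # sparsity pattern preserved
print(f"merged sparsity: {(W_merged == 0).float().mean().item():.2%}")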
  • Evaluation
lm_eval --model hf \
    --model_args pretrained=<path to merged model>,add_bos_token=True \
    --tasks gsm8k \
    --batch_size auto:4 \
    --output_path result.json
LoRA
python run_sqft.py \
    --dataset_name gsm8k \
    --model_name_or_path <path to sparse base model> \
    --do_train \
    --per_device_train_batch_size 16 \
    --gradient_accumulation_steps 1 \
    --num_train_epochs 3 \
    --learning_rate 3e-4 \
    --warmup_steps 100 \
    --lr_scheduler_type cosine \
    --optim adamw_torch \
    --fp16 \
    --output_dir <path to trained adapter> \
    --logging_steps 20 \
    --save_strategy epoch \
    --save_total_limit 2 \
    --lora_r 32 \
    --target_modules q_proj k_proj v_proj up_proj down_proj \
    --lora_alpha 64 \
    --lora_dropout 0.1 \
    --sparse_adapter  # enable SparsePEFT

python utils/merge.py --base_model_path <path to sparse base model> --adapter_model_path <path to trained adapter> --output_path <path to merged model>
python utils/check_sparsity.py --model_path <path to merged model>

lm_eval --model hf \
    --model_args pretrained=<path to merged model>,add_bos_token=True \
    --tasks gsm8k \
    --batch_size auto:4 \
    --output_path result.json

🛠️ SQFT + QA-SparsePEFT

  • Fine-tuning
python run_sqft.py \
    --dataset_name gsm8k \
    --model_name_or_path <path to quantized sparse base model> \
    --non_quant_model_name_or_path <path to sparse base model> \
    --do_train \
    --per_device_train_batch_size 16 \
    --gradient_accumulation_steps 1 \
    --num_train_epochs 4 \
    --learning_rate 3e-4 \
    --warmup_steps 100 \
    --lr_scheduler_type cosine \
    --optim adamw_torch \
    --fp16 \
    --output_dir <path to super-adapter> \
    --logging_steps 20 \
    --save_strategy epoch \
    --save_total_limit 2 \
    --lora_r 32 \
    --target_modules q_proj k_proj v_proj up_proj down_proj \
    --lora_alpha 64 \
    --lora_dropout 0.1 \
    --nls \
    --nls_target_modules q_proj k_proj v_proj \
    --search_space 32 24 16 \
    --sparse_adapter \
    --quantization_aware # enable quantization-aware SparsePEFT

Extract the heuristic sub-adapter:

python utils/extract_sub_adapter.py --adapter_model <path to super-adapter> --elastic_adapter_config_file <path to super-adapter>/elastic_adapter_config.json --adapter_version heuristic --output_dir <path to sub-adapter>

Merge the adapter into the quantized base model and check the sparsity of the merged model:

python utils/merge.py \
  --base_model_path <path to quantized sparse base model> \
  --non_quant_base_model_path <path to sparse base model> \
  --adapter_model_path <path to sub-adapter> \
  --output_path <path to merged model>

python utils/check_sparsity.py --model_path <path to merged model>
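
One way to see why a single INT4, sparse model can come out of this merge: if the masked adapter delta is folded into the dequantized weights and the result is re-quantized with the same scales, every value stays on the INT4 grid and every pruned weight stays zero. The symmetric per-row scheme below is a deliberate simplification of GPTQ for illustration, not the exact SQFT procedure:

import torch

def quantize_int4(w: torch.Tensor, scale: torch.Tensor) -> torch.Tensor:
    # Symmetric quantization to the INT4 grid: integers in [-8, 7].
    return torch.clamp(torch.round(w / scale), -8, 7)

# Toy sparse weight and its INT4 representation.
W = torch.randn(8, 16)
W[torch.rand_like(W) < 0.5] = 0.0
scale = (W.abs().amax(dim=1, keepdim=True) / 7).clamp(min=1e-8)
W_q = quantize_int4(W, scale)

# Masked adapter delta, as in SparsePEFT (toy random update here).
delta = 0.01 * torch.randn_like(W) * (W != 0).float()

# Quantization-aware merge idea: fold the delta into the dequantized weight
# and re-quantize with the SAME scales, so the output is still one INT4
# sparse model.
W_merged_q = quantize_int4(W_q * scale + delta, scale)
print(W_merged_q.unique())  # values remain INT4 integers; zeros stay zero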
  • Evaluation
lm_eval --model hf \
    --model_args pretrained=<path to merged model>,add_bos_token=True \
    --tasks gsm8k \
    --batch_size auto:4 \
    --output_path result.json
LoRA
python run_sqft.py \
    --dataset_name gsm8k \
    --model_name_or_path <path to quantized sparse base model> \
    --non_quant_model_name_or_path <path to sparse base model> \
    --do_train \
    --per_device_train_batch_size 16 \
    --gradient_accumulation_steps 1 \
    --num_train_epochs 4 \
    --learning_rate 3e-4 \
    --warmup_steps 100 \
    --lr_scheduler_type cosine \
    --optim adamw_torch \
    --fp16 \
    --output_dir <path to trained adapter> \
    --logging_steps 20 \
    --save_strategy epoch \
    --save_total_limit 2 \
    --lora_r 32 \
    --target_modules q_proj k_proj v_proj up_proj down_proj \
    --lora_alpha 64 \
    --lora_dropout 0.1 \
    --sparse_adapter \
    --quantization_aware # enable quantization-aware SparsePEFT

python utils/merge.py \
  --base_model_path <path to quantized sparse base model> \
  --non_quant_base_model_path <path to sparse base model> \
  --adapter_model_path <path to trained adapter> \
  --output_path <path to merged model>

python utils/check_sparsity.py --model_path <path to merged model>

lm_eval --model hf \
    --model_args pretrained=<path to merged model>,add_bos_token=True \
    --tasks gsm8k \
    --batch_size auto:4 \
    --output_path result.json

Released Fine-tuned Models 🤗

  • Mistral-7B-v0.3
| Base Model | Task | Method | Fine-tuned Model |
|---|---|---|---|
| sqft-mistral-7b-v0.3-50-base | GSM8K | SQFT + SparsePEFT | sqft-sparsepeft-mistral-7b-v0.3-50-gsm8k-heu |
| sqft-mistral-7b-v0.3-50-base-gptq | GSM8K | SQFT | sqft-mistral-7b-v0.3-50-gptq-gsm8k-heu-adapter |
| sqft-mistral-7b-v0.3-50-base-gptq | GSM8K | SQFT + QA-SparsePEFT | sqft-qa-sparsepeft-mistral-7b-v0.3-50-gptq-gsm8k-heu |
| sqft-mistral-7b-v0.3-50-base | Math | SQFT + SparsePEFT | sqft-sparsepeft-mistral-7b-v0.3-50-math-heu |
| sqft-mistral-7b-v0.3-50-base-gptq | Math | SQFT | sqft-mistral-7b-v0.3-50-gptq-math-heu-adapter |
| sqft-mistral-7b-v0.3-50-base-gptq | Math | SQFT + QA-SparsePEFT | sqft-qa-sparsepeft-mistral-7b-v0.3-50-gptq-math-heu |
  • Phi-3-mini-4k-instruct
| Base Model | Task | Method | Fine-tuned Model |
|---|---|---|---|
| sqft-phi-3-mini-4k-50-base | Math | SQFT + SparsePEFT | sqft-sparsepeft-phi-3-mini-4k-50-math-heu |
| sqft-phi-3-mini-4k-50-base-gptq | Math | SQFT | sqft-phi-3-mini-4k-50-gptq-math-heu-adapter |
| sqft-phi-3-mini-4k-50-base-gptq | Math | SQFT + QA-SparsePEFT | sqft-qa-sparsepeft-phi-3-mini-4k-50-gptq-math-heu |
| sqft-phi-3-mini-4k-50-base | CS | SQFT + SparsePEFT | sqft-sparsepeft-phi-3-mini-4k-50-cs-heu |
| sqft-phi-3-mini-4k-50-base-gptq | CS | SQFT | sqft-phi-3-mini-4k-50-gptq-cs-heu-adapter |
| sqft-phi-3-mini-4k-50-base-gptq | CS | SQFT + QA-SparsePEFT | sqft-qa-sparsepeft-phi-3-mini-4k-50-gptq-cs-heu |
  • Meta-Llama-3-8B
| Base Model | Task | Method | Fine-tuned Model |
|---|---|---|---|
| sqft-llama-3-8b-50-base* | GSM8K | SQFT + SparsePEFT | sqft-sparsepeft-llama-3-8b-50-gsm8k-heu |
| sqft-llama-3-8b-50-base-gptq | GSM8K | SQFT | sqft-llama-3-8b-50-gptq-gsm8k-heu-adapter |
| sqft-llama-3-8b-50-base-gptq | GSM8K | SQFT + QA-SparsePEFT | sqft-qa-sparsepeft-llama-3-8b-50-gptq-gsm8k-heu |

* Llama-3 models are currently under internal review and will be released soon.
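
The adapter-style releases (names ending in -adapter) are applied on top of their base model at load time with peft; a minimal sketch, assuming the checkpoints are hosted under the IntelLabs organization on the Hugging Face Hub and a GPTQ backend is installed:

from peft import PeftModel
from transformers import AutoModelForCausalLM, AutoTokenizer

base_id = "IntelLabs/sqft-mistral-7b-v0.3-50-base-gptq"
adapter_id = "IntelLabs/sqft-mistral-7b-v0.3-50-gptq-gsm8k-heu-adapter"

tokenizer = AutoTokenizer.from_pretrained(base_id)
model = AutoModelForCausalLM.from_pretrained(base_id, device_map="auto")
model = PeftModel.from_pretrained(model, adapter_id)  # attach the fine-tuned adapter

prompt = "Question: A robe takes 2 bolts of blue fiber and half that much white fiber. How many bolts in total does it take?\nAnswer:"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
print(tokenizer.decode(model.generate(**inputs, max_new_tokens=64)[0]))

The merged releases (the SparsePEFT and QA-SparsePEFT rows) are standalone models and load directly with AutoModelForCausalLM, with no peft step required.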

Citation

If you find SQFT's code and paper helpful, please cite:

@inproceedings{munoz-etal-2024-sqft,
    title = "{SQFT}: Low-cost Model Adaptation in Low-precision Sparse Foundation Models",
    author = "Munoz, Juan Pablo  and
      Yuan, Jinjie  and
      Jain, Nilesh",
    editor = "Al-Onaizan, Yaser  and
      Bansal, Mohit  and
      Chen, Yun-Nung",
    booktitle = "Findings of the Association for Computational Linguistics: EMNLP 2024",
    month = nov,
    year = "2024",
    address = "Miami, Florida, USA",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2024.findings-emnlp.749",
    pages = "12817--12832",
}