Official implementation of *Fine-Grained Training-Free Structure Removal in Foundation Models*.
This repo contains the code for MultiPruner, a novel pruning approach that surpasses recent training-free pruning methods, e.g., BlockPruner (Zhong et al., 2024) and ShortGPT (Men et al., 2024), by adopting a multidimensional, iterative, fine-grained pruning strategy. Please refer to our paper for more details.
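At a high level, the approach proceeds in stages over different granularities (whole blocks for depth pruning, then channel groups in attention and MLP for width pruning) and, at each step, greedily removes the candidate structure whose removal is least harmful. The toy snippet below is only an illustrative sketch of this greedy multi-granularity loop; all names and scores in it are made up, and the actual implementation ranks candidates by perplexity on a calibration set rather than fixed costs.

```python
# Toy, runnable illustration of an iterative multi-granularity pruning loop.
# Scores are made-up "removal costs"; MultiPruner derives them from
# calibration perplexity instead.
stages = {
    "depth (blocks)":  {"block_07": 0.02, "block_23": 0.01},
    "attention width": {"attn_group_3": 0.05, "attn_group_9": 0.03},
    "mlp width":       {"mlp_group_12": 0.015, "mlp_group_40": 0.02},
}
budget_per_stage = 1  # toy number of structures each stage may remove

for stage, candidates in stages.items():
    for _ in range(budget_per_stage):
        best = min(candidates, key=candidates.get)  # least harmful removal
        del candidates[best]                        # "prune" it
        print(f"[{stage}] removed {best}")
```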
- [2024.12.14] Release the code for MultiPruner. 🎉
- Llama
- Qwen
- Baichuan
All pruning result configurations and pruning commands are available here.
Use the following instructions to create a virtual environment with the required dependencies.
```bash
# install dependencies
bash install.sh
```
We use the `meta-llama/Llama-2-7b-hf` model as an example.
```bash
python run_multipruner.py \
    --model_path meta-llama/Llama-2-7b-hf \
    --output_path <path to pruning results> \
    --weight_reorder \
    --do_prune \
    --target_ratio 22.00 \
    --pruning_distribution 44:52:4 \
    --mlp_channel_group_size 1024 \
    --attn_channel_group_size 128 \
    --importance_metric ppl \
    --calibration_dataset alpaca \
    --num_calibration_samples_block 256 \
    --num_calibration_samples_width 128 \
    --do_eval
```
- `model_path`: Path to the pre-trained model.
- `output_path`: Directory to save the pruning and evaluation results.
- `weight_reorder`: Indicates that weight reordering should be performed in Attn and MLP.
- `do_prune`: Flag to indicate whether to perform pruning.
- `target_ratio`: Target pruning ratio.
- `pruning_distribution`: How the target pruning ratio is distributed across the different granularities.
- `mlp_channel_group_size`: Number of channels per group (MLP).
- `attn_channel_group_size`: Number of channels per group (Attn); generally a multiple of the head dimension.
- `importance_metric`: Metric for calculating block importance; currently only PPL is supported (see the sketch below).
- `calibration_dataset`: Calibration dataset name ("alpaca", "c4", "ptb", or "wikitext2").
- `num_calibration_samples_block`: Number of calibration samples to use for depth (block) pruning (stage 1).
- `num_calibration_samples_width`: Number of calibration samples to use for width pruning (stages 2 and 3).
- `do_eval`: Flag to indicate whether to perform evaluation.
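For intuition, the `ppl` metric measures how much a candidate removal degrades perplexity on the calibration set. Below is a minimal sketch of such a score, assuming a Hugging Face causal LM and pre-tokenized calibration batches; `calibration_ppl` is an illustrative helper, not the repo's implementation.

```python
import torch

@torch.no_grad()
def calibration_ppl(model, batches):
    """Perplexity over pre-tokenized calibration batches.

    Each element of `batches` is a LongTensor of token ids with shape
    (batch_size, seq_len). Passing `labels=input_ids` makes the Hugging Face
    model return the shifted cross-entropy loss.
    """
    losses = []
    for input_ids in batches:
        out = model(input_ids=input_ids, labels=input_ids)
        losses.append(out.loss)
    return torch.exp(torch.stack(losses).mean()).item()

# A structure is considered unimportant when its removal
# increases this calibration PPL the least.
```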
The final compressed model can be extracted based on the optimal pruning configuration obtained from MultiPruner. For more details, please refer to this link. Below is an example of how to extract a pruned Llama-2-7B:
```bash
python extract/extract_model.py \
    --model_path meta-llama/Llama-2-7b-hf \
    --weight_reorder \
    --pruned_model_config_file <path to pruning results>/pruning_config.json \
    --output_path <path to compressed model>
```
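Assuming the extracted checkpoint follows the standard Hugging Face directory layout, it can then be loaded like any other model; the path below is a placeholder.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_dir = "<path to compressed model>"  # directory produced by extract_model.py
model = AutoModelForCausalLM.from_pretrained(model_dir, torch_dtype="auto")
tokenizer = AutoTokenizer.from_pretrained(model_dir)
```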
After obtaining the pruned model, we can use the Alpaca dataset for recovery fine-tuning. More details can be found here. The following is an example command for the compressed Llama-2-7B:
```bash
# Finetune the compressed model
python recovery/finetune.py \
    --model_path <path to compressed model> \
    --do_train \
    --batch_size 8 \
    --gradient_accumulation_steps 4 \
    --num_train_epochs 2 \
    --learning_rate 1e-4 \
    --lora \
    --lora_r 16 \
    --lora_alpha 32 \
    --lora_target_modules q_proj,k_proj,v_proj,o_proj,down_proj,up_proj,gate_proj \
    --output_path <path to finetuned compressed model> \
    --do_eval
```
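For reference, the LoRA flags above roughly correspond to the following configuration in the PEFT library; this is just a sketch of the equivalent settings, and `recovery/finetune.py` may construct its adapter differently.

```python
from peft import LoraConfig

lora_config = LoraConfig(
    r=16,           # --lora_r
    lora_alpha=32,  # --lora_alpha
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                    "down_proj", "up_proj", "gate_proj"],  # --lora_target_modules
    task_type="CAUSAL_LM",
)
```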
We provide example commands (covering both pruning and recovery tuning) and the corresponding pruning configurations of MultiPruner, which can be found here.
Method | Pruning Ratio | Acc. (%) | WikiText2 PPL |
---|---|---|---|
Dense | / | 67.67 | 7.81 |
BlockPruner | 9% | 62.31 | 13.07 |
MultiPruner | 9% | 64.04 | 10.46 |
Method | Pruning Ratio | Acc. (%) | WikiText2 PPL |
---|---|---|---|
Dense | / | 73.75 | 6.24 |
BlockPruner | 10% | 66.75 | 10.58 |
MultiPruner | 10% | 69.27 | 8.93 |
BlockPruner | 20% | 59.08 | 15.37 |
MultiPruner | 20% | 63.07 | 13.86 |
Compared to LLM-Pruner (Pruning ratio: ~17%):
Method | WikiText2 PPL (Seq Len: 2048) |
---|---|
Dense | 6.24 |
BlockPruner | 13.78 |
LLM-Pruner (L2) | 49.09 |
LLM-Pruner (Taylor) | 12.71 |
MultiPruner (10:90:0) | 11.64 |
Method | Pruning Ratio | Acc. (%) | WikiText2 PPL |
---|---|---|---|
Dense | / | 72.73 | 6.14 |
BlockPruner | 10% | 66.46 | 10.88 |
MultiPruner | 10% | 69.03 | 8.19 |
BlockPruner | 20% | 57.59 | 22.36 |
MultiPruner | 20% | 63.02 | 16.01 |
Compared to LLM-Pruner (Pruning ratio: ~17%):
Method | WikiText2 PPL (Seq Len: 2048) |
---|---|
Dense | 6.14 |
BlockPruner | 16.15 |
LLM-Pruner (L2) | 34.13 |
LLM-Pruner (Taylor) | 12.86 |
MultiPruner (10:90:0) | 11.11 |
Method | Pruning Ratio | Acc. (%) | WikiText2 PPL |
---|---|---|---|
Dense | / | 72.04 | 6.85 |
BlockPruner | 10% | 67.44 | 9.88 |
MultiPruner | 10% | 69.71 | 9.15 |
BlockPruner | 20% | 57.44 | 17.17 |
MultiPruner | 20% | 62.82 | 13.37 |
For additional results and discussions on other models, please refer to the paper.
In addition, we also explored pruning ratios that result in 1%, 2%, and 3% accuracy degradation (compared to Dense), under both `without finetune` and `with finetune` scenarios. This investigation may facilitate practical applications. The results for Llama-2-7B are shown in the following table:
Method | Pruning Ratio | Acc. (%) | Acc. Drop | Relative Acc. |
---|---|---|---|---|
Dense | / | 68.96 | / | 100% |
MultiPruner w/o finetune | 7% | 67.94 | -1.02% | 98.52% |
MultiPruner w/o finetune | 10% | 67.02 | -1.94% | 97.19% |
MultiPruner w/o finetune | 14% | 65.93 | -3.03% | 95.61% |
MultiPruner w/ finetune | 12% | 68.28 | -0.68% | 99.01% |
MultiPruner w/ finetune | 15% | 67.41 | -1.55% | 97.75% |
MultiPruner w/ finetune | 18% | 66.16 | -2.80% | 95.94% |
In all tables, `Acc. (%)` represents the average accuracy across five tasks: `piqa`, `winogrande`, `hellaswag`, `arc_easy`, and `arc_challenge`.
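Concretely, this score is the unweighted mean of the five per-task accuracies, for example (the numbers below are made up purely for illustration):

```python
# Illustrative values only; real scores come from the evaluation harness.
scores = {"piqa": 78.0, "winogrande": 69.1, "hellaswag": 57.2,
          "arc_easy": 76.3, "arc_challenge": 43.5}
print(f"Acc. (%): {sum(scores.values()) / len(scores):.2f}")
```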
To evaluate a compressed (and optionally fine-tuned) model:

```bash
python eval.py --model_path <path to compressed model> --output_path <path to evaluation results>
```
MultiPruner benefits from the following work:
```bibtex
@article{zhong2024blockpruner,
  title={BlockPruner: Fine-grained Pruning for Large Language Models},
  author={Zhong, Longguang and Wan, Fanqi and Chen, Ruijun and Quan, Xiaojun and Li, Liangzhi},
  journal={arXiv preprint arXiv:2406.10594},
  year={2024}
}
```
If you find MultiPruner's code and paper helpful, please kindly cite:
```bibtex
@article{munoz2025multipruner,
  title={Fine-Grained Training-Free Structure Removal in Foundation Models},
  author={J. Pablo Munoz and Jinjie Yuan and Nilesh Jain},
  year={2025},
  url={}
}
```