Skip to content

Latest commit

 

History

History

MultiPruner

Official implementation of Fine-Grained Training-Free Structure Removal in Foundation Models.

This repo contains the code for MultiPruner, a novel pruning approach that surpasses recent training-free pruning methods, e.g., BlockPruner (Zhong el al., 2024) and ShortGPT (Men et al., 2024), by adopting a multidimensional, iterative, fine-grained pruning strategy. Please refer to our paper for more details.

News

  • [2024.12.14] Release the code for MultiPruner. 🎉

Supported Models 🤗

All pruning result configurations and pruning commands are available here.

Setup

Use the following instructions to create a virtual environment with the required dependencies.

# install dependencies
bash install.sh

Run

We use meta-llama/Llama-2-7b-hf model as an example.

Prune

python run_multipruner.py \
  --model_path meta-llama/Llama-2-7b-hf \
  --output_path <path to pruning results> \
  --weight_reorder \
  --do_prune \
  --target_ratio 22.00 \
  --pruning_distribution 44:52:4 \
  --mlp_channel_group_size 1024 \
  --attn_channel_group_size 128 \
  --importance_metric ppl \
  --calibration_dataset alpaca \
  --num_calibration_samples_block 256 \
  --num_calibration_samples_width 128 \
  --do_eval
  • model_path: Path to the pre-trained model.
  • output_path: Directory to save the pruning and evaluation results.
  • weight_reorder: Indicates that weight reordering should be performed in Attn and MLP.
  • do_prune: Flag to indicate whether to perform pruning.
  • target_ratio: Target pruning ratio.
  • pruning_distribution: Pruning ratio distribution for different granularities.
  • mlp_channel_group_size: Number of channels for each group (MLP).
  • attn_channel_group_size: Number of channels for each group (Attn), generally a multiple of the head dimension.
  • importance_metric: Metric for calculating block importance, currently only supports PPL.
  • calibration_dataset: Calibration dataset name ("alpaca", "c4", "ptb" or "wikitext2").
  • num_calibration_samples_block: Number of calibration samples to use for depth (block) pruning (stage 1).
  • num_calibration_samples_width: Number of calibration samples to use for width pruning (stage 2 and 3).
  • do_eval: Flag to indicate whether to perform evaluation.

Extract the Compressed Model

The final compressed model can be extracted based on the optimal pruning configuration obtained from MultiPruner. For more details, please refer to this link. Below is an example of how to extract a pruned Llama-2-7B:

python extract/extract_model.py \
  --model_path meta-llama/Llama-2-7b-hf \
  --weight_reorder \
  --pruned_model_config_file <path to pruning results>/pruning_config.json \
  --output_path <path to compressed model>

Recovery Finetuning

After we have obtained the pruned model, we can use the Alpaca dataset for recovery fine-tuning. More details can be found here. The following is an example command for the compressed Llama-2-7B:

# Finetune the compressed model
python recovery/finetune.py \
  --model_path <path to compressed model> \
  --do_train \
  --batch_size 8 \
  --gradient_accumulation_steps 4 \
  --num_train_epochs 2 \
  --learning_rate 1e-4 \
  --lora \
  --lora_r 16 \
  --lora_alpha 32 \
  --lora_target_modules q_proj,k_proj,v_proj,o_proj,down_proj,up_proj,gate_proj \
  --output_path <path to finetuned compressed model> \
  --do_eval

Results

We have provided some running commands (including pruning and recovery-tuning) and pruning configurations of MultiPruner, which can be found here.

Llama-3.2-3B

Method Pruning Ratio Acc. (%) WikiText2 PPL
Dense / 67.67 7.81
BlockPruner 9% 62.31 13.07
MultiPruner 9% 64.04 10.46

Llama-3.1-8B

Method Pruning Ratio Acc. (%) WikiText2 PPL
Dense / 73.75 6.24
BlockPruner 10% 66.75 10.58
MultiPruner 10% 69.27 8.93
BlockPruner 20% 59.08 15.37
MultiPruner 20% 63.07 13.86

Compared to LLM-Pruner (Pruning ratio: ~17%):

Method WikiText2 PPL (Seq Len: 2048)
Dense 6.24
BlockPruner 13.78
LLM-Pruner (L2) 49.09
LLM-Pruner (Taylor) 12.71
MultiPruner (10:90:0) 11.64

Meta-Llama-3-8B

Method Pruning Ratio Acc. (%) WikiText2 PPL
Dense / 72.73 6.14
BlockPruner 10% 66.46 10.88
MultiPruner 10% 69.03 8.19
BlockPruner 20% 57.59 22.36
MultiPruner 20% 63.02 16.01

Compared to LLM-Pruner (Pruning ratio: ~17%):

Method WikiText2 PPL (Seq Len: 2048)
Dense 6.14
BlockPruner 16.15
LLM-Pruner (L2) 34.13
LLM-Pruner (Taylor) 12.86
MultiPruner (10:90:0) 11.11

Qwen2.5-7B

Method Pruning Ratio Acc. (%) WikiText2 PPL
Dense / 72.04 6.85
BlockPruner 10% 67.44 9.88
MultiPruner 10% 69.71 9.15
BlockPruner 20% 57.44 17.17
MultiPruner 20% 62.82 13.37

For additional results and discussions on other models, please refer to the paper.

In addition, we also explored pruning ratios that result in 1%, 2%, and 3% accuracy degradation (compared to Dense), under both without finetune and with finetune scenarios. This investigation may facilitate practical applications. The results of Llama-2-7B are shown in the following table:

Method Pruning Ratio Acc. (%) Acc. Drop Relative Acc.
Dense / 68.96 / 100%
MultiPruner w/o finetune 7% 67.94 -1.02% 98.52%
MultiPruner w/o finetune 10% 67.02 -1.94% 97.19%
MultiPruner w/o finetune 14% 65.93 -3.03% 95.61%
MultiPruner w/ finetune 12% 68.28 -0.68% 99.01%
MultiPruner w/ finetune 15% 67.41 -1.55% 97.75%
MultiPruner w/ finetune 18% 66.16 -2.80% 95.94%

In all tables, Acc.(%) represents the average accuracy score across the five tasks: piqa, winogrande, hellaswag, arc_easy, and arc_challenge.

Loading the compressed model for evaluation

python eval.py --model_path <path to compressed model> --output_path <path to evaluation results>

Acknowledgement

MultiPruner benefits from the following work:

@article{zhong2024blockpruner,
  title={BlockPruner: Fine-grained Pruning for Large Language Models},
  author={Zhong, Longguang and Wan, Fanqi and Chen, Ruijun and Quan, Xiaojun and Li, Liangzhi},
  journal={arXiv preprint arXiv:2406.10594},
  year={2024}
}

Citation

If you find MultiPruner's code and paper helpful, please kindly cite:

@article{munoz2025multipruner,
  title = {Fine-Grained Training-Free Structure Removal in Foundation Models},
  author = {J. Pablo Munoz and Jinjie Yuan and Nilesh Jain},
  year = {2025},
  url = {}
}