PruneNet: Calibration-Free Model Compression

This repository contains the code for the paper You Only Prune Once: Designing Calibration-Free Model Compression With Policy Learning.

The paper introduces PruneNet, a novel structured-pruning technique that intrinsically prunes transformer models without relying on any calibration dataset. PruneNet works by slicing off unimportant rows from the weight matrices of the models' FFN layers, where the importance score of each row is computed by a two-layer neural network. The pruning process is modeled as a stochastic policy, trained with a standard RL pipeline to preserve the spectral structure of the weight matrices.
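To make the mechanics concrete, here is a minimal, self-contained sketch of row pruning under a spectrum-preserving objective. The function names, the source of the scores, and the shape of the reward are illustrative assumptions, not the repository's actual implementation.

import torch

def spectral_reward(w_full: torch.Tensor, w_pruned: torch.Tensor) -> torch.Tensor:
    # Reward pruned matrices whose singular-value spectrum stays close to the
    # original's (illustrative: negative mean gap between singular values).
    s_full = torch.linalg.svdvals(w_full)
    s_pruned = torch.linalg.svdvals(w_pruned)
    k = min(s_full.numel(), s_pruned.numel())
    return -(s_full[:k] - s_pruned[:k]).abs().mean()

def prune_rows(weight: torch.Tensor, scores: torch.Tensor, compression_ratio: float) -> torch.Tensor:
    # Keep the (1 - compression_ratio) fraction of rows with the highest scores.
    n_keep = int(weight.shape[0] * (1.0 - compression_ratio))
    keep = scores.topk(n_keep).indices.sort().values
    return weight[keep]

w = torch.randn(3072, 768)       # e.g. an FFN up-projection matrix
scores = torch.rand(w.shape[0])  # stand-in for the policy model's outputs
w_small = prune_rows(w, scores, compression_ratio=0.3)
print(w_small.shape, spectral_reward(w, w_small).item())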

Installation and requirements

We reuse many components from the SliceGPT pipeline, so we recommend Python >= 3.10. To install the required components, run the following:

git clone https://github.com/microsoft/TransformerCompression
cd TransformerCompression
pip install -e .[experiment,finetune]
pip install git+https://github.com/pnnl/DDKS

Usage

The main scripts live in the prunenet directory. prunenet/prunenet.py trains a SparsityPredictor (the policy model) that computes importance scores for the rows of weight matrices, then uses the trained policy model to prune an LLM. prunenet/prunenet_utils.py contains utility functions used throughout the implementation, and prunenet/SparsityPredictor.py contains the PyTorch definition of the policy model.
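For orientation, a two-layer policy network of this kind could look like the following in PyTorch. This is a hedged sketch based on the description above; the class name, hidden size, and sigmoid parameterization are assumptions rather than the exact contents of prunenet/SparsityPredictor.py.

import torch
import torch.nn as nn

class SparsityPredictorSketch(nn.Module):
    # Illustrative two-layer policy network: maps each row of a weight
    # matrix to a keep-probability used as its importance score.
    def __init__(self, in_features: int, hidden_dim: int = 256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(in_features, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, 1),
        )

    def forward(self, weight: torch.Tensor) -> torch.Tensor:
        # weight: (num_rows, in_features); returns one score per row.
        return torch.sigmoid(self.net(weight)).squeeze(-1)

policy = SparsityPredictorSketch(in_features=768)
scores = policy(torch.randn(3072, 768))  # one importance score per FFN row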

Here is an example that prunes the facebook/opt-125m model with a compression ratio of 0.3. Most users should be able to run this example locally.

CUDA_VISIBLE_DEVICES=0 python3 -m prunenet                  \
    --model_name facebook/opt-125m                          \
    --compression_ratio 0.3                                 \
    --save_dir  /home/codetalker7/compressed_models/opt/    \
    --device cuda:0

This script trains the action model (if one doesn't already exist in the directory specified by --save_dir), saves it, prunes the model, and saves the pruned model's weights. The trained action model can be reused to compress other models as well.
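Continuing the illustrative snippets above, reusing a trained action model on another model could look roughly like the following. The checkpoint file name and loading convention here are hypothetical, since the actual format is defined by prunenet/prunenet.py; SparsityPredictorSketch is the class sketched earlier.

import torch
from transformers import AutoModelForCausalLM

# Hypothetical checkpoint name; the real file is whatever prunenet.py
# saves under --save_dir.
policy = SparsityPredictorSketch(in_features=768)
policy.load_state_dict(torch.load("/home/codetalker7/compressed_models/opt/action_model.pt"))
policy.eval()

# In this sketch, reuse applies to models whose FFN rows have the same
# feature dimension as the policy's input.
model = AutoModelForCausalLM.from_pretrained("facebook/opt-125m")
with torch.no_grad():
    for layer in model.model.decoder.layers:
        scores = policy(layer.fc1.weight)  # score each FFN row
        # ... slice the scored rows as in prune_rows(...) above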

Evaluation scripts

Slicing the attention modules

Citation

If you find our work useful in your projects or research, please cite our paper:

@inproceedings{sengupta2025you,
    title={You Only Prune Once: Designing Calibration-Free Model Compression With Policy Learning},
    author={Ayan Sengupta and Siddhant Chaudhary and Tanmoy Chakraborty},
    booktitle={The Thirteenth International Conference on Learning Representations},
    year={2025},
    url={https://openreview.net/forum?id=5RZoYIT3u6}
}
