This repository contains the implementation and evaluation of various compression techniques for the Qwen2.5-3B and llama2-7B language models. The study explores different combinations of quantization, pruning, and fine-tuning methods to achieve efficient model deployment while maintaining performance.
Large language models (LLMs) have achieved remarkable performance across diverse domains. However, their large size poses challenges for deployment, especially in resource-constrained settings. In this work, we study systematic compression of the Qwen2.5-3B model via multiple techniques. Specifically, we leverage AutoAWQ for 4-bit quantization, ShortGPT for model pruning, and Low-Rank Adaptation (LoRA) for fine-tuning. We examine 16 different compression pipelines arising from combinations of these techniques. We evaluate each pipeline on domain-specific benchmarks (LawBench, MMLU-STEM, MMLU-Law) to study accuracy vs. compression trade-offs.
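As a concrete illustration of the quantization step, the sketch below applies AutoAWQ 4-bit quantization to the base model. The output path and the group-size/zero-point settings are common AutoAWQ defaults used here for illustration, not necessarily the exact configuration from this study.

```python
# Minimal AutoAWQ 4-bit quantization sketch; paths and settings are assumptions.
from awq import AutoAWQForCausalLM
from transformers import AutoTokenizer

model_path = "Qwen/Qwen2.5-3B"   # base model on the Hugging Face Hub
quant_path = "qwen2.5-3b-awq"    # hypothetical output directory

# 4-bit weights, group size 128, GEMM kernels: AutoAWQ's common defaults.
quant_config = {"zero_point": True, "q_group_size": 128, "w_bit": 4, "version": "GEMM"}

model = AutoAWQForCausalLM.from_pretrained(model_path)
tokenizer = AutoTokenizer.from_pretrained(model_path)

# Calibrates on AutoAWQ's default calibration data, then quantizes the weights.
model.quantize(tokenizer, quant_config=quant_config)

model.save_quantized(quant_path)
tokenizer.save_pretrained(quant_path)
```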
- `awq/`: Implementation of 4-bit quantization using AutoAWQ
- `prune_short_Gpt/`: Model pruning using ShortGPT (a layer-pruning sketch follows this list)
- `llamafactory/`: LoRA fine-tuning implementation
- `opencompass/`: Evaluation framework for model performance
- `dataset/`: Domain-specific datasets for evaluation
- `model/`: Compressed model checkpoints
- `GPU_Use_Analysis/`: Analysis of GPU resource utilization
- `gptq/`: Additional quantization experiments
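The `prune_short_Gpt/` step follows ShortGPT's core idea: score each transformer layer by Block Influence (how much the layer changes its input hidden state) and remove the least influential layers. The following is a hedged reconstruction of that idea with plain transformers; the model name, calibration sentence, and number of pruned layers are placeholders, and a real run would average scores over a full calibration set.

```python
# ShortGPT-style layer pruning sketch via Block Influence (BI) scores.
# Layers whose output is nearly parallel to their input (high cosine
# similarity) contribute least and are dropped first.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

name = "Qwen/Qwen2.5-3B"  # placeholder; any decoder-only HF model works
model = AutoModelForCausalLM.from_pretrained(name, torch_dtype=torch.float16)
tokenizer = AutoTokenizer.from_pretrained(name)

# Toy calibration input; ShortGPT averages BI over a real calibration set.
inputs = tokenizer("The quick brown fox jumps over the lazy dog.", return_tensors="pt")
with torch.no_grad():
    hidden = model(**inputs, output_hidden_states=True).hidden_states

# BI of layer i = 1 - mean cosine similarity between its input and output.
scores = []
for i in range(len(hidden) - 1):
    cos = torch.nn.functional.cosine_similarity(hidden[i], hidden[i + 1], dim=-1)
    scores.append(1.0 - cos.mean().item())

# Drop the k least influential layers (k=4 is an arbitrary choice here).
k = 4
prune_idx = set(sorted(range(len(scores)), key=lambda i: scores[i])[:k])
model.model.layers = torch.nn.ModuleList(
    [layer for i, layer in enumerate(model.model.layers) if i not in prune_idx]
)
model.config.num_hidden_layers = len(model.model.layers)
# Note: a full implementation would also re-register per-layer indices
# (e.g. self_attn.layer_idx) so KV caching stays consistent.
```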
- Combining quantization with LoRA fine-tuning strikes the best balance between compression and accuracy (a minimal sketch of this combination follows the list)
- Aggressive pruning can significantly degrade performance on specialized tasks
- The effectiveness of each compression technique varies across domains
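LoRA fine-tuning in this repository is driven by LLaMA-Factory; purely to illustrate the quantization-plus-LoRA combination from the first finding, here is a minimal PEFT-based sketch. It uses bitsandbytes 4-bit loading as a stand-in for AWQ, and the rank, alpha, and target modules are assumptions.

```python
# QLoRA-style sketch: low-rank adapters on top of a 4-bit base model.
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model

bnb_config = BitsAndBytesConfig(load_in_4bit=True, bnb_4bit_compute_dtype=torch.float16)
model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen2.5-3B", quantization_config=bnb_config)

# Adapters on the attention projections; r/alpha values are illustrative.
lora_config = LoraConfig(
    r=8,
    lora_alpha=16,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    lora_dropout=0.05,
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # only adapter weights train; the 4-bit base stays frozen
```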
```bash
# Clone the repository
git clone https://github.com/ryan0980/20250505_LLM_Workout_Plan.git
cd 20250505_LLM_Workout_Plan

# Install dependencies
pip install -r requirements.txt
```
The compressed models are available on the Hugging Face Hub under the tusrau organization. The suffix order encodes the order in which the steps were applied (ft = fine-tuned, q = quantized, p = pruned); a loading sketch follows the list.
- `tusrau/q3bft_q_p`: Fine-tuned, quantized, and pruned model
- `tusrau/q3b_q_ft_p`: Quantized, fine-tuned, and pruned model
- `tusrau/q3b_p_ft_q`: Pruned, fine-tuned, and quantized model
- `tusrau/q3b_ft_p_q`: Fine-tuned, pruned, and quantized model
- `tusrau/q3bp_q`: Pruned and quantized model
- `tusrau/q3bft_q`: Fine-tuned and quantized model
- `tusrau/q3bft`: Fine-tuned model
- `tusrau/q3bq`: Quantized model
- `tusrau/q3b_p_ft`: Pruned and fine-tuned model
- `tusrau/q3bp`: Pruned model
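Any of these checkpoints should load through the standard transformers API. Below is a minimal sketch; the chosen repository and prompt are arbitrary, and the AWQ-quantized variants additionally require `autoawq` to be installed.

```python
# Hedged sketch: load a compressed checkpoint from the Hub and run it.
from transformers import AutoModelForCausalLM, AutoTokenizer

repo = "tusrau/q3bq"  # the quantized-only variant, as an example
model = AutoModelForCausalLM.from_pretrained(repo, device_map="auto")
tokenizer = AutoTokenizer.from_pretrained(repo)

prompt = "What is negligence in tort law?"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
out = model.generate(**inputs, max_new_tokens=64)
print(tokenizer.decode(out[0], skip_special_tokens=True))
```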
The study evaluates 16 different compression pipelines on:
- LawBench
- MMLU-STEM
- MMLU-Law
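OpenCompass is configured with Python files; the sketch below is a hypothetical config pairing one compressed checkpoint with OpenCompass's MMLU definitions. The file name, model entry, and batch settings are assumptions, and the relative `read_base` imports only resolve if the file lives under OpenCompass's `configs/` directory.

```python
# Hypothetical OpenCompass config (configs/eval_compressed.py); a sketch,
# not this repo's actual evaluation config.
from mmengine.config import read_base
from opencompass.models import HuggingFaceCausalLM

with read_base():
    # MMLU dataset definitions shipped with OpenCompass.
    from .datasets.mmlu.mmlu_gen import mmlu_datasets

datasets = [*mmlu_datasets]

models = [
    dict(
        type=HuggingFaceCausalLM,
        abbr="q3bq",                  # label used in result tables
        path="tusrau/q3bq",           # compressed checkpoint on the Hub
        tokenizer_path="tusrau/q3bq",
        max_seq_len=2048,
        max_out_len=100,
        batch_size=8,
        run_cfg=dict(num_gpus=1),
    )
]
```

Such a config would then be launched from the OpenCompass root with `python run.py configs/eval_compressed.py`.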
Detailed results and analysis can be found in the paper (not yet published) or in the results folder.