This repository explores Reinforcement Learning (RL) strategies to optimize Large Language Models (LLMs) with a focus on efficiency and resource savings. The project demonstrates both prompt-level and model-level optimization via pruning and quantization, using a combination of custom RL environments and agents.
- `Prompt_prunning.ipynb`: RL-driven approach to prompt pruning for LLMs.
- `rl-llm-optimization-hybridApproach.ipynb`: Hybrid RL environment combining pruning and quantization for compressing LLMs.
- `rl-llm-prunning.ipynb`: RL-based attention head pruning in transformer models for size and latency reduction.
**`Prompt_prunning.ipynb`**
- Focus: Uses RL to prune tokens from prompts fed to LLMs, aiming to reduce computation while preserving output quality.
- Approach:
  - Extracts features such as token saliency, attention entropy, and reconstruction error.
  - An RL agent learns to select the most critical tokens, balancing resource savings against output similarity.
- Techniques: Deep RL, autoencoders for feature extraction, and reward engineering that combines speed, memory, and output similarity (a minimal reward sketch follows below).
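As a rough illustration of this kind of reward engineering, the sketch below trades the fraction of pruned tokens against output similarity. It is not the notebook's exact code; the weights `alpha`, `beta`, `gamma` and the 0.9 similarity floor are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

def prompt_pruning_reward(kept_mask, full_embedding, pruned_embedding,
                          alpha=0.5, beta=0.3, gamma=0.2):
    """Illustrative reward balancing token savings against output similarity.

    kept_mask        : bool tensor, True for tokens the agent keeps
    full_embedding   : LLM output embedding for the original prompt
    pruned_embedding : LLM output embedding for the pruned prompt
    alpha/beta/gamma : hypothetical trade-off weights
    """
    # Resource saving: fraction of tokens removed (proxy for speed and memory)
    savings = 1.0 - kept_mask.float().mean()
    # Output quality: cosine similarity between full and pruned outputs
    similarity = F.cosine_similarity(full_embedding, pruned_embedding, dim=-1).mean()
    # Penalize pruning so aggressively that similarity collapses below the floor
    penalty = torch.clamp(0.9 - similarity, min=0.0)
    return (alpha * savings + beta * similarity - gamma * penalty).item()
```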
**`rl-llm-optimization-hybridApproach.ipynb`**
- Focus: Optimizes LLMs by using RL to learn both pruning rates and quantization bit-widths per layer.
- Approach:
  - A custom Gymnasium environment simulates incremental model compression.
  - A PPO agent (PyTorch) learns policies that maximize compression while minimizing performance loss.
  - Rewards are tracked in terms of perplexity, FLOPs, parameter count, and overall resource savings.
- Techniques: Magnitude-based pruning, min-max quantization, curriculum learning, and experiment tracking with WandB (see the compression sketch below).
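The per-layer compression primitives can be pictured roughly as follows. This is an illustrative sketch of magnitude-based pruning and min-max quantization, not the notebook's exact implementation; the layer size and the example action values are made up.

```python
import torch

def magnitude_prune(weight: torch.Tensor, prune_rate: float) -> torch.Tensor:
    """Zero out the smallest-magnitude weights (unstructured pruning sketch)."""
    k = int(weight.numel() * prune_rate)
    if k == 0:
        return weight
    threshold = weight.abs().flatten().kthvalue(k).values
    return weight * (weight.abs() > threshold)

def minmax_quantize(weight: torch.Tensor, bits: int) -> torch.Tensor:
    """Uniform min-max quantization to the given bit-width, then dequantize."""
    qmax = 2 ** bits - 1
    w_min, w_max = weight.min(), weight.max()
    scale = (w_max - w_min).clamp(min=1e-8) / qmax
    q = torch.round((weight - w_min) / scale).clamp(0, qmax)
    return q * scale + w_min

# Example: an RL action picks (prune_rate, bits) for one layer's weight matrix
layer_weight = torch.randn(768, 768)
compressed = minmax_quantize(magnitude_prune(layer_weight, prune_rate=0.3), bits=8)
```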
**`rl-llm-prunning.ipynb`**
- Focus: Uses RL to determine the optimal number of attention heads to keep per transformer layer, trading off model quality against efficiency.
- Approach:
  - Defines a Gym environment as a "game" in which the agent's actions are pruning decisions.
  - Rewards combine perplexity (quality) with latency and memory savings.
  - Trains policy and value networks with PPO; includes tools to prune, save, and evaluate compressed models.
- Techniques: PPO, custom reward engineering, and evaluation via perplexity (see the environment sketch below).
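The "game" can be pictured with a minimal Gymnasium environment skeleton like the one below. The class name `HeadPruningEnv`, the observation layout, the `evaluate_fn` hook, and the reward weights are illustrative assumptions, not the notebook's actual implementation.

```python
import gymnasium as gym
import numpy as np
from gymnasium import spaces

class HeadPruningEnv(gym.Env):
    """Sketch: step through layers, choosing how many attention heads to keep."""

    def __init__(self, num_layers=12, num_heads=12, evaluate_fn=None):
        super().__init__()
        self.num_layers = num_layers
        self.num_heads = num_heads
        # evaluate_fn(kept_heads) -> (perplexity, latency_saving, memory_saving);
        # hypothetical hook into the model-evaluation code
        self.evaluate_fn = evaluate_fn
        # Action: number of heads to keep in the current layer (1..num_heads)
        self.action_space = spaces.Discrete(num_heads)
        # Observation: one-hot layer index plus average fraction of heads kept so far
        self.observation_space = spaces.Box(0.0, 1.0, shape=(num_layers + 1,), dtype=np.float32)

    def _obs(self):
        obs = np.zeros(self.num_layers + 1, dtype=np.float32)
        if self.layer < self.num_layers:
            obs[self.layer] = 1.0
        obs[-1] = float(np.mean(self.kept_heads)) / self.num_heads if self.kept_heads else 0.0
        return obs

    def reset(self, seed=None, options=None):
        super().reset(seed=seed)
        self.layer = 0
        self.kept_heads = []
        return self._obs(), {}

    def step(self, action):
        self.kept_heads.append(int(action) + 1)  # always keep at least one head
        self.layer += 1
        terminated = self.layer == self.num_layers
        reward = 0.0
        if terminated and self.evaluate_fn is not None:
            perplexity, latency_saving, memory_saving = self.evaluate_fn(self.kept_heads)
            # Reward quality (low perplexity) plus efficiency gains; weights are illustrative
            reward = -0.1 * perplexity + latency_saving + memory_saving
        return self._obs(), reward, terminated, False, {}
```

An environment like this can be wrapped and trained with any PPO implementation that speaks the Gymnasium API; the notebook trains its own policy and value networks against its environment.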
Install the dependencies:

    pip install torch transformers datasets sentencepiece gymnasium matplotlib
    # (Optional, for hybrid compression) pip install wandb
Some notebooks require GPU and CUDA drivers for full functionality.
Clone the repository:

    git clone https://github.com/larbi1512/RL-for-LLms-optimization.git
    cd RL-for-LLms-optimization
Open the notebook of interest (`.ipynb`) in Jupyter or VS Code and follow the step-by-step instructions.
- For prompt pruning, see `Prompt_prunning.ipynb`.
- For hybrid model compression, see `rl-llm-optimization-hybridApproach.ipynb`.
- For attention head pruning, see `rl-llm-prunning.ipynb`.
- Python 3.8+
- PyTorch
- HuggingFace Transformers
- Datasets
- Gymnasium
- Matplotlib
- (Optional) WandB, CUDA-capable GPU
See the LICENSE file for details.
- HuggingFace for model and dataset APIs.
- Farama's Gymnasium (successor to OpenAI Gym) for RL environment scaffolding.
- The open-source RL and transformers communities.
For questions, please open an issue or reach out to the repository maintainer.