
RL-for-LLMs-Optimization

This repository explores Reinforcement Learning (RL) strategies to optimize Large Language Models (LLMs) with a focus on efficiency and resource savings. The project demonstrates both prompt-level and model-level optimization via pruning and quantization, using a combination of custom RL environments and agents.


Contents

  • Prompt_prunning.ipynb: RL-driven approach to prompt pruning for LLMs.
  • rl-llm-optimization-hybridApproach.ipynb: Hybrid RL environment combining pruning and quantization for compressing LLMs.
  • rl-llm-prunning.ipynb: RL-based attention head pruning in transformer models for size and latency reduction.

Project Overview

1. Prompt Pruning (Prompt_prunning.ipynb)

  • Focus: Uses RL to prune tokens from prompts fed to LLMs, with the goal of reducing computation while preserving output quality.
  • Approach:
    • Extracts features such as token saliency, attention entropy, and reconstruction error.
    • An RL agent learns to select the most critical tokens, balancing resource savings and output similarity.
  • Techniques: Deep RL, autoencoders for feature extraction, and reward engineering that combines speed, memory, and output similarity (see the reward sketch below).
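
As a concrete illustration, here is a minimal sketch of such a reward, assuming the agent emits a keep/drop mask over prompt tokens. The function name, the alpha weight, and the KL-based similarity term are illustrative assumptions, not the notebook's exact implementation.

import math
import torch
import torch.nn.functional as F

def prompt_pruning_reward(kept_mask: torch.Tensor,
                          full_logits: torch.Tensor,
                          pruned_logits: torch.Tensor,
                          alpha: float = 0.5) -> float:
    """kept_mask: (seq_len,) booleans for tokens the agent kept.
    full_logits / pruned_logits: next-token logits (vocab_size,) from the
    full prompt vs. the pruned prompt."""
    # Resource-saving term: fraction of prompt tokens removed.
    savings = 1.0 - kept_mask.float().mean().item()
    # Output-similarity term: agreement between the two output distributions,
    # measured as exp(-KL(full || pruned)).
    kl = F.kl_div(F.log_softmax(pruned_logits, dim=-1),
                  F.softmax(full_logits, dim=-1),
                  reduction="sum").item()
    similarity = math.exp(-kl)
    # Weighted trade-off between saving compute and preserving the output.
    return alpha * savings + (1 - alpha) * similarity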

2. Hybrid Model Compression (rl-llm-optimization-hybridApproach.ipynb)

  • Focus: Optimizes LLMs by learning both pruning rates and quantization bit-widths per layer with RL.
  • Approach:
    • Custom Gymnasium environment simulates incremental model compression.
    • A PPO agent (PyTorch) learns policies that maximize compression while minimizing performance loss.
    • Rewards are computed from perplexity, FLOPs, parameter count, and overall resource savings, all tracked per step.
  • Techniques: Pruning (magnitude-based), quantization (min-max), curriculum learning, experiment tracking (WandB); a minimal environment sketch follows.
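
For orientation, a per-layer compression environment along these lines could look as follows. The class name, observation layout, and the placeholder quality penalty are assumptions for illustration; the notebook's environment measures real perplexity, FLOPs, and parameter counts instead.

import gymnasium as gym
import numpy as np
from gymnasium import spaces

class LayerCompressionEnv(gym.Env):
    """One episode walks through the model's layers; at each step the agent
    picks a pruning rate and a quantization bit-width for the current layer."""

    def __init__(self, num_layers: int = 12):
        super().__init__()
        self.num_layers = num_layers
        # Action: [pruning rate in [0, 1], bit-width scaled to [0, 1] -> 2..8 bits]
        self.action_space = spaces.Box(low=0.0, high=1.0, shape=(2,), dtype=np.float32)
        # Observation: [fraction of layers processed, cumulative compression so far]
        self.observation_space = spaces.Box(low=0.0, high=1.0, shape=(2,), dtype=np.float32)

    def reset(self, seed=None, options=None):
        super().reset(seed=seed)
        self.layer = 0
        self.compression = 0.0
        return self._obs(), {}

    def step(self, action):
        prune_rate = float(action[0])
        bits = 2 + round(float(action[1]) * 6)  # map [0, 1] to 2..8 bits
        # Fraction of this layer's weight-bits removed by pruning + quantization.
        layer_saving = prune_rate + (1 - bits / 32) * (1 - prune_rate)
        self.compression += layer_saving / self.num_layers
        # Placeholder quality penalty; the real environment evaluates perplexity.
        quality_penalty = 0.5 * prune_rate ** 2 + 0.3 * (8 - bits) / 6
        reward = layer_saving - quality_penalty
        self.layer += 1
        terminated = self.layer >= self.num_layers
        return self._obs(), reward, terminated, False, {}

    def _obs(self):
        return np.array([self.layer / self.num_layers, self.compression], dtype=np.float32)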

3. Attention Head Pruning (rl-llm-prunning.ipynb)

  • Focus: Uses RL to decide how many attention heads to keep in each transformer layer, trading off model quality against efficiency.
  • Approach:
    • Defines a Gym environment as a "game" where the agent's actions are pruning decisions.
    • Rewards combine perplexity (quality) with latency and memory savings.
    • Trains policy and value networks with PPO; includes tools to prune, save, and evaluate compressed models.
  • Techniques: PPO, custom reward engineering, evaluation via perplexity (see the head-pruning sketch below).
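
For reference, a learned per-layer head budget can be applied with the HuggingFace Transformers prune_heads API roughly as sketched below. The model choice (distilgpt2), the example head counts, and the decision to drop the trailing head indices are illustrative assumptions; the notebook's policy decides which heads to remove.

from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained("distilgpt2")

# Suppose the RL policy decided how many heads to keep per layer (distilgpt2
# has 6 layers with 12 heads each).
heads_to_keep_per_layer = {0: 8, 1: 6, 2: 8, 3: 10, 4: 12, 5: 8}

# Transformers expects a dict of {layer_index: [head indices to REMOVE]}.
num_heads = model.config.num_attention_heads
heads_to_prune = {
    layer: list(range(keep, num_heads))  # drop the trailing heads as an example
    for layer, keep in heads_to_keep_per_layer.items()
}
model.prune_heads(heads_to_prune)

# Persist the compressed model for later evaluation (e.g. perplexity, latency).
model.save_pretrained("pruned-distilgpt2")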

Installation

pip install torch transformers datasets sentencepiece gymnasium matplotlib
# (Optional for hybrid compression) pip install wandb

Some notebooks require a GPU with CUDA drivers for full functionality.


Usage

Clone the repository:

git clone https://github.com/larbi1512/RL-for-LLms-optimization.git
cd RL-for-LLms-optimization

Open the notebook of interest (.ipynb) in Jupyter or VS Code and follow the step-by-step instructions.

  • For prompt pruning, see Prompt_prunning.ipynb.
  • For hybrid model compression, see rl-llm-optimization-hybridApproach.ipynb.
  • For attention head pruning, see rl-llm-prunning.ipynb.

Requirements

  • Python 3.8+
  • PyTorch
  • HuggingFace Transformers
  • Datasets
  • Gymnasium
  • Matplotlib
  • (Optional) WandB, CUDA-capable GPU

License

See the LICENSE file for details.


Acknowledgements

  • HuggingFace for model and dataset APIs.
  • OpenAI Gymnasium for RL environment scaffolding.
  • The open-source RL and transformers communities.

Contact

For questions, please open an issue or reach out to the repository maintainer.
