Proposal to Add PPO + Mamba SSM to CleanRL
Issue Type
- [ ] Bug Report
- [x] Feature Request / Algorithm Implementation
- [ ] Documentation
- [ ] Question
Overview
This issue proposes adding a new algorithm implementation: PPO integrated with the Mamba State-Space Model (SSM). Mamba is a recent neural network architecture designed for efficient sequence modeling, and it has demonstrated significant improvements in computational speed and memory usage over traditional recurrent models (LSTM, GRU) and Transformer-based models such as the existing ppo_trxl implementation.
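To make the idea concrete, below is a minimal sketch (not the proposed implementation) of how a CleanRL-style agent could swap its LSTM memory for a Mamba block. It assumes the `mamba_ssm` package from state-spaces/mamba; the class name `MambaAgent`, the `layer_init` helper, and all layer sizes are illustrative placeholders in the spirit of CleanRL's existing PPO scripts.

```python
# Sketch only: a CleanRL-style agent whose memory module is a Mamba block.
# Assumes the `mamba_ssm` package (pip install mamba-ssm, CUDA required);
# names and sizes are illustrative, not the final implementation.
import torch
import torch.nn as nn
from torch.distributions import Categorical
from mamba_ssm import Mamba


def layer_init(layer, std=2**0.5, bias_const=0.0):
    # Same orthogonal-initialization helper used across CleanRL's PPO scripts.
    nn.init.orthogonal_(layer.weight, std)
    nn.init.constant_(layer.bias, bias_const)
    return layer


class MambaAgent(nn.Module):
    def __init__(self, obs_dim, n_actions, d_model=128):
        super().__init__()
        self.encoder = layer_init(nn.Linear(obs_dim, d_model))
        # Mamba consumes a (batch, seq_len, d_model) tensor, like a Transformer block.
        self.memory = Mamba(d_model=d_model, d_state=16, d_conv=4, expand=2)
        self.actor = layer_init(nn.Linear(d_model, n_actions), std=0.01)
        self.critic = layer_init(nn.Linear(d_model, 1), std=1.0)

    def get_action_and_value(self, obs_seq, action=None):
        # obs_seq: (batch, seq_len, obs_dim) -- a window of past observations,
        # analogous to the memory window used by ppo_trxl.
        hidden = self.memory(torch.relu(self.encoder(obs_seq)))[:, -1]
        dist = Categorical(logits=self.actor(hidden))
        if action is None:
            action = dist.sample()
        return action, dist.log_prob(action), dist.entropy(), self.critic(hidden)
```

Like ppo_trxl, such an agent would consume a sliding window of observations rather than a single step, which is where Mamba's linear-time sequence scan pays off relative to Transformer-XL's attention.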
Motivation
- Faster training than recurrent (LSTM/GRU) and Transformer-XL baselines
- Lower GPU memory usage compared to Transformer-XL
- Good performance on POMDPs and other memory-based tasks
Environments Tested
- ProofOfMemory Environments
- MiniGrid (Memory, DoorKey)
- Classic control with masked observations to induce partial observability: LunarLander, CartPole (see the wrapper sketch after this list)
- MuJoCo continuous control tasks: HalfCheetah, Walker2d, Hopper
- Atari games: Breakout, Pong
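As a concrete illustration of the masked classic-control setup mentioned above, here is one common way to mask observations: a wrapper that zeroes out selected components (for CartPole, hiding the velocities forces the agent to rely on memory). The wrapper and the masked indices below are illustrative, not a description of the exact experimental setup.

```python
# Illustrative observation-masking wrapper for classic-control POMDP experiments.
import gymnasium as gym
import numpy as np


class MaskObservationWrapper(gym.ObservationWrapper):
    """Zero out selected components of the observation vector."""

    def __init__(self, env, masked_indices):
        super().__init__(env)
        self.masked_indices = masked_indices

    def observation(self, obs):
        obs = np.array(obs, dtype=np.float32)
        obs[self.masked_indices] = 0.0
        return obs


# CartPole-v1 observation = [position, velocity, angle, angular velocity];
# masking indices 1 and 3 hides the velocity information.
env = MaskObservationWrapper(gym.make("CartPole-v1"), masked_indices=[1, 3])
```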
Current Behavior
- No current implementation of PPO + Mamba SSM in CleanRL
Expected Behavior
- Implementation of PPO integrated with Mamba SSM
- Performance benchmarks showing training efficiency and final performance
- Documentation on usage and hyperparameter tuning
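To give a sense of what the usage and hyperparameter documentation could cover, here is a hedged sketch of the argument dataclass a hypothetical `ppo_mamba.py` script might expose, following the tyro-based `Args` pattern of recent CleanRL scripts such as ppo_trxl. All names and defaults are illustrative, not final hyperparameters.

```python
# Hypothetical argument layout for a ppo_mamba.py script; values are placeholders.
from dataclasses import dataclass


@dataclass
class Args:
    env_id: str = "MiniGrid-MemoryS13-v0"
    total_timesteps: int = 10_000_000
    learning_rate: float = 2.5e-4
    num_envs: int = 16
    num_steps: int = 128          # rollout length per environment
    # Mamba-specific knobs (assumed, not final):
    d_model: int = 128            # width of the Mamba block
    d_state: int = 16             # SSM state dimension
    mamba_version: int = 2        # 1 = Mamba, 2 = Mamba-2
```

The Mamba-specific fields would be the main additions relative to a standard PPO configuration; everything else follows the usual CleanRL PPO arguments.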
Planned Contribution Steps
- Complete comprehensive benchmarks following CleanRL's contribution guidelines
- Utilize the RLops utility for rigorous performance and regression analysis
- Update the CleanRL documentation to clearly present benchmark results and learning curves
- Submit PR with implementation and documentation
I already have experimental results in these environments, along with a comparison of efficiency metrics; I tested both the Mamba and Mamba-2 models.
@vwxyzjn, I plan to follow the approach used to integrate ppo_trxl. What do you think?