Add NStepReplayBuffer and n_steps arguments for off-policy algorithms
#2144
Conversation
Pull Request Overview
Adds a new NStepReplayBuffer to compute n-step returns and accompanying tests to validate its behavior under various termination conditions.
- Implements an `NStepReplayBuffer` subclass with an overridden `_get_samples` to handle multi-step returns, terminations, and truncations (a rough sketch of this logic is shown after this list).
- Provides unit tests in `tests/test_n_step_replay.py` covering normal sampling, early termination, and truncation.
- Integrates `NStepReplayBuffer` into DQN/SAC via a `test_run` demonstration.
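A rough sketch of the kind of n-step accumulation described above, with illustrative names only (this is not the code from the PR): rewards are summed with increasing powers of gamma, a termination zeroes the bootstrap discount, and a truncation stops the accumulation early while still allowing bootstrapping.

```python
def n_step_return(rewards, terminations, truncations, gamma, n_steps, start):
    """Illustrative helper, not the PR's implementation.

    Returns (ret, bootstrap_discount, last_idx): ``bootstrap_discount`` is gamma**k
    after k accumulated steps, and 0.0 if a termination was hit, so the caller can
    always write ``target = ret + bootstrap_discount * value(next_obs[last_idx])``.
    """
    ret, discount = 0.0, 1.0
    last_idx = start
    for step in range(n_steps):
        last_idx = start + step
        ret += discount * rewards[last_idx]
        discount *= gamma
        if terminations[last_idx]:
            return ret, 0.0, last_idx  # terminal: do not bootstrap
        if truncations[last_idx]:
            break  # time-limit truncation: stop early but keep bootstrapping
    return ret, discount, last_idx
```

For example, with `gamma=0.99`, `n_steps=3`, three rewards of 1 and no terminations, this yields a return of `1 + 0.99 + 0.99**2` and a bootstrap discount of `0.99**3`.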
Reviewed Changes
Copilot reviewed 2 out of 2 changed files in this pull request and generated 3 comments.
| File | Description |
|---|---|
| tests/test_n_step_replay.py | Added comprehensive tests for n-step buffer behavior |
| stable_baselines3/common/buffers.py | Introduced NStepReplayBuffer with custom sampling logic |
Comments suppressed due to low confidence (2)
stable_baselines3/common/buffers.py:843
- Add a class-level docstring for `NStepReplayBuffer` to describe its purpose, parameters (`n_steps`, `gamma`), and behavior (handling terminations and truncations); one possible shape is sketched below.
class NStepReplayBuffer(ReplayBuffer):
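One possible shape for such a docstring (wording is only a suggestion, not taken from the PR):

```python
class NStepReplayBuffer(ReplayBuffer):
    """
    Replay buffer that computes n-step returns on the fly when sampling.

    :param n_steps: Number of transitions used for the n-step return
        (``n_steps=1`` falls back to the standard 1-step return).
    :param gamma: Discount factor used to accumulate rewards and to compute the
        per-sample bootstrap discount; terminations stop the accumulation and
        disable bootstrapping, truncations stop it early but still bootstrap.
    """
```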
tests/test_n_step_replay.py:10
- The tests cover single-environment behavior but don't validate multi-environment buffers. Add a test with `n_envs > 1` to ensure correct sampling and offset handling across parallel envs (a rough test sketch follows below).
@pytest.mark.parametrize("model_class", [SAC, DQN])
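A rough idea of what such a multi-env test could look like; the `NStepReplayBuffer` constructor keywords (`n_steps`, `gamma`) and its exact behavior are assumptions here, not taken from the PR:

```python
import numpy as np
from gymnasium import spaces

from stable_baselines3.common.buffers import NStepReplayBuffer  # added by this PR


def test_n_step_with_multiple_envs():
    n_envs = 2
    buffer = NStepReplayBuffer(
        buffer_size=32,
        observation_space=spaces.Box(low=-1.0, high=1.0, shape=(2,)),
        action_space=spaces.Box(low=-1.0, high=1.0, shape=(1,)),
        n_envs=n_envs,
        n_steps=3,  # assumed keyword
        gamma=0.99,  # assumed keyword
    )
    # Env 0 always receives reward 0.0 and env 1 always 1.0, so any cross-env
    # leakage in the n-step accumulation would show up as a fractional return.
    for _ in range(32):
        obs = np.zeros((n_envs, 2), dtype=np.float32)
        actions = np.zeros((n_envs, 1), dtype=np.float32)
        rewards = np.array([0.0, 1.0], dtype=np.float32)
        dones = np.zeros(n_envs, dtype=bool)
        buffer.add(obs, obs, actions, rewards, dones, [{} for _ in range(n_envs)])
    samples = buffer.sample(16)
    assert bool(((samples.rewards == 0.0) | (samples.rewards >= 1.0)).all())
```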
Hey araffin, I've tested this version and it works for me at the code level, and I think it is also right at the theoretical level. I like that the solution keeps the original data backend and only modifies how we sample from the buffer; AFAIK most other implementations use a deque to keep a separate line of data, which can be heavy with multiple environments and makes the bookkeeping hard to maintain.
UPDATE:
UPDATE 2: Now I can see that it is dynamic; sorry, it was confusing to me. The mask part already incorporates this mechanism. LGTM
Thanks a lot @richardjozsa for reviewing =)
Good catch, there is an issue with the discount factor used for bootstrapping (
Pull Request Overview
This PR adds support for n-step returns in off-policy algorithms by introducing an NStepReplayBuffer and a new n_steps parameter.
- Integrate `NStepReplayBuffer` and the `n_steps` option across DQN, SAC, TD3, and DDPG.
- Update training loops to use dynamic discount factors (`replay_data.discounts`); a short illustration follows after this list.
- Add tests for n-step buffer behavior, bump version, and update changelog.
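To illustrate what "dynamic discount factors" means for the target computation, here is a minimal sketch using a per-sample discount in place of the fixed `(1 - dones) * gamma` factor; the function and argument names are illustrative, not the PR's exact code:

```python
import torch as th


def n_step_td_target(n_step_rewards: th.Tensor, discounts: th.Tensor, next_values: th.Tensor) -> th.Tensor:
    # `discounts` is gamma**k for each sample and already 0 after a termination,
    # so a single expression covers both bootstrapped and terminal targets.
    return n_step_rewards + discounts * next_values
```

In the algorithms this would be called with `replay_data.rewards`, `replay_data.discounts`, and the target network's values for the n-step next observations.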
Reviewed Changes
Copilot reviewed 13 out of 13 changed files in this pull request and generated 1 comment.
| File | Description |
|---|---|
| stable_baselines3/common/buffers.py | Implement NStepReplayBuffer with on-the-fly return computation |
| stable_baselines3/common/off_policy_algorithm.py | Add n_steps param and choose NStepReplayBuffer when >1 |
| stable_baselines3/dqn/dqn.py | Propagate n_steps and use discounts in target computation |
| stable_baselines3/sac/sac.py | Same propagation and target update |
| stable_baselines3/td3/td3.py | Same propagation and target update |
| stable_baselines3/ddpg/ddpg.py | Same propagation and target update |
| stable_baselines3/common/type_aliases.py | Add discounts field to sample tuple |
| stable_baselines3/version.txt | Bump version to 2.7.0a0 |
| docs/misc/changelog.rst | Document n-step support |
| tests/test_n_step_replay.py | New tests for n-step returns |
| tests/test_buffers.py | Allow None for discounts in buffer tests |
Comments suppressed due to low confidence (2)
tests/test_n_step_replay.py:8
- [nitpick] Missing test for `NStepReplayBuffer` with `optimize_memory_usage=True`. Add a test asserting that initializing the buffer with `optimize_memory_usage=True` raises `NotImplementedError` (a possible sketch follows below).
import pytest
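Such a test might look roughly like the following; it assumes the buffer rejects `optimize_memory_usage=True` at construction time and that `n_steps`/`gamma` are constructor keywords:

```python
import pytest
from gymnasium import spaces

from stable_baselines3.common.buffers import NStepReplayBuffer


def test_n_step_memory_optimization_not_supported():
    # Hypothetical check: the n-step buffer is expected to refuse optimize_memory_usage=True.
    with pytest.raises(NotImplementedError):
        NStepReplayBuffer(
            buffer_size=16,
            observation_space=spaces.Box(low=-1.0, high=1.0, shape=(2,)),
            action_space=spaces.Discrete(2),
            optimize_memory_usage=True,
            n_steps=3,  # assumed keyword
            gamma=0.99,  # assumed keyword
        )
```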
stable_baselines3/common/off_policy_algorithm.py:54
- [nitpick] The `n_steps` param doc should clarify that `n_steps=1` preserves standard (1-step) behavior and note that `optimize_memory_usage` is not supported when `n_steps > 1`.
:param n_steps: When n_step > 1, uses n-step return (with the NStepReplayBuffer) when updating the Q-value network.
qgallouedec left a comment
looking good!
Some benchmark results:

- SAC - LunarLanderContinuous-v3 (10 seeds) [plot]
- BipedalWalker-v3 (3 seeds) [plot]
- LunarLander and BipedalWalker combined [plot]
- Swimmer-v3 (note: only 3 seeds) [plot]
Description
Reviving #81
closes #47
Idea based on https://github.com/younggyoseo/FastTD3 implementation with fixes to avoid younggyoseo/FastTD3#6
Mostly tested on IsaacSim so far.
To try it out (I'm thinking about adding an `n_steps` param to off-policy algorithms to make it easier):
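The original snippet is not shown above; a minimal sketch of how this could be tried through the existing `replay_buffer_class` / `replay_buffer_kwargs` arguments (the `NStepReplayBuffer` keyword arguments are assumed):

```python
from stable_baselines3 import SAC
from stable_baselines3.common.buffers import NStepReplayBuffer

# Sketch only: the NStepReplayBuffer kwargs (n_steps, gamma) are assumed.
model = SAC(
    "MlpPolicy",
    "Pendulum-v1",
    replay_buffer_class=NStepReplayBuffer,
    replay_buffer_kwargs=dict(n_steps=3, gamma=0.99),
    verbose=1,
)
model.learn(total_timesteps=5_000)
```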
Motivation and Context
Types of changes
Checklist
- `make format` (required)
- `make check-codestyle` and `make lint` (required)
- `make pytest` and `make type` both pass (required)
- `make doc` (required)

Note: You can run most of the checks using `make commit-checks`.
Note: we are using a maximum length of 127 characters per line.