Conversation

@araffin
Member

@araffin araffin commented Jun 6, 2025

Description

Reviving #81

closes #47

Idea based on the https://github.com/younggyoseo/FastTD3 implementation, with fixes to avoid younggyoseo/FastTD3#6.

Mostly tested on IsaacSim so far.

To try it out (I'm thinking about adding an n_steps param to the off-policy algorithms to make it easier):

from stable_baselines3 import DQN, SAC
from stable_baselines3.common.buffers import NStepReplayBuffer, ReplayBuffer
from stable_baselines3.common.env_util import make_vec_env

model_class = SAC

env_id = "CartPole-v1" if model_class == DQN else "Pendulum-v1"
env = make_vec_env(env_id, n_envs=2)

model = model_class("MlpPolicy", env, n_steps=3)

model.learn(total_timesteps=150)
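
With n_steps > 1, the algorithm picks the NStepReplayBuffer automatically. Roughly speaking (not the exact implementation details): with discount factor gamma, the buffer replaces the usual 1-step reward r_t with an n-step return of the form

R_t^(n) = r_t + gamma * r_{t+1} + ... + gamma^(n-1) * r_{t+n-1}

and bootstraps from the observation n steps ahead with discount gamma^n, cutting the sum short (and shrinking the bootstrap discount accordingly) when the episode terminates or is truncated earlier.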

Motivation and Context

  • I have raised an issue to propose this change (required for new features and bug fixes)

Types of changes

  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to change)
  • Documentation (update in the documentation)

Checklist

  • I've read the CONTRIBUTION guide (required)
  • I have updated the changelog accordingly (required).
  • My change requires a change to the documentation.
  • I have updated the tests accordingly (required for a bug fix or a new feature).
  • I have updated the documentation accordingly.
  • I have opened an associated PR on the SB3-Contrib repository (if necessary)
  • I have opened an associated PR on the RL-Zoo3 repository (if necessary)
  • I have reformatted the code using make format (required)
  • I have checked the codestyle using make check-codestyle and make lint (required)
  • I have ensured make pytest and make type both pass. (required)
  • I have checked that the documentation builds using make doc (required)

Note: You can run most of the checks using make commit-checks.

Note: we are using a maximum length of 127 characters per line

@araffin araffin mentioned this pull request Jun 6, 2025
@araffin araffin requested a review from Copilot June 6, 2025 15:24

Copilot AI left a comment


Pull Request Overview

Adds a new NStepReplayBuffer to compute n-step returns and accompanying tests to validate its behavior under various termination conditions.

  • Implements NStepReplayBuffer subclass with overridden _get_samples to handle multi-step returns, terminations, and truncations.
  • Provides unit tests in tests/test_n_step_replay.py covering normal sampling, early termination, and truncation.
  • Integrates NStepReplayBuffer into DQN/SAC via test_run demonstration.

Reviewed Changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 3 comments.

  • tests/test_n_step_replay.py: Added comprehensive tests for n-step buffer behavior
  • stable_baselines3/common/buffers.py: Introduced NStepReplayBuffer with custom sampling logic

Comments suppressed due to low confidence (2)

stable_baselines3/common/buffers.py:843

  • Add a class-level docstring for NStepReplayBuffer to describe its purpose, parameters (n_steps, gamma), and behavior (handling terminations and truncations).
class NStepReplayBuffer(ReplayBuffer):

tests/test_n_step_replay.py:10

  • The tests cover single-environment behavior but don’t validate multi-environment buffers. Add a test with n_envs > 1 to ensure correct sampling and offset handling across parallel envs.
@pytest.mark.parametrize("model_class", [SAC, DQN])

@richardjozsa

richardjozsa commented Jun 8, 2025

Hey araffin,

I've tested this version and it works for me, both at the code level and, I think, at the theoretical level. I like that the solution keeps the original 'data backend' and only changes how we sample from the buffer; afaik most other implementations use a deque to keep a separate stream of data, which can be heavy with multiple environments and makes the bookkeeping hard.

UPDATE:
The one point I'm not sure about: in your code you calculate the discount for self.n_steps as a constant, but doesn't it have to be dynamic? For example, if we set n_steps=5 and the episode gets 'done' at the 3rd of those 5 steps, I think the reward has to be discounted by the number of steps until the 'done' flag, so gamma ** 3 in this example. Of course, correct me if I'm wrong.

UPDATE2:

Now I can see that it is dynamic, sorry, it was just confusing to me: the mask part already incorporates this mechanism. LGTM
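
For the record, the masked accumulation discussed above can be sketched like this (an illustrative toy function with hypothetical names, not the actual NStepReplayBuffer code):

def n_step_return(rewards, dones, gamma, n_steps):
    """Accumulate rewards for up to n_steps, stopping early at a done flag.

    Returns the discounted reward sum, the discount to use for bootstrapping,
    and whether the episode ended within the n steps.
    """
    reward_sum, discount, done = 0.0, 1.0, False
    for k in range(n_steps):
        reward_sum += discount * rewards[k]
        discount *= gamma
        if dones[k]:
            done = True
            break  # episode ended after k + 1 steps: stop accumulating
    # here discount == gamma ** (k + 1), e.g. gamma ** 3 if done at the 3rd step
    return reward_sum, discount, done

# n_steps=5 but the episode terminates at the 3rd step:
rewards = [1.0, 1.0, 1.0, 1.0, 1.0]
dones = [False, False, True, False, False]
print(n_step_return(rewards, dones, gamma=0.99, n_steps=5))
# -> (1 + 0.99 + 0.99 ** 2, 0.99 ** 3, True)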

@araffin
Member Author

araffin commented Jun 10, 2025

Thanks a lot @richardjozsa for reviewing =)

> calculate the discount for self.n_steps as a constant

Good catch, there is an issue with the discount factor used for bootstrapping (model.gamma is kept constant even with shorter episodes).
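
Concretely, the fix amounts to storing a per-sample discount alongside each transition and bootstrapping with it instead of the constant model.gamma. A minimal sketch of the resulting target computation, with made-up batch values (illustrative only, not the exact SB3 training code):

import torch as th

# hypothetical mini-batch, just to show the shape of the computation
rewards = th.tensor([[2.9701], [4.9010]])          # n-step discounted reward sums
dones = th.tensor([[1.0], [0.0]])                  # 1.0 if the episode terminated within the n steps
discounts = th.tensor([[0.99**3], [0.99**5]])      # gamma ** (number of steps actually accumulated)
next_q_values = th.tensor([[10.0], [10.0]])        # target-network estimate at the n-step-ahead observation

# per-sample discounts replace the constant gamma in the bootstrap term
target_q_values = rewards + (1 - dones) * discounts * next_q_values
print(target_q_values)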

@araffin araffin marked this pull request as ready for review June 11, 2025 08:48
@araffin araffin requested a review from Copilot June 11, 2025 08:48

This comment was marked as outdated.

@araffin araffin requested a review from Copilot June 11, 2025 08:50

Copilot AI left a comment


Pull Request Overview

This PR adds support for n-step returns in off-policy algorithms by introducing an NStepReplayBuffer and a new n_steps parameter.

  • Integrate NStepReplayBuffer and n_steps option across DQN, SAC, TD3, and DDPG
  • Update training loops to use dynamic discount factors (replay_data.discounts)
  • Add tests for n-step buffer behavior, bump version, and update changelog

Reviewed Changes

Copilot reviewed 13 out of 13 changed files in this pull request and generated 1 comment.

  • stable_baselines3/common/buffers.py: Implement NStepReplayBuffer with on-the-fly return computation
  • stable_baselines3/common/off_policy_algorithm.py: Add n_steps param and choose NStepReplayBuffer when >1
  • stable_baselines3/dqn/dqn.py: Propagate n_steps and use discounts in target computation
  • stable_baselines3/sac/sac.py: Same propagation and target update
  • stable_baselines3/td3/td3.py: Same propagation and target update
  • stable_baselines3/ddpg/ddpg.py: Same propagation and target update
  • stable_baselines3/common/type_aliases.py: Add discounts field to sample tuple
  • stable_baselines3/version.txt: Bump version to 2.7.0a0
  • docs/misc/changelog.rst: Document n-step support
  • tests/test_n_step_replay.py: New tests for n-step returns
  • tests/test_buffers.py: Allow None for discounts in buffer tests

Comments suppressed due to low confidence (2)

tests/test_n_step_replay.py:8

  • [nitpick] Missing test for NStepReplayBuffer with optimize_memory_usage=True. Add a test asserting that initializing the buffer with optimize_memory_usage=True raises NotImplementedError.
import pytest

stable_baselines3/common/off_policy_algorithm.py:54

  • [nitpick] The n_steps param doc should clarify that n_steps=1 preserves standard (1-step) behavior and note that optimize_memory_usage is not supported when n_steps>1.
:param n_steps: When n_step > 1, uses n-step return (with the NStepReplayBuffer) when updating the Q-value network.

@araffin araffin changed the title Add NStepReplayBuffer Add NStepReplayBuffer and n_steps arguments for off-policy algorithms Jun 11, 2025
Collaborator

@qgallouedec qgallouedec left a comment


looking good!

@araffin
Member Author

araffin commented Jun 16, 2025

Some benchmark results.
Note: n-step returns do not always improve performance (though n_steps=3 usually gives equal or better results)

SAC - LunarLanderContinuous-v3

 python train.py --algo sac --env LunarLanderContinuous-v3 -P --n-eval-envs 5 --eval-episodes 20 -param learning_rate:0.0005 -n 150000

10 seeds

python -m rl_zoo3.cli all_plots -f logs/ logs/n_steps_3/ -a sac -e LunarLanderContinuous --no-million -o logs/sac_n_steps -l DEFAULT N-STEPS-3

python -m rl_zoo3.cli plot_from_file -i logs/sac_n_steps.pkl --no-million -r -iqm

[Plot: Results_LunarLanderContinuous]

BipedalWalker-v3

3 seeds

python train.py --algo sac --env BipedalWalker-v3 -P -param n_steps:3 -f logs/n_steps_3 --seed 896937089 --n-eval-envs 5 --eval-episodes 20 -n 150000

[Plot: Results_BipedalWalker]

LunarLander and BipedalWalker combined:

[Plot: Rliable_metrics]

Swimmer-v3

Note: only 3 seeds

[Plot: Results_Swimmer]

@araffin araffin merged commit e206fc5 into master Jun 16, 2025
4 checks passed
@araffin araffin deleted the feat/n-step-replay branch June 16, 2025 10:31
jpizarrom added a commit to jpizarrom/lerobot that referenced this pull request Oct 14, 2025
jpizarrom added a commit to jpizarrom/lerobot that referenced this pull request Oct 19, 2025
jpizarrom added a commit to jpizarrom/lerobot that referenced this pull request Oct 27, 2025

Development

Successfully merging this pull request may close these issues.

[feature-request] N-step returns for TD methods
