
Commit 179725c

Merge branch 'master' into copilot/fix-b1975246-02f5-4774-8cf1-737731887c11
2 parents: f6fa72f + f7a89e1

2 files changed: +3 −1 lines changed

docs/misc/changelog.rst

Lines changed: 1 addition & 0 deletions
@@ -36,6 +36,7 @@ Documentation:
 - Added plotting documentation and examples
 - Added documentation clarifying gSDE (Generalized State-Dependent Exploration) inference behavior for PPO, SAC, and A2C algorithms
 - Documented Atari wrapper reset behavior where ``env.reset()`` may perform a no-op step instead of truly resetting when ``terminal_on_life_loss=True`` (default), and how to avoid this behavior by setting ``terminal_on_life_loss=False``
+- Clarified comment in ``_sample_action()`` method to better explain action scaling behavior for off-policy algorithms (@copilot)
 
 
 Release 2.7.0 (2025-07-25)
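
For context on the Atari wrapper entry above: the flag it refers to is ``terminal_on_life_loss`` on SB3's ``AtariWrapper``. A minimal sketch of opting out of the no-op reset behavior (the environment id is chosen for illustration and needs the Atari extras installed):

import gymnasium as gym
from stable_baselines3.common.atari_wrappers import AtariWrapper

# With terminal_on_life_loss=True (the default), env.reset() after a lost life
# may perform a no-op step instead of truly resetting; False restores full resets.
env = AtariWrapper(gym.make("BreakoutNoFrameskip-v4"), terminal_on_life_loss=False)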

stable_baselines3/common/off_policy_algorithm.py

Lines changed: 2 additions & 1 deletion
@@ -391,7 +391,8 @@ def _sample_action(
             unscaled_action = np.array([self.action_space.sample() for _ in range(n_envs)])
         else:
             # Note: when using continuous actions,
-            # we assume that the policy uses tanh to scale the action
+            # the policy internally uses tanh to bound the action but predict() returns
+            # actions unscaled to the original action space [low, high]
             # We use non-deterministic action in the case of SAC, for TD3, it does not matter
             assert self._last_obs is not None, "self._last_obs was not set"
             unscaled_action, _ = self.predict(self._last_obs, deterministic=False)
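
The behavior the new comment describes is the affine map between the tanh-bounded range [-1, 1] and the environment's Box bounds. A minimal sketch of that rescaling, mirroring the ``scale_action``/``unscale_action`` helpers on SB3's ``BasePolicy`` (the standalone functions and the Pendulum-style bounds below are illustrative, not the library API):

import numpy as np

def scale_action(action: np.ndarray, low: np.ndarray, high: np.ndarray) -> np.ndarray:
    # Map an action from the environment's [low, high] Box to [-1, 1],
    # the range a tanh-squashed policy output lives in.
    return 2.0 * ((action - low) / (high - low)) - 1.0

def unscale_action(scaled_action: np.ndarray, low: np.ndarray, high: np.ndarray) -> np.ndarray:
    # Inverse map: from [-1, 1] back to the original [low, high] action space.
    return low + 0.5 * (scaled_action + 1.0) * (high - low)

# Illustration with a Box(-2, 2) action space (e.g. Pendulum-v1's bounds):
low, high = np.array([-2.0]), np.array([2.0])
squashed = np.tanh(np.array([0.5]))  # tanh bounds the raw policy output to (-1, 1)
env_action = unscale_action(squashed, low, high)  # ~[0.924], inside [-2, 2]
assert np.allclose(scale_action(env_action, low, high), squashed)

This round trip is why the comment distinguishes the two spaces: the policy samples in the tanh range, while ``predict()`` hands back actions already mapped to [low, high], so callers never see the squashed values.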
