@Copilot Copilot AI commented Aug 29, 2025

Fixes #1269

Problem

The comment in the _sample_action() method of off-policy algorithms was misleading about action scaling behavior. The comment suggested that predict() returns actions already scaled by tanh to the range [-1, 1]:

# Note: when using continuous actions,
# we assume that the policy uses tanh to scale the action

This caused confusion: users expected actions from predict() to lie in [-1, 1], but predict() actually returns them rescaled ("unscaled", in the codebase's terminology) to the original action space [low, high].

Solution

Updated the comment to accurately describe the behavior:

# Note: when using continuous actions,
# the policy internally uses tanh to bound the action but predict() returns
# actions unscaled to the original action space [low, high]

Technical Details

The actual flow in off-policy algorithms like SAC and TD3 is:

  1. During the policy forward pass, predict() uses tanh internally to bound the action to [-1, 1]
  2. Before returning, predict() unscales that action to the environment's action space [low, high]
  3. _sample_action() then calls scale_action() to map the returned action back to [-1, 1] for replay buffer storage
  4. Finally, unscale_action() maps the action to [low, high] again for environment interaction

This behavior is correct and intentional; the confusion came purely from the comment's wording.
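
As a rough illustration of the flow above, here is a minimal pure-Python sketch for a one-dimensional action. The two scaling helpers mirror the linear mapping between [low, high] and [-1, 1] described in this PR; `sketch_predict` is a hypothetical stand-in for predict(), not the library's implementation, and the raw policy output `0.3` is an arbitrary value chosen for the example.

```python
import math


def scale_action(action, low, high):
    """Map an action from [low, high] to [-1, 1]."""
    return 2.0 * ((action - low) / (high - low)) - 1.0


def unscale_action(scaled_action, low, high):
    """Map an action from [-1, 1] back to [low, high]."""
    return low + 0.5 * (scaled_action + 1.0) * (high - low)


def sketch_predict(raw_output, low, high):
    """Hypothetical stand-in for predict(): tanh-bound, then unscale."""
    squashed = math.tanh(raw_output)  # bounded to [-1, 1]
    return unscale_action(squashed, low, high)


# Assumed bounds for the example (Pendulum-v1 torque range).
low, high = -2.0, 2.0

# Steps 1-2: predict() returns an action in [low, high], not in [-1, 1].
unscaled = sketch_predict(0.3, low, high)

# Step 3: _sample_action() rescales to [-1, 1] for the replay buffer.
buffer_action = scale_action(unscaled, low, high)

# Step 4: the action sent to the environment is back in [low, high].
env_action = unscale_action(buffer_action, low, high)

print(unscaled, buffer_action, env_action)
```

In this sketch the value stored in the buffer equals the tanh output, and the environment receives the same value that the stand-in predict() returned, which is the round trip the updated comment describes.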

Verification

Tested with multiple continuous environments (Pendulum-v1, MountainCarContinuous-v0) to confirm:

  • predict() returns actions within environment bounds [low, high]
  • scale_action() properly converts to [-1, 1] range ✅
  • No existing functionality affected ✅
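
The range checks above can be approximated without installing any environment. The sketch below is not the test that was run for this PR: it assumes the documented action bounds of the two environments ([-2, 2] for Pendulum-v1, [-1, 1] for MountainCarContinuous-v0), and it feeds random values through tanh as a hypothetical substitute for a trained policy.

```python
import math
import random


def scale_action(action, low, high):
    """Map an action from [low, high] to [-1, 1]."""
    return 2.0 * ((action - low) / (high - low)) - 1.0


def unscale_action(scaled_action, low, high):
    """Map an action from [-1, 1] back to [low, high]."""
    return low + 0.5 * (scaled_action + 1.0) * (high - low)


def check_ranges(low, high, n_samples=1000, seed=0, eps=1e-9):
    """Return True if every sampled action respects both ranges."""
    rng = random.Random(seed)
    for _ in range(n_samples):
        # Random pre-tanh value standing in for a policy output (assumption).
        raw = rng.gauss(0.0, 3.0)
        action = unscale_action(math.tanh(raw), low, high)
        if not (low - eps <= action <= high + eps):
            return False  # predict()-style output left [low, high]
        scaled = scale_action(action, low, high)
        if not (-1.0 - eps <= scaled <= 1.0 + eps):
            return False  # buffer-style action left [-1, 1]
    return True


# Assumed action bounds for the two environments named above.
BOUNDS = {
    "Pendulum-v1": (-2.0, 2.0),
    "MountainCarContinuous-v0": (-1.0, 1.0),
}

results = {name: check_ranges(lo, hi) for name, (lo, hi) in BOUNDS.items()}
print(results)
```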

Also added a changelog entry documenting this clarification.



@Copilot Copilot AI changed the title from "[WIP] Update the comment in @DLR-RM/stable-baselines3/files/stable_baselines3/common/policies.py to clarify the behavior, as described in issue #1269, don't forget to update the changelog too." to "Fix misleading comment about action scaling in off-policy algorithms" Aug 29, 2025
@Copilot Copilot AI requested a review from araffin August 29, 2025 10:06
Copilot finished work on behalf of araffin August 29, 2025 10:06
@araffin araffin marked this pull request as ready for review August 29, 2025 10:08
@araffin araffin merged commit f7a89e1 into master Aug 29, 2025
8 checks passed
@araffin araffin deleted the copilot/fix-1948d31d-2884-4121-a195-9f6ecb239ad7 branch August 29, 2025 14:25
Development

Successfully merging this pull request may close these issues.

[Question] Wrong scaled_action for continuous actions in _sample_action()?
2 participants