Fix misleading comment about action scaling in off-policy algorithms #2171

Copilot · 2025-08-29T09:56:07Z

Problem

The comment in the _sample_action() method of off-policy algorithms was misleading about action scaling behavior. The comment suggested that predict() returns actions already scaled by tanh to the range [-1, 1]:

# Note: when using continuous actions,
# we assume that the policy uses tanh to scale the action

This caused confusion because users expected actions from predict() to be in [-1, 1], but they are actually unscaled to the original action space [low, high].

Solution

Updated the comment to accurately describe the behavior:

# Note: when using continuous actions,
# the policy internally uses tanh to bound the action but predict() returns
# actions unscaled to the original action space [low, high]

Technical Details

The actual flow in off-policy algorithms like SAC and TD3 is:

1. predict() internally applies tanh to bound the action during the policy's forward pass.
2. predict() then returns the action unscaled to the environment's action space [low, high].
3. _sample_action() calls scale_action() to convert that action to [-1, 1] for replay buffer storage.
4. Finally, unscale_action() converts the scaled action back to [low, high] for environment interaction.
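
The flow above can be illustrated with a small standalone sketch. This is not the library's code: the two helpers below are hypothetical stand-ins that assume scale_action() and unscale_action() are the usual affine maps between [low, high] and [-1, 1], and the bounds and raw policy output are made-up values (the bounds match Pendulum-v1's [-2, 2] action range).

```python
import math


def scale_action(action: float, low: float, high: float) -> float:
    """Map an action from [low, high] to [-1, 1] (assumed affine map)."""
    return 2.0 * ((action - low) / (high - low)) - 1.0


def unscale_action(scaled_action: float, low: float, high: float) -> float:
    """Map an action from [-1, 1] back to [low, high] (inverse of scale_action)."""
    return low + 0.5 * (scaled_action + 1.0) * (high - low)


# Illustrative bounds, matching Pendulum-v1's action range.
low, high = -2.0, 2.0

# Hypothetical pre-squash output of the policy network.
raw_output = 0.3

# Step 1: tanh bounds the action to [-1, 1] inside the policy.
squashed = math.tanh(raw_output)

# Step 2: what predict() would hand back, expressed in [low, high].
predicted = unscale_action(squashed, low, high)

# Step 3: _sample_action() rescales to [-1, 1] for the replay buffer.
buffer_action = scale_action(predicted, low, high)

# Step 4: the action sent to the environment is back in [low, high].
env_action = unscale_action(buffer_action, low, high)

print(predicted, buffer_action, env_action)
```

Under these assumptions the buffer action equals the tanh-squashed value and the environment action equals what predict() returned, which is why the round trip is harmless when no exploration noise is added in between.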

This behavior is correct and intentional; the confusion came purely from the comment's wording.

Verification

Tested with multiple continuous environments (Pendulum-v1, MountainCarContinuous-v0) to confirm:

predict() returns actions within environment bounds [low, high] ✅
scale_action() properly converts to [-1, 1] range ✅
No existing functionality affected ✅

Also added a changelog entry documenting this clarification.

Initial plan

46f1193

Copilot AI assigned Copilot and araffin Aug 29, 2025

Copilot started work on behalf of araffin August 29, 2025 09:56

Fix misleading comment in _sample_action method to clarify action sca…

ff67793

…ling behavior Co-authored-by: araffin <[email protected]>

Copilot AI requested a review from araffin August 29, 2025 10:06

Copilot finished work on behalf of araffin August 29, 2025 10:06

araffin marked this pull request as ready for review August 29, 2025 10:08

araffin approved these changes Aug 29, 2025

araffin merged commit f7a89e1 into master Aug 29, 2025
8 checks passed

araffin deleted the copilot/fix-1948d31d-2884-4121-a195-9f6ecb239ad7 branch August 29, 2025 14:25
