Fix misleading comment about action scaling in off-policy algorithms #2171
Add this suggestion to a batch that can be applied as a single commit.
This suggestion is invalid because no changes were made to the code.
Suggestions cannot be applied while the pull request is closed.
Suggestions cannot be applied while viewing a subset of changes.
Only one suggestion per line can be applied in a batch.
Add this suggestion to a batch that can be applied as a single commit.
Applying suggestions on deleted lines is not supported.
You must change the existing code in this line in order to create a valid suggestion.
Outdated suggestions cannot be applied.
This suggestion has been applied or marked resolved.
Suggestions cannot be applied from pending reviews.
Suggestions cannot be applied on multi-line comments.
Suggestions cannot be applied while the pull request is queued to merge.
Suggestion cannot be applied right now. Please check back later.
Fixes #1269
Problem
The comment in the
_sample_action()
method of off-policy algorithms was misleading about action scaling behavior. The comment suggested thatpredict()
returns actions already scaled by tanh to the range[-1, 1]
:This caused confusion because users expected actions from
predict()
to be in[-1, 1]
, but they are actually unscaled to the original action space[low, high]
.Solution
Updated the comment to accurately describe the behavior:
Technical Details
The actual flow in off-policy algorithms like SAC and TD3 is:
predict()
method internally uses tanh for action bounding during policy forward passpredict()
returns actions unscaled to the environment's action space[low, high]
_sample_action()
then callsscale_action()
to convert these to[-1, 1]
for replay buffer storageunscale_action()
converts back to[low, high]
for environment interactionThis behavior is correct and intentional - the confusion was purely in the comment's wording.
Verification
Tested with multiple continuous environments (Pendulum-v1, MountainCarContinuous-v0) to confirm:
predict()
returns actions within environment bounds[low, high]
✅scale_action()
properly converts to[-1, 1]
range ✅Also added changelog entry documenting this clarification.
💬 Share your feedback on Copilot coding agent for the chance to win a $200 gift card! Click here to start the survey.