Recalculate Returns and Advantages After Callback to Ensure Reward Consistency (common/on_policy_algorithm.py) #2000
Hello There, SB3 Team! 👋
I hope this message finds you well. First and foremost, I want to extend my gratitude for the fantastic work you’ve done on the Stable Baselines3 library. It's truly a remarkable tool and has been invaluable for my projects.
I apologize in advance for any inconsistencies or if I have missed any items in the checklist. Your understanding and guidance are greatly appreciated.
In this pull request, I've added a second call to `rollout_buffer.compute_returns_and_advantage()` after invoking the `on_rollout_end()` callback. This accounts for any modifications to the rewards made by custom callback logic. By recalculating returns and advantages, we ensure that training reflects the most up-to-date reward structure, improving the accuracy and stability of the advantage estimates during the policy update step. This is especially useful when a callback manipulates rewards for custom reward shaping or augmentation.
Description
Recalculated returns and advantages after the `on_rollout_end()` callback to handle any reward transformations performed by custom callbacks. This ensures that the rollout buffer reflects the updated rewards when computing advantages for the policy update step.

Previously, `rollout_buffer.compute_returns_and_advantage()` was only called before the callback, so any rewards modified during `on_rollout_end()` were not reflected in the advantages. Adding a second call to `compute_returns_and_advantage()` ensures that the correct values are used for training.

Code Changes
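A minimal sketch of the change at the tail of `OnPolicyAlgorithm.collect_rollouts()`; the surrounding code is abbreviated, and `values` and `dones` come from the final step of the rollout loop as in the existing implementation:

```python
# Tail of OnPolicyAlgorithm.collect_rollouts(), abbreviated.
# `values` and `dones` are produced by the last step of the rollout loop.
rollout_buffer.compute_returns_and_advantage(last_values=values, dones=dones)

callback.on_rollout_end()

# New: recompute returns/advantages in case the callback modified
# rollout_buffer.rewards (e.g. for reward shaping or augmentation).
rollout_buffer.compute_returns_and_advantage(last_values=values, dones=dones)
```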
Motivation and Context
Why is this change required?
This change ensures that if a callback modifies rewards during `on_rollout_end()`, the computed returns and advantages are updated accordingly. Without recalculating these values, any reward changes made by the callback would not be reflected in the advantage estimates used for policy updates, potentially degrading training performance.

By recalculating returns and advantages, we use the most accurate representation of the updated rewards, improving training stability and effectiveness, especially when using reward shaping techniques or custom callbacks; a sketch of such a callback follows.
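For concreteness, here is a minimal sketch of the kind of reward-shaping callback this change targets. `RewardShapingCallback` and the 0.5 scaling factor are illustrative assumptions, not part of this PR:

```python
from stable_baselines3 import PPO
from stable_baselines3.common.callbacks import BaseCallback


class RewardShapingCallback(BaseCallback):
    """Illustrative callback that rescales the rewards stored in the
    rollout buffer after collection, before the policy update."""

    def _on_step(self) -> bool:
        # Required hook; returning True keeps training going.
        return True

    def _on_rollout_end(self) -> None:
        # self.model is attached by SB3; rollout_buffer.rewards is a
        # numpy array of shape (buffer_size, n_envs). Without this PR,
        # the advantages computed before this hook ignore the change below.
        self.model.rollout_buffer.rewards *= 0.5


model = PPO("MlpPolicy", "CartPole-v1")
model.learn(total_timesteps=2_048, callback=RewardShapingCallback())
```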
Types of changes
Checklist
- I have reformatted the code using `make format` (required)
- I have checked the codestyle using `make check-codestyle` and `make lint` (required)
- I have ensured `make pytest` and `make type` both pass. (required)
- I have checked that the documentation builds using `make doc` (required)

Note: You can run most of the checks using `make commit-checks`.

Note: we are using a maximum length of 127 characters per line.