KL Adaptive LR for PPO and LR schedule for SAC/TQC #72
Conversation
Pull Request Overview
This PR introduces adaptive learning rate scheduling based on KL divergence for PPO, extends learning rate schedules and dynamic updates to SAC and TQC, updates the SBX package version and dependencies, and adds optimizer hyperparameter injection across various policies.
- Implement `KLAdaptiveLR` and integrate it into PPO (`target_kl`) for adaptive LR control.
- Refactor SAC and TQC to use `lr_schedule` for initial and per-step learning rates.
- Inject hyperparameters into JAX optimizers for dynamic LR updates (sketched below) and add `ortho_init` support in PPO policies.
- Bump SBX version to 0.21.0 and update the `stable_baselines3` requirement to `>=2.6.1a1`.
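The optimizer hyperparameter injection mentioned above builds on `optax.inject_hyperparams`, which moves the learning rate into the optimizer state so it can be overwritten between gradient steps. A minimal sketch of the mechanism (illustrative only, not the PR's exact wiring):

```python
import jax.numpy as jnp
import optax

# Wrapping the optimizer factory makes "learning_rate" part of the optimizer
# state instead of a baked-in constant.
tx = optax.inject_hyperparams(optax.adam)(learning_rate=3e-4)
params = {"w": jnp.zeros(3)}
opt_state = tx.init(params)

# Later (e.g. after a schedule or KL-based controller picks a new value),
# the stored learning rate can simply be overwritten:
opt_state.hyperparams["learning_rate"] = 1e-4

grads = {"w": jnp.ones(3)}
updates, opt_state = tx.update(grads, opt_state, params)
params = optax.apply_updates(params, updates)
```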
Reviewed Changes
Copilot reviewed 14 out of 14 changed files in this pull request and generated 3 comments.
| File | Description |
|---|---|
| tests/test_run.py | Added target_kl argument to test_ppo to cover adaptive LR scheduling in PPO tests. |
| setup.py | Updated SBX dependency on stable_baselines3 to a pre-release >=2.6.1a1,<3.0. |
| sbx/version.txt | Bumped SBX package version from 0.20.0 to 0.21.0. |
| sbx/common/utils.py | Introduced KLAdaptiveLR dataclass for KL-based adaptive learning rate adjustments. |
| sbx/common/on_policy_algorithm.py | Added _update_learning_rate to allow in-flight hyperparameter injection for on-policy algos. |
| sbx/common/off_policy_algorithm.py | Added override of _update_learning_rate and renamed qf_learning_rate to initial_qf_learning_rate. |
| sbx/tqc/tqc.py | Switched optimizers to use lr_schedule at init and call dynamic LR updates during training. |
| sbx/tqc/policies.py | Injected optimizer hyperparameters via optax.inject_hyperparams and updated log_std_init default. |
| sbx/sac/sac.py | Mirrored TQC changes: use lr_schedule for initial LR and update per training step. |
| sbx/sac/policies.py | Refactored policy builder to inject hyperparameters into optimizers for dynamic LR updates. |
| sbx/ppo/ppo.py | Integrated KLAdaptiveLR, switched to FloatSchedule, extended _one_update to return ratio, and updated training loop for adaptive LR. |
| sbx/ppo/policies.py | Added ortho_init flag and optimizer hyperparameter injection; updated default log_std_init. |
| sbx/dqn/dqn.py | Replaced legacy get_linear_fn with the new LinearSchedule utility. |
| sbx/crossq/crossq.py | Changed optimizer setup to use lr_schedule(1) instead of fixed learning rate. |
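For readers unfamiliar with the SB3 convention referenced in the table: `lr_schedule` is a callable that maps the remaining training progress (from 1.0 at the start down to 0.0 at the end) to a learning rate, so `lr_schedule(1)` yields the initial learning rate. A rough illustration of that behavior (not the PR's code):

```python
def linear_lr_schedule(initial_lr: float, final_lr: float = 0.0):
    """SB3-style schedule: progress_remaining goes from 1.0 (start) to 0.0 (end)."""
    def schedule(progress_remaining: float) -> float:
        return final_lr + progress_remaining * (initial_lr - final_lr)
    return schedule

lr_schedule = linear_lr_schedule(3e-4)
print(lr_schedule(1.0))  # 3e-4: initial LR, what a call like lr_schedule(1) returns
print(lr_schedule(0.5))  # 1.5e-4: halfway through training
```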
Comments suppressed due to low confidence (1)
sbx/common/utils.py:7 (`class KLAdaptiveLR:`)
- Add unit tests for `KLAdaptiveLR.update` to verify its behavior when `kl_div` values are above, below, or within the target KL margin.
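Such tests are straightforward once the controller's semantics are fixed. As a hedged sketch only (the field names, margin, and multiplicative rule below are assumptions, not necessarily what sbx/common/utils.py implements), a KL-adaptive controller and the three cases the suppressed comment asks about might look like this:

```python
from dataclasses import dataclass


@dataclass
class KLAdaptiveLR:
    # Hypothetical sketch: shrink the LR when the policy moved too far
    # (KL above target), grow it when it barely moved (KL below target).
    target_kl: float
    current_adaptive_lr: float = 3e-4
    kl_margin: float = 2.0          # tolerated ratio around target_kl
    adaptive_lr_factor: float = 1.5  # multiplicative step for the LR
    min_lr: float = 1e-6
    max_lr: float = 1e-2

    def update(self, kl_div: float) -> None:
        if kl_div > self.target_kl * self.kl_margin:
            self.current_adaptive_lr /= self.adaptive_lr_factor
        elif kl_div < self.target_kl / self.kl_margin:
            self.current_adaptive_lr *= self.adaptive_lr_factor
        self.current_adaptive_lr = min(max(self.current_adaptive_lr, self.min_lr), self.max_lr)


def test_kl_adaptive_lr():
    lr = KLAdaptiveLR(target_kl=0.01, current_adaptive_lr=1e-3)
    lr.update(kl_div=0.05)   # far above target -> LR decreases
    assert lr.current_adaptive_lr < 1e-3

    lr = KLAdaptiveLR(target_kl=0.01, current_adaptive_lr=1e-3)
    lr.update(kl_div=0.001)  # far below target -> LR increases
    assert lr.current_adaptive_lr > 1e-3

    lr = KLAdaptiveLR(target_kl=0.01, current_adaptive_lr=1e-3)
    lr.update(kl_div=0.01)   # within the margin -> LR unchanged
    assert lr.current_adaptive_lr == 1e-3
```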
```diff
  # Note: most gSDE parameters are not used
  # this is to keep API consistent with SB3
- log_std_init: float = -3,
+ log_std_init: float = 0.0,
```
Copilot AI (May 19, 2025):
Changing the default log_std_init from -3 to 0.0 alters the initial policy variance and is a breaking behavior change; document this in the changelog or consider deprecation notices.
Suggested change:
```diff
- log_std_init: float = 0.0,
+ log_std_init: float = -3,
```
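For context on why this default change is behavior-breaking: with the usual Gaussian-policy parameterization the initial action standard deviation is exp(log_std_init), so the default jumps from a near-deterministic initial policy to unit-variance exploration:

```python
import numpy as np

# std = exp(log_std): effect of the changed default on the initial action noise.
print(np.exp(-3.0))  # ~0.05: old default, tight initial action distribution
print(np.exp(0.0))   # 1.0:   new default, much broader initial exploration
```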
```python
        excluded.remove("policy")
        return excluded

    def _update_learning_rate(  # type: ignore[override]
```
Copilot AI (May 19, 2025):
[nitpick] The learning rate update logic is duplicated in both on-policy and off-policy algorithm classes; consider refactoring this into a shared utility to reduce code duplication.
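One hedged way to act on this nitpick (the helper name is hypothetical, and it assumes the optimizers were built with `optax.inject_hyperparams` as in the sketch near the top) would be a small shared utility that both `_update_learning_rate` overrides delegate to:

```python
from typing import Iterable


def set_learning_rate(opt_states: Iterable, learning_rate: float) -> None:
    """Overwrite the LR stored in each injected-hyperparams optimizer state."""
    for opt_state in opt_states:
        opt_state.hyperparams["learning_rate"] = learning_rate
```

The on-policy and off-policy classes would then differ only in which optimizer states they pass in.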
```python
    # For MultiDiscrete
    max_num_choices: int = 0
    split_indices: np.ndarray = field(default_factory=lambda: np.array([]))
    # Last layer with small scale
```
Copilot AI (May 19, 2025):
[nitpick] Document the new ortho_init parameter in the class docstring and __init__ signature to clarify its purpose and how it affects weight initialization.
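As a rough illustration of what such documentation could convey (the module and wiring below are hypothetical, not the PR's code): `ortho_init` typically toggles orthogonal kernel initialization for the network layers, with a smaller gain on the final layer, e.g. in Flax:

```python
import numpy as np
import flax.linen as nn


class Critic(nn.Module):
    # Hypothetical module, only to illustrate the effect of the ortho_init flag.
    ortho_init: bool = False

    @nn.compact
    def __call__(self, x):
        if self.ortho_init:
            # Orthogonal init with sqrt(2) gain for hidden layers,
            # and a small gain for the last layer.
            hidden_init = nn.initializers.orthogonal(np.sqrt(2))
            last_init = nn.initializers.orthogonal(0.01)
        else:
            hidden_init = nn.initializers.lecun_normal()  # Flax default
            last_init = nn.initializers.lecun_normal()
        x = nn.relu(nn.Dense(64, kernel_init=hidden_init)(x))
        return nn.Dense(1, kernel_init=last_init)(x)
```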
Description
See #55 and #25
Motivation and Context
Types of changes
Checklist:
- `make format` (required)
- `make check-codestyle` and `make lint` (required)
- `make pytest` and `make type` both pass (required)
- `make doc` (required)

Note: You can run most of the checks using `make commit-checks`.
Note: we are using a maximum length of 127 characters per line.