@Copilot Copilot AI commented Aug 27, 2025

Fixes #1767

This PR adds documentation clarifying the behavior of gSDE (Generalized State-Dependent Exploration) during inference for models trained with use_sde=True.

Problem

Users of models trained with gSDE expect that setting deterministic=False in model.predict() yields stochastic behavior similar to training. However, the automatic noise resampling that sde_sample_freq controls during training does not occur during inference. The exploration noise is therefore never resampled, and the policy returns the same action for the same observation even when deterministic=False.
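
The mechanism can be illustrated with a toy model. The sketch below is not Stable-Baselines3 code: `ToyGSDEPolicy`, its weights, and its feature vector are hypothetical stand-ins, and it assumes only the general idea of state-dependent exploration, namely that the perturbation is a function of the input and of exploration weights that stay fixed until they are resampled.

```python
import random


class ToyGSDEPolicy:
    """Toy illustration of state-dependent exploration (hypothetical, not SB3 code)."""

    def __init__(self, n_features, seed=0):
        self._rng = random.Random(seed)
        self.mean_weights = [0.5] * n_features
        self.noise_std = 0.3
        self.exploration_weights = [0.0] * n_features
        self.reset_noise()

    def reset_noise(self):
        # Resample the exploration weights; between calls they stay fixed.
        self.exploration_weights = [
            self._rng.gauss(0.0, self.noise_std) for _ in self.mean_weights
        ]

    def predict(self, features, deterministic=False):
        mean = sum(w * x for w, x in zip(self.mean_weights, features))
        if deterministic:
            return mean
        # The perturbation is a linear function of the features, so it only
        # changes when the exploration weights are resampled.
        noise = sum(w * x for w, x in zip(self.exploration_weights, features))
        return mean + noise


policy = ToyGSDEPolicy(n_features=3)
features = [1.0, -2.0, 0.5]

first = policy.predict(features, deterministic=False)
second = policy.predict(features, deterministic=False)
same_without_reset = first == second  # no resampling happened in between

policy.reset_noise()
third = policy.predict(features, deterministic=False)
changed_after_reset = third != first  # new weights give a new perturbation
```

In this toy setting, two "stochastic" predictions on the same input agree until `reset_noise()` is called, which mirrors the behavior described above.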

Solution

Added informative notes to the documentation of PPO, SAC, and A2C explaining:

  1. Expected behavior: gSDE is primarily designed for training, with deterministic behavior recommended for inference
  2. Current limitation: model.predict() does not automatically reset noise, leading to deterministic behavior even with deterministic=False
  3. Workaround: If stochastic behavior is needed during inference, users must manually call model.policy.reset_noise(env.num_envs) at appropriate intervals

Example of the documented behavior

from stable_baselines3 import PPO

# Training with gSDE
model = PPO("MlpPolicy", "Pendulum-v1", use_sde=True, sde_sample_freq=4)
model.learn(total_timesteps=10000)

vec_env = model.get_env()
obs = vec_env.reset()
# During inference: deterministic behavior (recommended)
action, _ = model.predict(obs, deterministic=True)

# If stochastic behavior is needed during inference,
# the noise must be reset manually
n_steps = 1000  # arbitrary number of steps for this example
for step_count in range(n_steps):
    if step_count % 4 == 0:  # Reset every 4 steps (matching sde_sample_freq)
        model.policy.reset_noise(vec_env.num_envs)

    action, _ = model.predict(obs, deterministic=False)
    obs, reward, done, info = vec_env.step(action)
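
The manual reset in the loop above could also be packaged in a small wrapper. The following is a minimal sketch, not part of this PR or of the library: `PeriodicNoiseReset` is a hypothetical helper that assumes only the two calls shown in the example, `model.policy.reset_noise(n_envs)` and `model.predict(obs, deterministic=False)`. The stub classes exist solely so the snippet runs without a trained model.

```python
class PeriodicNoiseReset:
    """Hypothetical helper: resets gSDE noise every `sample_freq` predictions."""

    def __init__(self, model, n_envs, sample_freq):
        if sample_freq < 1:
            raise ValueError("sample_freq must be >= 1")
        self.model = model
        self.n_envs = n_envs
        self.sample_freq = sample_freq
        self._calls = 0

    def predict(self, obs):
        # Resample the noise on calls 0, sample_freq, 2 * sample_freq, ...
        if self._calls % self.sample_freq == 0:
            self.model.policy.reset_noise(self.n_envs)
        self._calls += 1
        return self.model.predict(obs, deterministic=False)


# Stand-ins for a trained model, used only to make this sketch runnable.
class _StubPolicy:
    def __init__(self):
        self.reset_calls = 0

    def reset_noise(self, n_envs=1):
        self.reset_calls += 1


class _StubModel:
    def __init__(self):
        self.policy = _StubPolicy()

    def predict(self, obs, deterministic=False):
        return obs, None


stub = _StubModel()
wrapped = PeriodicNoiseReset(stub, n_envs=1, sample_freq=4)
for _ in range(10):
    action, _state = wrapped.predict([0.0])
reset_calls = stub.policy.reset_calls  # resets happen at calls 0, 4 and 8
```

With a real model, the wrapper would presumably be built as `PeriodicNoiseReset(model, vec_env.num_envs, sample_freq=4)` and its `predict` used in place of `model.predict` inside the loop.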

The changes are documentation-only and do not modify any code behavior. They answer the question raised in #1767 and give practical guidance on using gSDE during inference.



@Copilot Copilot AI changed the title from "[WIP] I would like to fix issue #1767 by updating the documentation" to "Document gSDE inference behavior for PPO, SAC, and A2C" Aug 27, 2025
@Copilot Copilot AI requested a review from araffin August 27, 2025 09:47
Copilot finished work on behalf of araffin August 27, 2025 09:47
@araffin

araffin commented Aug 27, 2025

@copilot please update the changelog too


Copilot AI commented Aug 27, 2025

> @copilot please update the changelog too

Updated the changelog with a documentation entry for the gSDE inference behavior clarification (commit 5a3bf4a).

Copilot finished work on behalf of araffin August 27, 2025 09:54
@araffin araffin marked this pull request as ready for review August 27, 2025 10:13
@araffin araffin merged commit 3b249c0 into master Aug 27, 2025
4 checks passed
@araffin araffin deleted the copilot/fix-2d05eea8-03ea-4b3d-a808-4bb4c5a804d4 branch August 27, 2025 10:21
Development

Successfully merging this pull request may close these issues.

[Question] User needs to reset gSDE noise when using learned model?
2 participants