Stepwise Advantages for Multi-Turn Training #536

kyleavery · 2025-11-05T19:19:45Z

Description

This PR doesn't change existing behavior, but it adds new options to the RLConfig:

use_stepwise_advantage - If True, treat each assistant turn as its own training sample and use a discounted return per step.
stepwise_aggregation - How to compute discounted per-step return R_t from future rewards.
stepwise_gamma - Discount factor gamma for previous turns.

It also adds a new parameter to MultiTurnEnv called exclude_think which will remove the CoT from previous turns. I think this only makes sense when using the stepwise advantage implementation.

The implementation is based on Kevin: Multi-Turn RL for Generating CUDA Kernels.

Type of Change

Bug fix (non-breaking change which fixes an issue)
New feature (non-breaking change which adds functionality)
Breaking change (fix or feature that would cause existing functionality to not work as expected)
Documentation update
Test improvement

Testing

All existing tests pass when running uv run pytest locally.
New tests have been added to cover the changes

Checklist

My code follows the style guidelines of this project as outlined in AGENTS.md
I have performed a self-review of my own code
I have commented my code, particularly in hard-to-understand areas
I have made corresponding changes to the documentation
My changes generate no new warnings
Any dependent changes have been merged and published

Additional Notes

…erifiers into multiturn-discounted-returns

* Extract non-ckpt util function * Implement train ckpt manager * Fix create ckpt path as dir * Do not ckpt at step 0 * Change ckpt name * Fix ckpt step path * Save correct progress state at correct step * Auto-setup max steps * Do not clean by default * Misc improvements to trainer ckpt * Implement orchestrator ckpt * Use same stop condition logic across trainer + orchestrator * Use correct logprob model when resuming * Use resume step instead of path * Cleanup * Resume with async * Init optimizer before load from ckpt * Ensure resume step at least 1 * Load the async logprob model from weights checkpoint * Save pytorch model HF compatible * Load all async logprob models * Fix start ckpt_step * Save all async logprob models correctly * Fixes to async resume logic * Enable resuming a W&B run * Off by one * Implement async save * Disable ckpt by default * Remove ckpt getter * Fix type * Fix keeping async weight ckpts for restart * Remove description * Only clean if not resume * Docs * Increase default seq len to avoid overlong prompt error

kyleavery added 3 commits November 3, 2025 16:22

add exclude_think param to MultiTurnEnv

4a35f2b

implement stepwise advantages

1b12878

stepwise tests

f98f5d8

kyleavery mentioned this pull request Nov 5, 2025

Add exclude_think param to MultiTurnEnv #527

Closed

13 tasks

kyleavery added 5 commits November 5, 2025 13:21

Merge branch 'main' into multiturn-discounted-returns

97c81cd

fix stepwise tests

8f5f6d0

Merge branch 'multiturn-discounted-returns' of github.com:kyleavery/v…

11d8cf4

…erifiers into multiturn-discounted-returns

update config

f2c597f

fix logging

f97729b

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Stepwise Advantages for Multi-Turn Training #536

Stepwise Advantages for Multi-Turn Training #536

Uh oh!

kyleavery commented Nov 5, 2025

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Stepwise Advantages for Multi-Turn Training #536

Are you sure you want to change the base?

Stepwise Advantages for Multi-Turn Training #536

Uh oh!

Conversation

kyleavery commented Nov 5, 2025

Description

Type of Change

Testing

Checklist

Additional Notes

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant