
Conversation

@kyleavery (Contributor)

Description

This PR doesn't change existing behavior, but it adds new options to the RLConfig:

  • use_stepwise_advantage - If True, treat each assistant turn as its own training sample and use a discounted return per step.
  • stepwise_aggregation - How to compute the discounted per-step return R_t from future rewards.
  • stepwise_gamma - Discount factor gamma applied to rewards from later turns.
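For reference, a minimal sketch of the discounted per-step return described above (the function name and signature are illustrative, not the PR's actual API): each turn's return is R_t = sum over k >= t of gamma^(k - t) * r_k, computed efficiently by a backward pass over the per-turn rewards.

```python
def stepwise_returns(rewards: list[float], gamma: float = 1.0) -> list[float]:
    """Compute discounted returns R_t for each turn from future rewards.

    R_t = r_t + gamma * R_{t+1}, accumulated from the last turn backward.
    """
    returns = []
    running = 0.0
    for r in reversed(rewards):
        running = r + gamma * running
        returns.append(running)
    returns.reverse()
    return returns
```

With gamma=1.0 every turn receives the plain sum of its future rewards; smaller gamma down-weights reward that arrives many turns later, so earlier turns are credited mostly for near-term outcomes.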

It also adds a new parameter to MultiTurnEnv called exclude_think, which removes the CoT from previous turns. I think this only makes sense when using the stepwise advantage implementation.
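A rough sketch of what exclude_think does conceptually, assuming reasoning is delimited by `<think>...</think>` tags in assistant messages (the tag name and message schema here are assumptions for illustration, not the PR's exact implementation):

```python
import re

# Matches a <think>...</think> block plus any trailing whitespace.
_THINK_RE = re.compile(r"<think>.*?</think>\s*", flags=re.DOTALL)

def strip_think(messages: list[dict]) -> list[dict]:
    """Return a copy of the chat history with CoT removed from assistant turns."""
    cleaned = []
    for msg in messages:
        if msg.get("role") == "assistant":
            msg = {**msg, "content": _THINK_RE.sub("", msg["content"])}
        cleaned.append(msg)
    return cleaned
```

Dropping stale CoT keeps earlier turns short and avoids training later steps to condition on reasoning text that was never meant to persist across turns.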

The implementation is based on Kevin: Multi-Turn RL for Generating CUDA Kernels.

Type of Change

  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to not work as expected)
  • Documentation update
  • Test improvement

Testing

  • All existing tests pass when running uv run pytest locally.
  • New tests have been added to cover the changes

Checklist

  • My code follows the style guidelines of this project as outlined in AGENTS.md
  • I have performed a self-review of my own code
  • I have commented my code, particularly in hard-to-understand areas
  • I have made corresponding changes to the documentation
  • My changes generate no new warnings
  • Any dependent changes have been merged and published

Additional Notes

ronaldnetawat pushed a commit to ronaldnetawat/verifiers that referenced this pull request Nov 13, 2025
* Extract non-ckpt util function

* Implement train ckpt manager

* Fix create ckpt path as dir

* Do not ckpt at step 0

* Change ckpt name

* Fix ckpt step path

* Save correct progress state at correct step

* Auto-setup max steps

* Do not clean by default

* Misc improvements to trainer ckpt

* Implement orchestrator ckpt

* Use same stop condition logic across trainer + orchestrator

* Use correct logprob model when resuming

* Use resume step instead of path

* Cleanup

* Resume with async

* Init optimizer before load from ckpt

* Ensure resume step at least 1

* Load the async logprob model from weights checkpoint

* Save pytorch model HF compatible

* Load all async logprob models

* Fix start ckpt_step

* Save all async logprob models correctly

* Fixes to async resume logic

* Enable resuming a W&B run

* Off by one

* Implement async save

* Disable ckpt by default

* Remove ckpt getter

* Fix type

* Fix keeping async weight ckpts for restart

* Remove description

* Only clean if not resume

* Docs

* Increase default seq len to avoid overlong prompt error
