
Conversation

hallerite commented Nov 7, 2025

Description

Adds GymEnv, an optional Environment subclass that runs classic reset/step simulator loops. This lets you plug Gym-style environments directly into verifiers (including custom simulators defined inside this repo) without first converting them into a static dataset.
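To make the reset/step contract concrete, here is a minimal toy simulator in that shape. The env class and its episode logic are purely illustrative (not part of this PR); the only thing GymEnv assumes is the classic `reset(**info)` / `step(action)` loop:

```python
# Toy Gym-style simulator: the agent must guess a hidden target number.
# Illustrative only -- GymEnv just needs the reset/step shape shown here.

class GuessNumberEnv:
    """Single-step guessing game with the classic reset/step contract."""

    def __init__(self, target: int = 7):
        self.target = target
        self.done = False

    def reset(self, **info):
        # Per-episode info (e.g. a dataset row) can override configuration.
        self.target = info.get("target", self.target)
        self.done = False
        return "Guess a number between 0 and 9."  # initial observation

    def step(self, action: int):
        # Return (observation, reward, done, extras), Gym-style.
        reward = 1.0 if action == self.target else 0.0
        self.done = True  # one-step episodes keep the example short
        return f"You guessed {action}.", reward, self.done, {}


env = GuessNumberEnv()
obs = env.reset(target=3)
obs, reward, done, _ = env.step(3)
```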

GymEnv supports:

  • Homogeneous mode (env_cls): one env class with optional env_kwargs; dataset rows feed directly into reset(**info).
  • Heterogeneous mode (env_registry): a registry of env classes; each dataset row specifies info.env_type and optional info.env_kwargs to select/configure the env per rollout.
  • Custom mode (subclass + _make_env): full control over environment construction.
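A hedged sketch of how these three modes might resolve an env class per rollout. The names (`env_cls`, `env_kwargs`, `env_registry`, `env_type`, `_make_env`) follow the PR description, but this stand-alone `make_env` function and the two toy simulator classes are hypothetical; the actual GymEnv implementation may differ:

```python
# Hypothetical stand-ins for user-defined simulator classes.
class FrozenLakeSim:
    def __init__(self, slippery: bool = True):
        self.slippery = slippery

class CartPoleSim:
    def __init__(self, gravity: float = 9.8):
        self.gravity = gravity

def make_env(row_info, env_cls=None, env_kwargs=None, env_registry=None):
    """Resolve one env instance per dataset row (sketch of _make_env)."""
    if env_cls is not None:
        # Homogeneous mode: one class, shared kwargs for every rollout.
        return env_cls(**(env_kwargs or {}))
    if env_registry is not None:
        # Heterogeneous mode: the row picks the class via info.env_type
        # and optionally configures it via info.env_kwargs.
        cls = env_registry[row_info["env_type"]]
        return cls(**row_info.get("env_kwargs", {}))
    # Custom mode: a GymEnv subclass would override _make_env instead.
    raise ValueError("no env_cls or env_registry; override _make_env")

registry = {"frozen_lake": FrozenLakeSim, "cartpole": CartPoleSim}
env = make_env(
    {"env_type": "cartpole", "env_kwargs": {"gravity": 3.7}},
    env_registry=registry,
)
```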

Additional features:

  • automatic dataset generation when none is provided (so RL training “just works”),
  • optional evaluation override via a user-supplied eval_runner.
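The eval_runner override could look something like the sketch below: a plain callable that takes an env factory and returns metrics, replacing the default evaluation loop. The `greedy_eval_runner` signature and the `ConstantEnv` class are assumptions for illustration, not the PR's actual interface:

```python
# Trivial env whose every episode ends immediately with reward 1.0.
class ConstantEnv:
    def reset(self, **info):
        return "obs"

    def step(self, action):
        return "obs", 1.0, True, {}

def greedy_eval_runner(make_env, num_episodes: int = 4) -> dict:
    """Hypothetical eval_runner: roll out episodes, report mean reward."""
    total = 0.0
    for _ in range(num_episodes):
        env = make_env()
        env.reset()
        _, reward, done, _ = env.step(0)  # single-step episodes here
        total += reward
    return {"mean_reward": total / num_episodes}

metrics = greedy_eval_runner(ConstantEnv)
```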

GymEnv is fully opt-in and does not affect existing environments.

Type of Change

  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to not work as expected)
  • Documentation update
  • Test improvement

Testing

  • All existing tests pass when running uv run pytest locally.
  • New tests have been added to cover the changes

Checklist

  • My code follows the style guidelines of this project as outlined in AGENTS.md
  • I have performed a self-review of my own code
  • I have commented my code, particularly in hard-to-understand areas
  • I have made corresponding changes to the documentation
  • My changes generate no new warnings
  • Any dependent changes have been merged and published

Additional Notes

  • Dataset & trainer integration
    • Environment requires at least one of dataset or eval_dataset.
      GymEnv now handles this automatically:
      • If you provide a dataset or eval dataset explicitly, they are used as-is.
      • If you provide no datasets and keep auto_dummy_eval=True:
        • We auto-build a training dataset of length num_train_episodes via _build_auto_dataset(...).
        • In homogeneous mode (env_cls set):
          • dataset = this auto dataset.
          • eval_dataset = a 1-row dummy eval set, preserving “episodes mode” behavior where num_examples maps cleanly to rollouts_per_example.
        • In heterogeneous mode (env_registry set):
          • We auto-generate rows where each info contains a valid env_type (round-robin over registry keys) and default env_kwargs={}.
          • This auto dataset is used for both dataset and eval_dataset, ensuring all rows map cleanly into actual env instances.

TBD

  • Demonstrate an end-to-end RL training run using RLTrainer + GymEnv (homogeneous and heterogeneous cases) to validate.

CLAassistant commented Nov 7, 2025

CLA assistant check
Thank you for your submission! We really appreciate it. Like many open source projects, we ask that you all sign our Contributor License Agreement before we can accept your contribution.
2 out of 3 committers have signed the CLA.

✅ hallerite
✅ willccbb
❌ snimu

willccbb (Member) commented Nov 9, 2025

nice! would you wanna do the PR on top of the trajectories branch? #549

there's a bunch of new things which should make native gym-style rollouts much easier, especially for training.

also would be ideal if we still had a notion of a dataset, as this is how trainers typically expect to work with verifiers envs

can be generated at init with a given number of hidden state samples (as in textarena_env)

hallerite (Author) commented:

yes! I'll rebase this on top of that branch; just waiting for the trajectories PR to be finished first.

ronaldnetawat pushed a commit to ronaldnetawat/verifiers that referenced this pull request Nov 13, 2025
* verifiers integration v0.0

* verifiers integration v0.0

* training finishes on reversetext

* reverse_text trains 0.13 -> 0.70

* pin git branch in pyproject

* removed ac in config

* remove DataConfig

* removed print debug

* configurable masks

* rm dataconfig instance

* pinned vf commit, removed extra deps

* registry cleanup, reworked gsm8k, removed default env

* bumped verifiers commit, fixed training divergence vs refactor branch

* vf simple-math

* simple-math edits

* debug

* simple_math train matching reference

* rename math tasks, port to verifiers

* Update configs

* Fix imports

* Add back envs that were accidentically deleted during rebase

* Fix sampling.n missing and unused vars

* Remove redundant log

* Update README with new task names

* Fix hendrycks math config path in README

* Update W&B project names

* Fix wrong config key

* Add missing import

* Fix sampling.n not defined

* Fix missing config key

* Do not tokenize in eval

* Fix typos

* Add 1B and 7B hendryck's and intellect math

* Remove comment

* fix tests and configs (PrimeIntellect-ai#548)

* add tests orch

* fix configs

* fix configs

* fix tests

* fix pydantic ofnig

* fix tetstp

* Dispatch subconfigs via tmp toml file

* Add correct GPU placement for int math run

* More consistent var names

* Set project and model also in orchestrator

* Update readme (PrimeIntellect-ai#550)

* fix readme

* fix scripts

* Parse single-turn prompt and completions tokens/ logprobs from vLLM directly via mock process_env_results function

* Update verifiers rev

* Fix style

* Do not filter if field is missing

---------

Co-authored-by: Mika Senghaas <[email protected]>
Co-authored-by: samsja <[email protected]>