Releases: PrimeIntellect-ai/verifiers
v0.1.5.dev0
What's Changed
- Fix small typos by @anakin87 in #356
- remove constraint on python version by @samsja in #368
- Fix typo in README.md: 'with along' → 'along with' by @CodeSinghh in #378
- Fix: reasoning-gym : match load_environment args to init() args by @code-juicer in #377
- fix/update links by @anakin87 in #372
- Fix `**kwargs` in `load_environment` breaking by @mikasenghaas in #385
- Add `average_reward` column to `make_dataset` by @faresobeid in #365
- Add repeatable --header support to vf-eval for sending additional headers to OpenAI client by @AmeenP in #386
- Updates for ToolEnv + StatefulToolEnv for sandboxes by @willccbb in #384
- Truncate prompt mask of overly long prompts + completions by @nreHieW in #382
- Deserialize function tool call argument before applying chat template by @mikasenghaas in #376
- finish_reason=length if env caused truncation by @cat-state in #360
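The repeatable `--header` flag added in #386 can be illustrated with a small standalone sketch (hypothetical parsing code; the actual vf-eval implementation may differ):

```python
import argparse

# Sketch of a repeatable --header flag like the one added in #386.
# Each occurrence of --header appends another "Name: value" string.
parser = argparse.ArgumentParser(prog="vf-eval")
parser.add_argument(
    "--header",
    action="append",            # repeatable: collects all occurrences
    default=[],
    metavar='"Name: value"',
    help="extra HTTP header for the OpenAI client (may be given multiple times)",
)

args = parser.parse_args(["--header", "X-Org: acme", "--header", "X-Trace: 123"])

# Convert the "Name: value" strings into a dict, the shape typically
# passed to an HTTP client's default-headers option.
headers = dict(h.split(": ", 1) for h in args.header)
print(headers)  # {'X-Org': 'acme', 'X-Trace': '123'}
```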
New Contributors
- @anakin87 made their first contribution in #356
- @CodeSinghh made their first contribution in #378
- @code-juicer made their first contribution in #377
- @faresobeid made their first contribution in #365
- @AmeenP made their first contribution in #386
- @nreHieW made their first contribution in #382
Full Changelog: v0.1.4...v0.1.5.dev0
v0.1.4
Release v0.1.4
TLDR:
- Refactor of `Environment.a_generate` + `Rubric` internals to support interleaved generation and scoring (enabled by default)
- Addresses lots of other small issues + QoL improvements; see below for details
What's Changed
- vf-tui parsing crashes when tool_calls contains JSON strings instead of dicts by @nancyjlau in #250
- Fix eval saving failing on `-n -1` by @mikasenghaas in #255
- Fix missing parser parameter in Rubric instances across environments by @bdsaglam in #276
- fix(tui): escape user content to prevent markup injection issues by @srthkdev in #273
- Make answer + info both optional by @willccbb in #282
- docs(env): clarify optional answer/info fields and evaluation behavior by @srthkdev in #268
- docs(rubric): add documentation for passing class objects to reward functions by @srthkdev in #269
- feat(verifiers): add MathRubric to verifiers module by @srthkdev in #263
- chore(eval): add logging throughout evaluation script for better traceability by @srthkdev in #262
- fix markuperror in completion by @jalexine in #284
- Math python tweak by @willccbb in #286
- fix: add robust function schema parsing by @dhruvrnaik in #285
- disable max turns default by @willccbb in #292
- Fix `max_concurrent_requests` in eval script and also use for rollout scoring by @mikasenghaas in #294
- Rename `max_concurrent_requests` to `max_concurrent` by @mikasenghaas in #295
- Fix log verbosity of third-party packages in eval script by @mikasenghaas in #296
- fix: Propagate tool_call_id in prompt messages by @walln in #318
- Higher timeouts and limits for eval client by @mikasenghaas in #316
- Update init.py for StatefulToolEnv by @stangirala in #306
- Option to find last instance of \boxed{} by @kyleavery in #310
- fix(envs): prevent IndexError by capping dataset selection range by @srthkdev in #314
- fix: correct grammar in README - remove extra word 'in' by @Traddoo in #283
- Optionally init environment as multi-file package by @mikasenghaas in #300
- Added audio modality support by @yurpl in #312
- Will/normalize n eval by @willccbb in #319
- fix(verifiers): add error handling for judge model API calls by @srthkdev in #291
- actions update, publish-environments by @willccbb in #323
- fix(env_utils): add detailed logging to environment loader by @srthkdev in #309
- fix(envs): Async tool call bug in StatefulToolEnv by @bdsaglam in #326
- fix: num_iterations by @ZhichenRen in #320
- refactor(envs): rename push_to_hub to push_to_hf_hub in make_dataset by @srthkdev in #336
- fix(env_utils): improve error message for load_environment function absence by @srthkdev in #335
- orchestration refactor for interleaved generation and scoring by @willccbb in #324
- docs: document environments hub usage patterns by @willccbb in #344
- mcp verifiers env by @cdreetz in #343
- AGENTS.md by @willccbb in #346
- Add ty pre-commit hook by @willccbb in #347
- ARC-AGI-3 environment by @willccbb in #348
- fix typing issues by @willccbb in #355
- Add import regression tests to prevent missing exports by @fsndzomga in #353
- Fix RubricGroup score_rollout parser handling by @willccbb in #357
- v0.1.4 release by @willccbb in #358
New Contributors
- @nancyjlau made their first contribution in #250
- @bdsaglam made their first contribution in #276
- @srthkdev made their first contribution in #273
- @jalexine made their first contribution in #284
- @dhruvrnaik made their first contribution in #285
- @walln made their first contribution in #318
- @stangirala made their first contribution in #306
- @kyleavery made their first contribution in #310
- @Traddoo made their first contribution in #283
- @yurpl made their first contribution in #312
- @ZhichenRen made their first contribution in #320
- @cdreetz made their first contribution in #343
- @fsndzomga made their first contribution in #353
Full Changelog: v0.1.3...v0.1.4
v0.1.3.post0
Release Notes - v0.1.3.post0
Minor release for miscellaneous fixes + small API tweaks which should not impact the vast majority of users, beyond addressing bugs. Full notes + credits deferred to next proper version release.
Changelog:
* 5d887a9 (HEAD -> main) docs -> dev deps
* 9be7d22 (origin/main) Fix setting log level globally (#296)
* 5342a18 Rename `max_concurrent_requests` to `max_concurrent` (#295)
* 9d962b0 Fix `max_concurrent_requests` in eval script and also use for rollout scoring (#294)
* bfc98e3 hotfix for json-serialized info
* b314ce7 disable max turns default (#292)
* d66990c readme
* 061b28a version
* 631257e post0 version
* 4f2a71c t-e pyproj bump
* 5970f67 toxicity_explanation hotfix
* 4558ffe fix: add robust function schema parsing (#285)
* cca161f Math python tweak (#286)
* 94fef56 fix markuperror in completion (#284)
* 4daa4b3 chore(eval): add logging throughout evaluation script for better traceability (#262)
* 2871822 feat(verifiers): add MathRubric to verifiers module (#263)
* 292214b docs(rubric): add documentation for passing class objects to reward functions (#269)
* bfbb311 docs(env): clarify optional answer/info fields and evaluation behavior (#268)
* 6675c8b answer + info both optional (#282)
* 5305b16 fix(tui): escape user content to prevent markup injection issues (#273)
* d64d701 Fix missing parser parameter in Rubric instances across environments (#276)
* fb1b4c1 fix
* b0e2df3 readme
* 615ab08 Fix eval saving failing on `-n -1` (#255)
* 85ae8e4 detect when tool_calls is a list of JSON strings (#250)
* 2106820 (tag: v0.1.3) Release version 0.1.3
v0.1.3
Verifiers v0.1.3 Release Notes
Date: 8/26/25
Verifiers v0.1.3 adds a number of features for expanded functionality and ease of use, along with additional library integrations and bug fixes.
Highlights
- We now have a TUI! 🎉 Run `vf-tui` to interactively browse all locally-saved evaluation results in your terminal.
- Overhauled logging for `vf-eval` evaluation results with tagged JSON artifact folders.
  - Defaults to saving in your environment's project directory under `outputs/` if developing locally; `./outputs` if using an environment installed from elsewhere.
  - The short-lived Markdown report outputs are now deprecated.
- Multimodal-input tasks are supported for evaluation (see `environments/mmmu` for an example)! Official trainer support in verifiers is pending, and can be accessed via HUD's hud-vf-gym project.
- Optional `async` for reward functions, tools, and Environment class methods.
  - `maybe_await` pattern for safe accommodation of both sync and async functions.
  - Sync extensions of `env_response` and `is_completed` in MultiTurnEnv will work, but with a type warning; users are encouraged to migrate these functions to async for ongoing usage.
- Full JSON sampling args in `vf-eval` via `-S` (#240).
- Official community examples library under very active development: prime-environments
- Native `init`/`push`/`pull`/`install` support in prime-cli (and more...)
  - Run `uv tool install prime` for a preview 🙂
- Feature-complete support for training and online evaluations in prime-rl.
- Improved caching and parallelization for JudgeRubric.
- `Rubric.class_objects` values are available to all reward functions by key name.
- Bug fixes for tool call sanitization and saving datasets to Hugging Face.
- Improvements to documentation.
- From the recent `0.1.2.post1` pre-release version:
  - New required dependencies since `0.1.2`: `rich`, `textual`, `jinja`.
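The `maybe_await` pattern mentioned in the highlights can be sketched in plain Python (a minimal illustration of the idea; the library's actual helper may differ):

```python
import asyncio
import inspect

async def maybe_await(func, *args, **kwargs):
    """Call func and await the result only if it is awaitable.

    Lets a single code path accept both sync and async reward
    functions, tools, or environment methods.
    """
    result = func(*args, **kwargs)
    if inspect.isawaitable(result):
        result = await result
    return result

def sync_reward(completion: str) -> float:
    return float("42" in completion)

async def async_reward(completion: str) -> float:
    await asyncio.sleep(0)  # stand-in for an async API call
    return float("42" in completion)

async def main() -> list[float]:
    # Both styles go through the same call site.
    return [
        await maybe_await(sync_reward, "the answer is 42"),
        await maybe_await(async_reward, "no answer"),
    ]

print(asyncio.run(main()))  # [1.0, 0.0]
```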
Thanks to everyone who contributed to this release!
- @lakshyaag (#240, #241)
- @cat-state (#238)
- @qgallouedec (#218, #217)
- @vgel (#201, #196)
- @nathom (#200)
- @snellingio (#195, #194)
- @MarwanMashra (#184)
- @alanxmay

And a special thanks to the entire Prime Intellect team, with PRs this cycle from:
- @JannikSt
- @mikasenghaas
- @samsja
Stay tuned for some big announcements in the coming days 😊
Full Changelog: v0.1.2...v0.1.3
v0.1.2.post1
Verifiers v0.1.2.post1 – Release Notes
Incremental update focused on a new stateful tool environment, environment folder cleanup/renaming, math verification robustness, reporting improvements, and bug fixes.
Highlights
- Stateful tools: add a stateful tool environment and move tool JSON loading into environment responses (PR #224).
- Environments: consolidation/renames for clarity and new environment tags (PR #222 and related changes).
- Lazy imports: training-related libraries are only imported when accessed
- Verification: more robust default math verification (PR #213).
- RL support: enable base-model RL with `message_type="completions"` (PR #201), plus Prime-RL integration and docs (PR #204) and GRPO trainer updates (PR #217, #218).
- Reporting & endpoints: template/report tweaks and endpoint path loading improvements (PR #206, PR #203, plus follow-ups).
- CLI/UX: make `rich` a default dependency for the eval script (PR #200); eval output refinements.
- Fixes: hotfix for sampling args for `gpt-5`.
Changes by Area
CLI and Scripts
- vf-eval
  - Hotfixes
    - Update sampling args for `gpt-5` (hotfix commit).
Environments and Examples
- Add a stateful tool environment; load tool information via environment responses (PR #224).
- Rename and consolidate environments, introduce tag metadata for discoverability (PR #222; additional env tag updates).
- Math environment updates and prompt tweaks.
- Remove dead processing code in `environment.py`; general cleanup and type hint improvements.
Parsers, Rubrics, and Utils
- Caching improvements for JudgeRubric to reduce redundant work (PR #216).
- More robust rule-based math verification and heuristics (PR #213).
- General type-hint and internal cleanup passes.
Training
- Document Prime-RL training (PR #204).
- Minor updates to GRPO trainer (PR #217, #218).
- Add support for base-model RL flows via `message_type="completions"` (PR #201).
Reporting and Tooling
- Report generation and template tweaks (PR #206, PR #203).
- Improve endpoint path loading and related tooling.
Documentation
- README and docs updates (minor) across environments and training utilities; additional guidance for reporting.
Upgrade Notes
- Environment renames/tags: if you reference environment names or use tags in tooling or scripts, review the updated names and tag metadata (PR #222).
Reference Commits (since v0.1.2.post0)
- adding stateful toolenv, moving tool json loading to env_response (PR #224)
- Will/eval outputs (PR #223)
- Update grpo_trainer.py (PR #217, PR #218)
- hotfix for gpt-5 sampling args
- Will/rename envs (PR #222)
- Will/judgerubric caching (PR #216)
- More robust rule-based math verification (PR #213)
- Report tweaks and endpoints path loading (PR #206 and follow-ups)
- Integrate and document prime-rl training (PR #204)
- Update report generation and vf-init template (PR #203)
- Add support for base model RL / `message_type="completions"` (PR #201)
- Add `rich` as default dependency for eval script (PR #200)
- Math env updates, prompt tweaks, type hints, and cleanup in `environment.py`
Full Changelog
v0.1.2.post0...HEAD
v0.1.2.post0
Verifiers v0.1.2.post0 – Release Notes
Minor post-release update focusing on polish: CLI script bug fixes and enhancements, environment example cleanup, better reporting, and improved test coverage.
Highlights
- vf-eval: fixed rollout indexing bugs and improved reliability when sampling multiple rollouts.
- vf-init: streamlined project initialization and naming (removed automatic `vf-` prefix) and refreshed templates.
- Environments: documentation and prompt cleanups; added/updated AIME examples; improved report embedding.
- Tests: expanded coverage across rubric behavior, XML parser, and environment edge cases.
Changes by Area
CLI and Scripts
- vf-eval
- vf-init
  - Remove automatic `vf-` prefix during init to honor provided names (PR #190).
  - Update README template/content for new environments (multiple small tweaks).
Environments and Examples
- AIME 2024 / AIME 2025 updates (PR #199).
- Math Python example: prompt/readme/report cleanups.
- General environment cleanup and README refreshes across multiple examples.
- HotpotQA example: troubleshooting notes and minor fixes.
Parsers, Rubrics, and Utils
- XMLParser: fix handling of string completions during `parse_answer` (PR #196).
- Rubric: ensure error-handling behavior is well-covered by tests (PR #195).
- Reporting: improvements to report generation/embedding (`report_utils`).
- Dataset helpers: include metrics columns in outputs where expected (PR #194).
Tests
- Increase test coverage for:
- Rubric error handling (PR #195).
- XML parser behavior (new tests).
- Environment edge cases and extra scenarios.
Acknowledgements
Thank you to everyone who contributed to this minor release:
If we missed anyone, thank you as well—your contributions are appreciated.
Upgrade Notes
- No breaking API changes.
- When initializing a new environment with `vf-init`, note the name is now used verbatim (no automatic `vf-` prefix, PR #190).
Reference Commits (since v0.1.2)
- Fix XMLParser string completion parsing (PR #196)
- Improve test coverage for Rubric error handling (PR #195)
- Include metrics columns in dataset outputs (PR #194)
- Fix vf-eval rollout index handling (PR #197)
- Remove automatic `vf-` prefix from init (PR #190)
- AIME 2024 / 2025 environments updates (PR #199)
- Environment README/reporting cleanups and misc improvements
Full Changelog
v0.1.2
What's changed
With the v0.1.2 release, verifiers is significantly more production-ready, and stable to build and train with. We appreciate everyone's patience with the changes and bug fixes thus far as we've addressed a number of long-time requests, and are excited to see what you all build with it!
Highlights:
- Proper encapsulation of Environments as standalone modules (see `environments/`), which can contain their own dependencies in a `pyproject.toml`, and need only to expose a `load_environment(...) -> vf.Environment` function in order to be trainable.
- Script flows for initializing (`vf-init`), installing (`vf-install`), and evaluating (`vf-eval`) Environments before training.
- Reorganization of examples and training scripts, removing lots of duplicated logic and creating a cleaner separation between library code and example code.
- Deprecation of the manual dynamically-batched `LLM` inference worker in favor of proper `AsyncLLM` support, allowing full control of native vLLM sampling parameters.
- Support for native tool call parsing + parallel tool calls in `ToolEnv` (replacing the manual `XMLParser` approach).
- Another trainer! Environments built with `verifiers` are now trainable with `prime-rl` (as of 58ac91f for `v0.1.2`), which supports multi-node FSDP async training, is the primary RL framework used by the Prime Intellect research team, and is under ongoing development and stress-testing in advance of large-scale multi-environment training runs.
- Pydantic types for core data classes used by Environments.
- Improvements to `GRPOTrainer`, including support for a single `max_seq_len` option (instead of separate prompt + completion lengths), and configurable turn length limits via `max_tokens`.
- Many more Environment examples.
- Improved logging and evaluation options.
- Overhauled README.md and docs.