Releases: PrimeIntellect-ai/verifiers
v0.1.5.dev0
What's Changed
- Fix small typos by @anakin87 in #356
- remove constraint on python version by @samsja in #368
- Fix typo in README.md: 'with along' → 'along with' by @CodeSinghh in #378
- Fix: reasoning-gym : match load_environment args to init() args by @code-juicer in #377
- fix/update links by @anakin87 in #372
- Fix `**kwargs` in `load_environment` breaking by @mikasenghaas in #385
- Add `average_reward` column to `make_dataset` by @faresobeid in #365
- Add repeatable --header support to vf-eval for sending additional headers to OpenAI client by @AmeenP in #386
- Updates for ToolEnv + StatefulToolEnv for sandboxes by @willccbb in #384
- Truncate prompt mask of overly long prompts + completions by @nreHieW in #382
- Deserialize function tool call argument before applying chat template by @mikasenghaas in #376
- finish_reason=length if env caused truncation by @cat-state in #360
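The repeatable `--header` flag added in #386 can be illustrated with a small standalone sketch (hypothetical parsing code; the actual vf-eval implementation may differ):

```python
import argparse

# Sketch of a repeatable --header flag like the one added in #386.
# Each occurrence of --header appends another "Name: value" string.
parser = argparse.ArgumentParser(prog="vf-eval")
parser.add_argument(
    "--header",
    action="append",            # repeatable: collects all occurrences
    default=[],
    metavar='"Name: value"',
    help="extra HTTP header for the OpenAI client (may be given multiple times)",
)

args = parser.parse_args(["--header", "X-Org: acme", "--header", "X-Trace: 123"])

# Convert the "Name: value" strings into a dict, the shape typically
# passed to an HTTP client's default-headers option.
headers = dict(h.split(": ", 1) for h in args.header)
print(headers)  # {'X-Org': 'acme', 'X-Trace': '123'}
```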
New Contributors
- @anakin87 made their first contribution in #356
- @CodeSinghh made their first contribution in #378
- @code-juicer made their first contribution in #377
- @faresobeid made their first contribution in #365
- @AmeenP made their first contribution in #386
- @nreHieW made their first contribution in #382
Full Changelog: v0.1.4...v0.1.5.dev0
v0.1.4
Release v0.1.4
TLDR:
- Refactor of `Environment.a_generate` + `Rubric` internals to support interleaved generation and scoring (enabled by default)
- Addresses lots of other small issues + QoL improvements; see below for details
What's Changed
- vf-tui parsing crashes when tool_calls contains JSON strings instead of dicts by @nancyjlau in #250
- Fix eval saving failing on `-n -1` by @mikasenghaas in #255
- Fix missing parser parameter in Rubric instances across environments by @bdsaglam in #276
- fix(tui): escape user content to prevent markup injection issues by @srthkdev in #273
- Make answer + info both optional by @willccbb in #282
- docs(env): clarify optional answer/info fields and evaluation behavior by @srthkdev in #268
- docs(rubric): add documentation for passing class objects to reward functions by @srthkdev in #269
- feat(verifiers): add MathRubric to verifiers module by @srthkdev in #263
- chore(eval): add logging throughout evaluation script for better traceability by @srthkdev in #262
- fix markuperror in completion by @jalexine in #284
- Math python tweak by @willccbb in #286
- fix: add robust function schema parsing by @dhruvrnaik in #285
- disable max turns default by @willccbb in #292
- Fix `max_concurrent_requests` in eval script and also use for rollout scoring by @mikasenghaas in #294
- Rename `max_concurrent_requests` to `max_concurrent` by @mikasenghaas in #295
- Fix log verbosity of third-party packages in eval script by @mikasenghaas in #296
- fix: Propagate tool_call_id in prompt messages by @walln in #318
- Higher timeouts and limits for eval client by @mikasenghaas in #316
- Update init.py for StatefulToolEnv by @stangirala in #306
- Option to find last instance of \boxed{} by @kyleavery in #310
- fix(envs): prevent IndexError by capping dataset selection range by @srthkdev in #314
- fix: correct grammar in README - remove extra word 'in' by @Traddoo in #283
- Optionally init environment as multi-file package by @mikasenghaas in #300
- Added audio modality support by @yurpl in #312
- Will/normalize n eval by @willccbb in #319
- fix(verifiers): add error handling for judge model API calls by @srthkdev in #291
- actions update, publish-environments by @willccbb in #323
- fix(env_utils): add detailed logging to environment loader by @srthkdev in #309
- fix(envs): Async tool call bug in StatefulToolEnv by @bdsaglam in #326
- fix: num_iterations by @ZhichenRen in #320
- refactor(envs): rename push_to_hub to push_to_hf_hub in make_dataset by @srthkdev in #336
- fix(env_utils): improve error message for load_environment function absence by @srthkdev in #335
- orchestration refactor for interleaved generation and scoring by @willccbb in #324
- docs: document environments hub usage patterns by @willccbb in #344
- mcp verifiers env by @cdreetz in #343
- AGENTS.md by @willccbb in #346
- Add ty pre-commit hook by @willccbb in #347
- ARC-AGI-3 environment by @willccbb in #348
- fix typing issues by @willccbb in #355
- Add import regression tests to prevent missing exports by @fsndzomga in #353
- Fix RubricGroup score_rollout parser handling by @willccbb in #357
- v0.1.4 release by @willccbb in #358
New Contributors
- @nancyjlau made their first contribution in #250
- @bdsaglam made their first contribution in #276
- @srthkdev made their first contribution in #273
- @jalexine made their first contribution in #284
- @dhruvrnaik made their first contribution in #285
- @walln made their first contribution in #318
- @stangirala made their first contribution in #306
- @kyleavery made their first contribution in #310
- @Traddoo made their first contribution in #283
- @yurpl made their first contribution in #312
- @ZhichenRen made their first contribution in #320
- @cdreetz made their first contribution in #343
- @fsndzomga made their first contribution in #353
Full Changelog: v0.1.3...v0.1.4
v0.1.3.post0
Release Notes - v0.1.3.post0
Minor release for miscellaneous fixes + small API tweaks which should not impact the vast majority of users, beyond addressing bugs. Full notes + credits deferred to next proper version release.
Changelog:
* 5d887a9 (HEAD -> main) docs -> dev deps
* 9be7d22 (origin/main) Fix setting log level globally (#296)
* 5342a18 Rename `max_concurrent_requests` to `max_concurrent` (#295)
* 9d962b0 Fix `max_concurrent_requests` in eval script and also use for rollout scoring (#294)
* bfc98e3 hotfix for json-serialized info
* b314ce7 disable max turns default (#292)
* d66990c readme
* 061b28a version
* 631257e post0 version
* 4f2a71c t-e pyproj bump
* 5970f67 toxicity_explanation hotfix
* 4558ffe fix: add robust function schema parsing (#285)
* cca161f Math python tweak (#286)
* 94fef56 fix markuperror in completion (#284)
* 4daa4b3 chore(eval): add logging throughout evaluation script for better traceability (#262)
* 2871822 feat(verifiers): add MathRubric to verifiers module (#263)
* 292214b docs(rubric): add documentation for passing class objects to reward functions (#269)
* bfbb311 docs(env): clarify optional answer/info fields and evaluation behavior (#268)
* 6675c8b answer + info both optional (#282)
* 5305b16 fix(tui): escape user content to prevent markup injection issues (#273)
* d64d701 Fix missing parser parameter in Rubric instances across environments (#276)
* fb1b4c1 fix
* b0e2df3 readme
* 615ab08 Fix eval saving failing on `-n -1` (#255)
* 85ae8e4 detect when tool_calls is a list of JSON strings (#250)
* 2106820 (tag: v0.1.3) Release version 0.1.3
v0.1.3
Verifiers v0.1.3 Release Notes
Date: 8/26/25
Verifiers v0.1.3 adds a number of features for expanded functionality and ease of use, along with additional library integrations and bug fixes.
Highlights
- We now have a TUI! 🎉 Run `vf-tui` to interactively browse all locally-saved evaluation results in your terminal.
- Overhauled logging for `vf-eval` evaluation results with tagged JSON artifact folders.
  - Defaults to saving in your environment's project directory under `outputs/` if developing locally; `./outputs` if using an environment installed from elsewhere.
  - The short-lived Markdown report outputs are now deprecated.
- Multimodal-input tasks are supported for evaluation (see `environments/mmmu` for an example)! Official trainer support in verifiers is pending, and can be accessed via HUD's hud-vf-gym project.
- Optional `async` for reward functions, tools, and Environment class methods.
  - `maybe_await` pattern for safe accommodation of both sync and async functions.
  - Sync extensions of `env_response` and `is_completed` in MultiTurnEnv will work, but with a type warning; users are encouraged to migrate these functions to async for ongoing usage.
- Full JSON sampling args in `vf-eval` via `-S` (#240).
- Official community examples library under very active development: prime-environments
- Native `init`/`push`/`pull`/`install` support in prime-cli (and more...)
  - Run `uv tool install prime` for a preview 🙂
- Feature-complete support for training and online evaluations in prime-rl.
- Improved caching and parallelization for JudgeRubric.
- `Rubric.class_objects` values are available to all reward functions by key name.
- Bug fixes for tool call sanitization and saving datasets to Hugging Face.
- Improvements to documentation.
- From the recent `0.1.2.post1` pre-release version:
  - New required dependencies since `0.1.2`: `rich`, `textual`, `jinja`.
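The `maybe_await` pattern mentioned in the highlights can be sketched in plain Python (a minimal illustration of the idea; the library's actual helper may differ):

```python
import asyncio
import inspect

async def maybe_await(func, *args, **kwargs):
    """Call func and await the result only if it is awaitable.

    Lets a single code path accept both sync and async reward
    functions, tools, or environment methods.
    """
    result = func(*args, **kwargs)
    if inspect.isawaitable(result):
        result = await result
    return result

def sync_reward(completion: str) -> float:
    return float("42" in completion)

async def async_reward(completion: str) -> float:
    await asyncio.sleep(0)  # stand-in for an async API call
    return float("42" in completion)

async def main() -> list[float]:
    # Both styles go through the same call site.
    return [
        await maybe_await(sync_reward, "the answer is 42"),
        await maybe_await(async_reward, "no answer"),
    ]

print(asyncio.run(main()))  # [1.0, 0.0]
```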
Thanks to everyone who contributed to this release!
- @lakshyaag (#240, #241)
- @cat-state (#238)
- @qgallouedec (#218, #217)
- @vgel (#201, #196)
- @nathom (#200)
- @snellingio (#195, #194)
- @MarwanMashra (#184)
- @alanxmay

And a special thanks to the entire Prime Intellect team, with PRs this cycle from:
- @JannikSt
- @mikasenghaas
- @samsja
Stay tuned for some big announcements in the coming days 😊
Full Changelog: v0.1.2...v0.1.3
v0.1.2.post1
Verifiers v0.1.2.post1 – Release Notes
Incremental update focused on a new stateful tool environment, environment folder cleanup/renaming, math verification robustness, reporting improvements, and bug fixes.
Highlights
- Stateful tools: add a stateful tool environment and move tool JSON loading into environment responses (PR #224).
- Environments: consolidation/renames for clarity and new environment tags (PR #222 and related changes).
- Lazy imports: training-related libraries are only imported when accessed
- Verification: more robust default math verification (PR #213).
- RL support: enable base-model RL with `message_type="completions"` (PR #201), plus Prime-RL integration and docs (PR #204) and GRPO trainer updates (PR #217, #218).
- Reporting & endpoints: template/report tweaks and endpoint path loading improvements (PR #206, PR #203, plus follow-ups).
- CLI/UX: make `rich` a default dependency for the eval script (PR #200); eval output refinements.
- Fixes: hotfix for sampling args for `gpt-5`.
Changes by Area
CLI and Scripts
- vf-eval
  - Hotfixes
    - Update sampling args for `gpt-5` (hotfix commit).
Environments and Examples
- Add a stateful tool environment; load tool information via environment responses (PR #224).
- Rename and consolidate environments, introduce tag metadata for discoverability (PR #222; additional env tag updates).
- Math environment updates and prompt tweaks.
- Remove dead processing code in `environment.py`; general cleanup and type hint improvements.
Parsers, Rubrics, and Utils
- Caching improvements for JudgeRubric to reduce redundant work (PR #216).
- More robust rule-based math verification and heuristics (PR #213).
- General type-hint and internal cleanup passes.
Training
- Document Prime-RL training (PR #204).
- Minor updates to GRPO trainer (PR #217, #218).
- Add support for base-model RL flows via `message_type="completions"` (PR #201).
Reporting and Tooling
- Report generation and template tweaks (PR #206, PR #203).
- Improve endpoint path loading and related tooling.
Documentation
- README and docs updates (minor) across environments and training utilities; additional guidance for reporting.
Upgrade Notes
- Environment renames/tags: if you reference environment names or use tags in tooling or scripts, review the updated names and tag metadata (PR #222).
Reference Commits (since v0.1.2.post0)
- adding stateful toolenv, moving tool json loading to env_response (PR #224)
- Will/eval outputs (PR #223)
- Update grpo_trainer.py (PR #217, PR #218)
- hotfix for gpt-5 sampling args
- Will/rename envs (PR #222)
- Will/judgerubric caching (PR #216)
- More robust rule-based math verification (PR #213)
- Report tweaks and endpoints path loading (PR #206 and follow-ups)
- Integrate and document prime-rl training (PR #204)
- Update report generation and vf-init template (PR #203)
- Add support for base model RL / `message_type="completions"` (PR #201)
- Add `rich` as default dependency for eval script (PR #200)
- Math env updates, prompt tweaks, type hints, and cleanup in `environment.py`
Full Changelog
v0.1.2.post0...HEAD
v0.1.2.post0
Verifiers v0.1.2.post0 – Release Notes
Minor post-release update focusing on polish: CLI script bug fixes and enhancements, environment example cleanup, better reporting, and improved test coverage.
Highlights
- vf-eval: fixed rollout indexing bugs and improved reliability when sampling multiple rollouts.
- vf-init: streamlined project initialization and naming (removed automatic `vf-` prefix) and refreshed templates.
- Environments: documentation and prompt cleanups; added/updated AIME examples; improved report embedding.
- Tests: expanded coverage across rubric behavior, XML parser, and environment edge cases.
Changes by Area
CLI and Scripts
- vf-eval
- vf-init
  - Remove automatic `vf-` prefix during init to honor provided names (PR #190).
  - Update README template/content for new environments (multiple small tweaks).
Environments and Examples
- AIME 2024 / AIME 2025 updates (PR #199).
- Math Python example: prompt/readme/report cleanups.
- General environment cleanup and README refreshes across multiple examples.
- HotpotQA example: troubleshooting notes and minor fixes.
Parsers, Rubrics, and Utils
- XMLParser: fix handling of string completions during `parse_answer` (PR #196).
- Rubric: ensure error-handling behavior is well-covered by tests (PR #195).
- Reporting: improvements to report generation/embedding (`report_utils`).
- Dataset helpers: include metrics columns in outputs where expected (PR #194).
Tests
- Increase test coverage for:
- Rubric error handling (PR #195).
- XML parser behavior (new tests).
- Environment edge cases and extra scenarios.
Acknowledgements
Thank you to everyone who contributed to this minor release:
If we missed anyone, thank you as well—your contributions are appreciated.
Upgrade Notes
- No breaking API changes.
- When initializing a new environment with `vf-init`, note the name is now used verbatim (no automatic `vf-` prefix, PR #190).
Reference Commits (since v0.1.2)
- Fix XMLParser string completion parsing (PR #196)
- Improve test coverage for Rubric error handling (PR #195)
- Include metrics columns in dataset outputs (PR #194)
- Fix vf-eval rollout index handling (PR #197)
- Remove automatic `vf-` prefix from init (PR #190)
- AIME 2024 / 2025 environments updates (PR #199)
- Environment README/reporting cleanups and misc improvements
Full Changelog
v0.1.2
What's changed
With the v0.1.2 release, verifiers is significantly more production-ready, and stable to build and train with. We appreciate everyone's patience with the changes and bug fixes thus far as we've addressed a number of long-time requests, and are excited to see what you all build with it!
Highlights:
- Proper encapsulation of Environments as standalone modules (see `environments/`), which can contain their own dependencies in a `pyproject.toml`, and need only to expose a `load_environment(...) -> vf.Environment` function in order to be trainable.
- Script flows for initializing (`vf-init`), installing (`vf-install`), and evaluating (`vf-eval`) Environments before training.
- Reorganization of examples and training scripts, removing lots of duplicated logic and creating a cleaner separation between library code and example code.
- Deprecation of the manual dynamically-batched `LLM` inference worker in favor of proper `AsyncLLM` support, allowing full control of native vLLM sampling parameters.
- Support for native tool call parsing + parallel tool calls in `ToolEnv` (replacing the manual `XMLParser` approach).
- Another trainer! Environments built with `verifiers` are now trainable with `prime-rl` (as of 58ac91f for `v0.1.2`), which supports multi-node FSDP async training, is the primary RL framework used by the Prime Intellect research team, and is under ongoing development and stress-testing in advance of large-scale multi-environment training runs.
- Pydantic types for core data classes used by Environments.
- Improvements to `GRPOTrainer`, including support for a single `max_seq_len` option (instead of separate prompt + completion lengths), and configurable turn length limits via `max_tokens`.
- Many more Environment examples.
- Improved logging and evaluation options.
- Overhauled README.md and docs.