@mikasenghaas mikasenghaas commented Nov 13, 2025

Description

This PR adds support for resuming an evaluation run via vf-eval using the -R (--resume-from-path) flag, which is meant to be used together with -s and -f. When resuming, it will:

  • Load all completed example_ids from results.jsonl in the specified resume path
  • Skip the already finished rollouts from the eval inputs
  • Append new rollout results to results.jsonl and update the metadata (e.g. timings are summed, avg. reward and metrics are averaged)
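
For illustration, the three steps above could be sketched roughly as follows. This is a minimal sketch and not the code in this PR: the helper names (`load_completed_ids`, `filter_pending`, `append_results`, `merge_metadata`) and the metadata keys (`num_rollouts`, `time_s`, `avg_reward`) are hypothetical, and weighting the average by rollout count is an assumption about how "averaged" is meant. Only `results.jsonl` and `example_id` come from the description.

```python
import json
from pathlib import Path


def load_completed_ids(resume_path: Path) -> set:
    """Collect the example_ids already present in results.jsonl."""
    results_file = resume_path / "results.jsonl"
    if not results_file.exists():
        return set()
    completed = set()
    with results_file.open() as f:
        for line in f:
            line = line.strip()
            if line:  # tolerate blank lines
                completed.add(json.loads(line)["example_id"])
    return completed


def filter_pending(inputs: list, completed: set) -> list:
    """Drop eval inputs whose example_id has already been finished."""
    return [row for row in inputs if row["example_id"] not in completed]


def append_results(resume_path: Path, new_results: list) -> None:
    """Append new rollout results, one JSON object per line."""
    with (resume_path / "results.jsonl").open("a") as f:
        for row in new_results:
            f.write(json.dumps(row) + "\n")


def merge_metadata(old: dict, new: dict) -> dict:
    """Sum timings and average the reward (assumed: weighted by rollout count)."""
    n_old, n_new = old["num_rollouts"], new["num_rollouts"]
    total = n_old + n_new
    return {
        "num_rollouts": total,
        "time_s": old["time_s"] + new["time_s"],
        "avg_reward": (old["avg_reward"] * n_old + new["avg_reward"] * n_new) / total,
    }
```

Keeping `results.jsonl` append-only means an interrupted run never has to rewrite earlier results; only the metadata is recomputed.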

Notes:

  • This is not "full" checkpointing, since we do not recover from mid-trajectory failures, but it's the simplest solution to quickly resume a run. Useful for long-running evals or synthetic data generation.
  • Intermediate saving is only supported at the group level. If you need rollout-level intermediate saving, duplicate the prompts before running vf-eval and run with rollouts_per_example=1.
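
As a rough sketch of the prompt-duplication workaround, assuming the eval inputs are a list of dicts with an `example_id` key: the helper `duplicate_prompts` and the `source_example_id` field are hypothetical and not part of vf-eval, and giving each copy a fresh sequential id is an assumption made so that every copy is tracked as its own group.

```python
def duplicate_prompts(rows: list, rollouts_per_example: int) -> list:
    """Repeat each prompt so that every copy becomes its own group of size 1."""
    expanded = []
    for row in rows:
        for _ in range(rollouts_per_example):
            copy = dict(row)
            # Keep a pointer to the original example (hypothetical field).
            copy["source_example_id"] = row["example_id"]
            # Assign a fresh, unique id so each copy is saved and skipped on its own.
            copy["example_id"] = len(expanded)
            expanded.append(copy)
    return expanded
```

Running the expanded dataset with rollouts_per_example=1 would then save after every single rollout, at the cost of having to regroup results by `source_example_id` afterwards.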

Examples

Default behavior is unchanged

uv run vf-eval math500 -m gpt-5-nano -s

To use the resume feature and save each finished group, specify -s and -f 1. To illustrate, I call sys.exit after every intermediate save. For the initial command, I run:

uv run vf-eval math500 -m gpt-5-nano -n 3 -r 1 -s -f 1

Then, for each subsequent resume from the output directory, I run:

uv run vf-eval math500 -m gpt-5-nano -n 3 -r 1 -s -f 1 -R outputs/evals/math500--gpt-5-nano/67681b1d

Type of Change

  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to not work as expected)
  • Documentation update
  • Test improvement

Testing

  • All existing tests pass when running uv run pytest locally.
  • New tests have been added to cover the changes

Checklist

  • My code follows the style guidelines of this project as outlined in AGENTS.md
  • I have performed a self-review of my own code
  • I have commented my code, particularly in hard-to-understand areas
  • I have made corresponding changes to the documentation
  • My changes generate no new warnings
  • Any dependent changes have been merged and published

Additional Notes

@mikasenghaas mikasenghaas marked this pull request as ready for review November 13, 2025 22:29