Allow resuming vf-eval #557

mikasenghaas · 2025-11-13T22:09:28Z

Description

This PR adds support for resuming an evaluation run via vf-eval with the -R (--resume-from-path) flag, which is meant to be used together with -s and -f. When resuming, it will:

Load all completed example_ids from results.jsonl in the specified resume path
Skip the already finished rollouts from the eval inputs
Append new rollout results to results.jsonl and update the metadata (e.g. timings are summed, avg. reward and metrics are averaged)
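The three steps above can be sketched roughly as follows. This is a minimal illustration, not the code in this PR: only `results.jsonl` and `example_id` come from the description, while the function names and the metadata keys (`num_examples`, `time_ms`, `avg_reward`) are hypothetical. Averaging the reward weighted by example count is also an assumption about what "averaged" means here.

```python
import json
from pathlib import Path


def load_completed_ids(results_path: Path) -> set:
    """Collect the example_ids already present in results.jsonl."""
    if not results_path.exists():
        return set()
    completed = set()
    with results_path.open() as f:
        for line in f:
            line = line.strip()
            if line:  # tolerate blank lines
                completed.add(json.loads(line)["example_id"])
    return completed


def filter_pending(inputs: list, completed: set) -> list:
    """Drop eval inputs whose example_id has already been saved."""
    return [row for row in inputs if row["example_id"] not in completed]


def append_results(results_path: Path, new_rows: list) -> None:
    """Append new rollout results, one JSON object per line."""
    with results_path.open("a") as f:
        for row in new_rows:
            f.write(json.dumps(row) + "\n")


def merge_metadata(old: dict, new: dict) -> dict:
    """Sum timings and average the reward (hypothetical keys).

    Assumption: the average is weighted by the number of examples
    in each partial run.
    """
    total = old["num_examples"] + new["num_examples"]
    weighted = (
        old["avg_reward"] * old["num_examples"]
        + new["avg_reward"] * new["num_examples"]
    )
    return {
        "num_examples": total,
        "time_ms": old["time_ms"] + new["time_ms"],
        "avg_reward": weighted / total,
    }
```

Keeping the completed ids in a set keeps the skip check cheap even for large eval inputs, and appending to the file leaves previously saved results untouched.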

Notes:

This is not "full" checkpointing, since we do not recover from mid-trajectory failures, but it's the simplest solution to quickly resume a run. Useful for long-running evals or synthetic data generation.
Intermediate saving is only supported at the group level. If you need rollout-level intermediate saving, duplicate the prompts before running vf-eval and run with rollouts_per_example=1.
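As a rough sketch of the prompt-duplication workaround: the helper below expands each prompt into several single-rollout rows. The function name and the `source_example_id` and `rollout_idx` fields are hypothetical, and it assumes each copy needs its own unique `example_id` so that resuming treats the copies independently. The PR does not confirm that detail.

```python
def duplicate_prompts(rows: list, rollouts_per_example: int) -> list:
    """Expand each row into `rollouts_per_example` single-rollout rows."""
    expanded = []
    for row in rows:
        for rollout_idx in range(rollouts_per_example):
            copy = dict(row)  # shallow copy; the original row is not mutated
            # Assumption: a fresh, unique id per copy so each one is
            # saved and skipped on its own.
            copy["example_id"] = len(expanded)
            copy["source_example_id"] = row["example_id"]
            copy["rollout_idx"] = rollout_idx
            expanded.append(copy)
    return expanded
```

The expanded dataset would then be evaluated with rollouts_per_example=1, so every saved group contains exactly one rollout.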

Examples

Default behavior is unchanged

uv run vf-eval math500 -m gpt-5-nano -s

To use the resume feature and save each finished group, specify -s and -f 1. To illustrate, I call sys.exit after every intermediate save. The initial command is:

uv run vf-eval math500 -m gpt-5-nano -n 3 -r 1 -s -f 1

Then, for each subsequent resume, I pass the output directory via -R:

uv run vf-eval math500 -m gpt-5-nano -n 3 -r 1 -s -f 1 -R outputs/evals/math500--gpt-5-nano/67681b1d

Type of Change

Bug fix (non-breaking change which fixes an issue)
New feature (non-breaking change which adds functionality)
Breaking change (fix or feature that would cause existing functionality to not work as expected)
Documentation update
Test improvement

Testing

All existing tests pass when running uv run pytest locally.
New tests have been added to cover the changes

Checklist

My code follows the style guidelines of this project as outlined in AGENTS.md
I have performed a self-review of my own code
I have commented my code, particularly in hard-to-understand areas
I have made corresponding changes to the documentation
My changes generate no new warnings
Any dependent changes have been merged and published

Additional Notes

mikasenghaas added 9 commits November 13, 2025 21:37

First draft of resume vf-eval from outputs path

9739379

Add rollout id to generate outputs

46713ab

Add rollout id into make_dataset as well

46b5a53

Append/ update rollout results if they already exist

5017c79

Filter on group level

776f7d5

Fix log

965dfd9

Exit if not inputs left

3f49597

Fix tests

0664df8

Remove debug stuff

0a60f47

mikasenghaas marked this pull request as ready for review November 13, 2025 22:29

mikasenghaas added 2 commits November 14, 2025 01:26

Deduplicate to prevent double saving of samples

5e861d2

Silent filter

2b43c7e
