Update inference to resume from temporary result file when possible #1734
Conversation
Pull Request Overview
This PR introduces a resume mechanism for inference jobs by leveraging temporary scratch files, reducing redundant computations after failures. Key changes include refactoring tests to use the unified infer method, adding scratch file handling in multiple inference engines, and integrating resume logic in the base inference engine.
Reviewed Changes
Copilot reviewed 10 out of 10 changed files in this pull request and generated 1 comment.
| File | Description |
|---|---|
| tests/unit/inference/test_vllm_inference_engine.py | Removed direct `infer_from_file` call to use the unified `infer` method for resumed inference |
| tests/unit/inference/test_remote_inference_engine.py | Updated inference invocation to reflect the new resume mechanism |
| tests/unit/inference/test_generation_params.py | Added mocks for `tokenizer.batch_decode` to support new generation behavior |
| tests/unit/inference/test_base_inference_engine.py | Added tests to validate scratch file creation, resume logic, and cleanup behavior |
| tests/integration/infer/test_infer.py | Adjusted conversation comparison and removed redundant GPU marker |
| src/oumi/inference/{vllm, native_text, llama_cpp}_inference_engine.py | Removed direct cleanup calls to centralize scratch file handling in the base engine |
| src/oumi/core/inference/base_inference_engine.py | Integrated scratch file loading, filtering of completed conversations, and deterministic conversation ID generation (see the sketch after this table) |
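As a point of reference, here is a minimal sketch of how a deterministic conversation ID could be derived from conversation content; the hashing scheme and message layout below are illustrative assumptions, not the PR's actual implementation:

```python
import hashlib
import json


def deterministic_conversation_id(messages: list[dict]) -> str:
    """Derive a stable ID from a conversation's content so that a rerun of the
    same input maps to the same record in the scratch file.

    Illustrative sketch; the PR's actual hashing scheme may differ.
    """
    # Serialize deterministically: sorted keys, fixed separators.
    payload = json.dumps(messages, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


# The same content always yields the same ID across runs.
msgs = [{"role": "user", "content": "Hello"}]
assert deterministic_conversation_id(msgs) == deterministic_conversation_id(msgs)
```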
Comments suppressed due to low confidence (2)
tests/unit/inference/test_vllm_inference_engine.py:429
- The direct call to infer_from_file was removed; ensure that the unified infer method adequately tests the resume-from-scratch behavior and that all expected outcomes are validated elsewhere in the suite.
- result = engine.infer_from_file(str(input_path), inference_config)
src/oumi/core/inference/base_inference_engine.py:309
- The computation of the inference_hash in _get_scratch_filepath depends on _dataset_hash; ensure that _dataset_hash is correctly computed in all cases (even when the input conversation list is empty) to avoid potential hash collisions or unexpected file paths.
return str(Path.cwd() / "tmp" / f"temp_inference_output_{inference_hash}.jsonl")
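To make the concern concrete, here is a hedged sketch of how such a path could be derived, including the empty-input case; the helper bodies are assumptions for illustration (only the returned path format comes from the line quoted above):

```python
import hashlib
from pathlib import Path


def _dataset_hash(conversations: list) -> str:
    """Hash the inference inputs; an empty list still produces a stable,
    well-defined digest rather than an undefined value."""
    hasher = hashlib.sha256()
    for conversation in conversations:
        hasher.update(repr(conversation).encode("utf-8"))
    return hasher.hexdigest()[:16]


def _get_scratch_filepath(conversations: list) -> str:
    """Build a deterministic scratch path from the dataset hash."""
    inference_hash = _dataset_hash(conversations)
    return str(Path.cwd() / "tmp" / f"temp_inference_output_{inference_hash}.jsonl")


print(_get_scratch_filepath([]))      # stable path even for an empty input list
print(_get_scratch_filepath(["hi"]))  # different inputs -> different path
```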
Co-authored-by: Copilot <[email protected]>
Could you confirm that this was tested? Ideally both a local and remote inference engine are tested.
Can confirm with a local engine by manually throwing an exception mid-run.
Description
Inference jobs already write results to a temporary file as they run, but they should also resume from that file when the job is restarted.
This PR loads the scratch file before inference, skips any inputs that have already been completed, runs inference on the remaining inputs, reads all results back from the scratch file, cleans up the scratch file, and returns.
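A minimal sketch of that flow, assuming a JSONL scratch file keyed by deterministic conversation IDs; `infer_with_resume`, the record layout, and the placeholder model call are illustrative assumptions rather than the engine's actual code:

```python
import json
from pathlib import Path


def infer_with_resume(inputs: dict[str, str], scratch_path: Path) -> list[dict]:
    """Run inference over `inputs` (id -> prompt), resuming from `scratch_path`."""
    # 1. Load any results left behind by a previous, interrupted run.
    completed: dict[str, dict] = {}
    if scratch_path.exists():
        with scratch_path.open() as f:
            completed = {rec["id"]: rec for rec in map(json.loads, f)}

    # 2. Run inference only for inputs not already in the scratch file,
    #    appending each result as soon as it is produced.
    scratch_path.parent.mkdir(parents=True, exist_ok=True)
    with scratch_path.open("a") as f:
        for conv_id, prompt in inputs.items():
            if conv_id in completed:
                continue  # Skip work finished before the restart.
            result = {"id": conv_id, "output": f"response to {prompt!r}"}  # placeholder model call
            f.write(json.dumps(result) + "\n")
            f.flush()  # Keep partial progress durable for future resumes.

    # 3. Read all results (old + new) back, clean up the scratch file, and return.
    with scratch_path.open() as f:
        results = [json.loads(line) for line in f]
    scratch_path.unlink()
    return results
```

Restarting after a failure simply calls the same function again; only the IDs missing from the scratch file are recomputed.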
Related issues
Towards OPE-1307
Fixes OPE-1257
Before submitting
Reviewers
At least one review from a member of oumi-ai/oumi-staff is required.