
@LINYV0719

Description

This PR addresses the TODO in orbit/controller.py to support steps=-1 in Controller.train(), allowing training to run until the underlying dataset is exhausted.

Motivation: Previously, Controller.train required a fixed number of steps. This change lets users train for a full epoch (or until the dataset runs out) without knowing the exact dataset size beforehand, a common situation when using tf.data.Dataset.

Changes:
- Modified the Controller.train loop condition to accept steps=-1.
- Added a try-except block that catches tf.errors.OutOfRangeError and StopIteration raised during _train_n_steps, so the loop exits gracefully when the iterator is exhausted instead of crashing.
- Added logic to break out of the loop when global_step advances by fewer steps than requested, which also indicates that the dataset is exhausted.
- Added a new test case, test_train_until_exhaustion, in orbit/controller_test.py to verify this behavior using a finite dataset.
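
The loop behavior described above can be sketched in plain Python. This is only an illustration under stated assumptions, not the code in this PR: train_until_exhausted, run_chunk, and steps_per_loop are hypothetical names, the iterator is an ordinary Python iterator rather than a tf.data one, and the two exhaustion signals (the caught exception and the step shortfall) are folded into a single check on the number of steps completed.

```python
from typing import Callable, Iterator


def run_chunk(data_iter: Iterator, train_step: Callable, n: int) -> int:
    """Runs up to n training steps and returns how many actually completed."""
    done = 0
    for _ in range(n):
        try:
            batch = next(data_iter)
        except StopIteration:
            # Iterator is exhausted. With a tf.data iterator one would
            # presumably also catch tf.errors.OutOfRangeError here.
            break
        train_step(batch)
        done += 1
    return done


def train_until_exhausted(
    data_iter: Iterator,
    train_step: Callable,
    steps: int = -1,
    steps_per_loop: int = 2,
) -> int:
    """Trains in chunks; steps=-1 means run until data_iter is exhausted.

    Hypothetical helper for illustration; it does not mirror the Orbit API.
    Returns the total number of steps completed.
    """
    global_step = 0
    while steps == -1 or global_step < steps:
        # Never overshoot a finite step budget.
        if steps == -1:
            expected = steps_per_loop
        else:
            expected = min(steps_per_loop, steps - global_step)
        done = run_chunk(data_iter, train_step, expected)
        global_step += done
        if done < expected:
            # Fewer steps than requested: the data ran out, so stop.
            break
    return global_step
```

For example, calling train_until_exhausted(iter(range(5)), print) with the defaults processes all five elements and then returns, while passing steps=3 stops after three steps even though data remains.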

Type of change

For a new feature or function, please create an issue first to discuss it
with us before submitting a pull request.

Note: Please delete options that are not relevant.

  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)

Tests

I verified the changes by running the new test case and existing tests.

Test Configuration:

OS: Windows 11
Python Version: 3.10
Command: python -m orbit.controller_test
Result: Passed. Specifically, test_train_until_exhaustion passed with the expected behavior.

Checklist

@LINYV0719 requested a review from a team as a code owner on December 29, 2025, 13:05