Elastic Training Fast Resume #1769
Draft
Description
These changes implement the fast resume mode of elastic training, in which training pauses when a slice is lost and resumes once the slice returns. The goal is to make workload startup after a hardware failure much quicker, because there is no need to communicate with GCS to restore from a checkpoint. This is achieved by restarting only the affected slices rather than the whole workload.
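The pause-and-resume behavior described above can be illustrated with a minimal sketch. It is not the code in this PR: `FakeCluster`, `SliceLostError`, `all_slices_ready`, and `train` are hypothetical names invented for illustration, and the "training state" is a plain integer kept in host memory. The point is only that the loop retries the failed step after the slice returns, without reloading any checkpoint.

```python
import time
from dataclasses import dataclass


class SliceLostError(RuntimeError):
    """Hypothetical error raised when a slice becomes unavailable."""


@dataclass
class FakeCluster:
    """Stand-in for a multi-slice cluster that loses a slice once."""
    fail_at_step: int
    recovery_polls: int = 2  # readiness polls needed before the slice is back
    _down: bool = False
    _polls_left: int = 0
    _failed_once: bool = False

    def run_step(self, step: int, state: int) -> int:
        # Simulate a single slice failure at the configured step.
        if step == self.fail_at_step and not self._failed_once:
            self._failed_once = True
            self._down = True
            self._polls_left = self.recovery_polls
            raise SliceLostError(f"slice lost at step {step}")
        return state + 1  # pretend one optimizer update happened

    def all_slices_ready(self) -> bool:
        # Each poll brings the lost slice one step closer to being ready.
        if self._down:
            self._polls_left -= 1
            if self._polls_left <= 0:
                self._down = False
        return not self._down


def train(cluster: FakeCluster, num_steps: int, poll_interval_s: float = 0.0):
    """Run num_steps, pausing in place when a slice is lost."""
    state, step, pauses = 0, 0, 0
    while step < num_steps:
        try:
            state = cluster.run_step(step, state)
            step += 1
        except SliceLostError:
            # Fast resume: keep the in-memory state, wait for the slice,
            # then retry the same step. No checkpoint restore happens here.
            pauses += 1
            while not cluster.all_slices_ready():
                time.sleep(poll_interval_s)
    return state, step, pauses


if __name__ == "__main__":
    print(train(FakeCluster(fail_at_step=3), num_steps=5))
```

In this sketch the process itself never exits, which mirrors the idea of not restarting the workload; how the real implementation detects slice loss and re-initializes the returning slice is outside what the sketch assumes.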
Tests
This has not been thoroughly tested at this stage, but it has successfully fast resumed from a simulated slice failure.
Checklist
Before submitting this PR, please make sure (put X in square brackets):