Add --megascale_abort_on_hangs flag for multi-slice TPU jobs #731
base: main
* Introduce flag to terminate jobs on MegaScale Runtime Errors
* Enable auto-restart of jax process when errors occur
* Prevent silent hangs in multi-slice TPU configurations
* Reduce time to recovery for failed jobs
* ref: apple#716
* co-authored by Nick Stogner <[email protected]>
Please don't merge yet. Kyle is helping us test this.
Tested internally by scheduling a multi-slice v5p job in the internal test area. The job was able to make progress, and the flag was set for the job using Isaack's branch of axlearn.
# Enabling this flag will allow for termination of the job, triggering
# the process to exit. This is set to true to prevent the job from
# silently hanging and to reduce time to recovery.
megascale_abort_on_hangs="true",
Is it an XLA flag? Curious, since other XLA flags have an xla_ prefix.
This is not an XLA compiler flag, but rather a libtpu runtime flag. As long as it is eventually passed into LIBTPU_INIT_ARGS, it should work.
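For illustration, here is a minimal sketch of how a runtime flag like this could end up in `LIBTPU_INIT_ARGS`. The helper names `format_libtpu_args` and `append_libtpu_init_args` are hypothetical and are not part of axlearn or libtpu; the sketch only assumes that libtpu reads space-separated `--name=value` pairs from that environment variable when it initializes.

```python
import os
from typing import Mapping, MutableMapping, Optional


def format_libtpu_args(flags: Mapping[str, str]) -> str:
    """Render a flag mapping as space-separated --name=value pairs."""
    return " ".join(f"--{name}={value}" for name, value in flags.items())


def append_libtpu_init_args(
    flags: Mapping[str, str],
    env: Optional[MutableMapping[str, str]] = None,
) -> str:
    """Append flags to LIBTPU_INIT_ARGS in `env` (defaults to os.environ).

    Hypothetical helper: existing arguments are preserved and the new
    ones are added after them.
    """
    env = os.environ if env is None else env
    existing = env.get("LIBTPU_INIT_ARGS", "")
    added = format_libtpu_args(flags)
    # Skip empty pieces so we never emit leading or trailing spaces.
    merged = " ".join(part for part in (existing, added) if part)
    env["LIBTPU_INIT_ARGS"] = merged
    return merged


if __name__ == "__main__":
    # Use a plain dict here so the example does not touch the real environment.
    fake_env = {}
    print(append_libtpu_init_args({"megascale_abort_on_hangs": "true"}, fake_env))
```

In a real job the variable would presumably need to be set before the TPU backend is initialized, since the runtime reads it at startup.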
IIRC, this won't work with AOT compilation. Could you test the AOT compilation script run_aot_compilation.py to confirm?
The reason I ask is that the other megascale flags I have used don't work with AOT compilation.
If it doesn't work with AOT compilation, we can move the megascale flag to launch.py.
Also, do we know if there is a list of libtpu-only (non-XLA) flags, maybe with a brief description of what they do?
BTW, thanks a lot for working on this! Getting the hanging situation improved is super valuable.
Based on recent discussion,
@Ethanlm @markblee PTAL