Workload Hang & Rolling Window Goodput Monitoring Support #1278
base: main
Conversation
Co-authored-by: Ruoming Pang <[email protected]>
Left some nit comments. Overall LGTM. I am going to do e2e testing before approval.
record_event_start(*args, **kwargs)
except (TypeError, ValueError, RuntimeError) as e:
    logging.warning(
        "Failed to record start of event %s. Error: %s", event.name, e, exc_info=True
event.name should be event.value.
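A suggested fix for the quoted warning (assuming event is an enum whose value carries the intended label):

logging.warning(
    "Failed to record start of event %s. Error: %s", event.value, e, exc_info=True
)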
record_event_end(*args, **kwargs)
except (TypeError, ValueError, RuntimeError) as e:
    logging.warning(
        "Failed to record end of event %s. Error: %s", event.name, e, exc_info=True
Ditto, should use event.value instead of event.name.
try:
    if record_event_end:
        record_event_end(*args, **kwargs)
except (TypeError, ValueError, RuntimeError) as e:
nit: use RuntimeError?
LGTM.
Manually tested and bump-tested.
@@ -4,7 +4,7 @@

 Example:

-# Enable Goodput when launching an AXLearn training job
+# Enable Goodput when launching an AxLearn training job
Unintended change?
enable_gcp_goodput_metrics: bool = True
enable_pathways_goodput: bool = False
include_badput_breakdown: bool = True
enable_rolling_window_goodput_monitoring: bool = False
rolling_window_size: Sequence[int] = ()
We try to avoid adding these kinds of bool feature flags in the API because they quickly become unmaintainable as we need to account for the cross product of all bool interactions.
- Can enable_rolling_window_goodput_monitoring be inferred from whether len(rolling_window_size) > 0? (See the sketch below.)
- Can enable_pathways_goodput be inferred automatically if we detect the proxy backend?
- When would we want include_badput_breakdown to be False?
- When would we want enable_gcp_goodput_metrics to be False? Should the user simply not configure this recorder if goodput metrics are not desired?
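A minimal sketch of the inference suggested in the first question above (the helper name is hypothetical, not code from the PR):

from typing import Sequence

def rolling_window_monitoring_enabled(rolling_window_size: Sequence[int]) -> bool:
    # A non-empty list of window sizes implies the user wants rolling-window
    # goodput monitoring, so a separate boolean flag is unnecessary.
    return len(rolling_window_size) > 0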
Overview
This PR introduces significant enhancements to our Goodput measurement system, focusing on improved monitoring capabilities and robustness for long-running AxLearn training jobs.
Specifically, it adds:
- Workload hang monitoring.
- Rolling window Goodput monitoring.
We also switch to context managers to demarcate the start and end of recording and monitoring events. These features aim to provide more granular, real-time insights into training efficiency and to promptly identify potential issues, leading to more stable and performant training workflows.
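A minimal sketch of the context-manager pattern described above, assuming a recorder that exposes record_event_start/record_event_end and an enum-valued event (the actual API in the PR may differ):

import contextlib
import logging

@contextlib.contextmanager
def record_event(recorder, event, *args, **kwargs):
    # Record the start of the event; recording failures must never break training.
    try:
        recorder.record_event_start(event, *args, **kwargs)
    except (TypeError, ValueError, RuntimeError) as e:
        logging.warning(
            "Failed to record start of event %s. Error: %s", event.value, e, exc_info=True
        )
    try:
        yield
    finally:
        # Always record the end of the event, with the same error handling.
        try:
            recorder.record_event_end(event, *args, **kwargs)
        except (TypeError, ValueError, RuntimeError) as e:
            logging.warning(
                "Failed to record end of event %s. Error: %s", event.value, e, exc_info=True
            )

Usage (names are illustrative):

with record_event(recorder, GoodputEvent.JOB):
    run_training()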
Testing
These monitoring features have been rigorously tested on long-running AxLearn training jobs to ensure their stability and accuracy under realistic conditions.
Specifically, validation runs were performed on Fuji 7B and test models; example results: