Workload Hang & Rolling Window Goodput Monitoring Support #1278

Open

dipannita08 wants to merge 20 commits into apple:main from dipannita08:main

+655 −527

Contributor

dipannita08 commented Jun 27, 2025

Overview

This PR introduces significant enhancements to our Goodput measurement system, focusing on improved monitoring capabilities and robustness for long-running AxLearn training jobs.

Specifically, it adds:

Workload Hang Monitoring Support: Detects and flags instances where the training workload might be stuck or unresponsive.
Rolling Window Goodput Monitoring Support: Provides insights into goodput over dynamic, recent timeframes, complementing cumulative goodput metrics.
Final Upload & Safe Exit: Ensures that all outstanding metrics are flushed and monitoring processes gracefully terminate upon job completion or shutdown.
Updated Documents & Dashboard Templates: Documentation is updated with new metrics and points to custom dashboard templates for GCM.
More Disruption, Checkpointing & GCS Badput Breakdown: Adds more badput buckets covering disruption, checkpointing, and GCS.
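
To illustrate how rolling window goodput differs from the cumulative metric, here is a minimal sketch. It assumes goodput is defined as productive time divided by elapsed wall-clock time, and that productive time is available as non-overlapping (start, end) intervals in seconds. The function name and the interval representation are hypothetical and are not taken from this PR.

```python
from typing import Sequence, Tuple

# Hypothetical representation: (start, end) timestamps in seconds.
Interval = Tuple[float, float]


def rolling_window_goodput(
    productive: Sequence[Interval], *, now: float, window_seconds: float
) -> float:
    """Returns the fraction of the last `window_seconds` spent in productive intervals.

    Assumes the intervals do not overlap each other.
    """
    if window_seconds <= 0:
        raise ValueError("window_seconds must be positive.")
    window_start = now - window_seconds
    productive_time = 0.0
    for start, end in productive:
        # Clip each interval to the window; intervals outside it contribute nothing.
        overlap = min(end, now) - max(start, window_start)
        if overlap > 0:
            productive_time += overlap
    return productive_time / window_seconds


# Productive during [0, 50) and [80, 100); idle in between.
intervals = [(0.0, 50.0), (80.0, 100.0)]
recent = rolling_window_goodput(intervals, now=100.0, window_seconds=40.0)
overall = rolling_window_goodput(intervals, now=100.0, window_seconds=100.0)
```

With these inputs the 40-second window only sees the second interval, so `recent` is lower than `overall`. That is the kind of recent regression a cumulative number can mask.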

We also switch to context managers to demarcate the start and end of recording and monitoring events. These features aim to provide more granular, real-time insights into training efficiency and to identify potential issues promptly, leading to more stable and performant training workflows.
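
As a rough sketch of the context-manager pattern described above (not the PR's actual implementation), the following wraps a block of work with start and end callbacks. The helper name `record_event` and its parameters are hypothetical. The caught exception types mirror the ones visible in the reviewed diff excerpts.

```python
import contextlib
import logging
from typing import Callable, Iterator, Optional


# Hypothetical helper; the name and signature are illustrative, not the PR's API.
@contextlib.contextmanager
def record_event(
    event_name: str,
    *,
    on_start: Optional[Callable[[], None]] = None,
    on_end: Optional[Callable[[], None]] = None,
) -> Iterator[None]:
    """Calls on_start before the body and on_end after it, even if the body raises."""
    try:
        if on_start:
            on_start()
    except (TypeError, ValueError, RuntimeError) as e:
        # A failed recording should not take down the training job.
        logging.warning("Failed to record start of event %s. Error: %s", event_name, e)
    try:
        yield
    finally:
        try:
            if on_end:
                on_end()
        except (TypeError, ValueError, RuntimeError) as e:
            logging.warning("Failed to record end of event %s. Error: %s", event_name, e)
```

A caller would write `with record_event("step", on_start=..., on_end=...):` around the work to be measured. Because the end callback runs in a `finally` clause, the end of the event is recorded even when the body raises.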

Testing

These monitoring features have been rigorously tested on long-running AxLearn training jobs to ensure their stability and accuracy under realistic conditions.

Validation runs used Fuji 7B and test models. Example results:

Rolling Window Goodput [1d, 3d, 5d]:
Workload Hang Monitoring:
Cumulative Goodput:
Cumulative Badput:

dipannita08 and others added 20 commits

October 21, 2024 23:44


          Code clean up

bcd8618


          Add goodput and badput monitoring support to AxLearn

d8474f7


          Merge remote-tracking branch 'upstream/main'

5c62244


          Merge remote-tracking branch 'upstream/main'

1dff92c


          Add more testing


          Address comments

f5d6a37


          Fix docstrings


          Remove recorder calls from trainer for now

8d0c58d


          Code cleanup gcp/measurement.py

31eb0e1

Co-authored-by: Ruoming Pang <[email protected]>


          Code cleanup common/measurement.py

Co-authored-by: Ruoming Pang <[email protected]>


          Merge remote-tracking branch 'upstream/main'

380dcac


          Fix pre commit errors

0e9f4dc


          Adding more tests

7bd0fc8


          Merge remote-tracking branch 'upstream/main'

62bf113


          Further clean up

eeef352


          Fix a test error

878a26e


          Merge remote-tracking branch 'upstream/main'

42d9445


          Merge remote-tracking branch 'upstream/main'

b31c5c4


          Add workload hang monitoring & rolling window goodput support


          Merge remote-tracking branch 'upstream/main'

30586de

dipannita08 requested review from a team, ruomingp and markblee as code owners

June 27, 2025 23:47

findmyway self-assigned this

amcw7777 reviewed

Contributor

amcw7777 left a comment

Left some nit comments. Overall LGTM. I am going to do e2e testing before approval.

axlearn/cloud/gcp/measurement.py

+                              record_event_start(*args, **kwargs)
+                      except (TypeError, ValueError, RuntimeError) as e:
+                          logging.warning(
+                              "Failed to record start of event %s. Error: %s", event.name, e, exc_info=True

Contributor

amcw7777 Jun 30, 2025

event.name should be event.value.
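
For context on this comment, a Python `enum.Enum` member exposes both `.name` (the identifier) and `.value` (the assigned value), and the two can differ. The enum below is a hypothetical stand-in; its member names and values are assumptions, not the PR's actual event enum.

```python
import enum


class Event(enum.Enum):
    # Hypothetical members for illustration only.
    START_STEP = "step"
    START_JOB = "job"


event = Event.START_STEP
print(event.name)   # START_STEP
print(event.value)  # step
```

Logging `event.value` would therefore print the assigned string rather than the member identifier.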

axlearn/cloud/gcp/measurement.py

+                                  record_event_end(*args, **kwargs)
+                          except (TypeError, ValueError, RuntimeError) as e:
+                              logging.warning(
+                                  "Failed to record end of event %s. Error: %s", event.name, e, exc_info=True

Contributor

amcw7777 Jun 30, 2025

Ditto, should use event.value instead of event.name

axlearn/cloud/gcp/measurement.py

+                          try:
+                              if record_event_end:
+                                  record_event_end(*args, **kwargs)
+                          except (TypeError, ValueError, RuntimeError) as e:

Contributor

amcw7777 Jun 30, 2025

nit: use RuntimeError?

amcw7777 approved these changes

Contributor

amcw7777 left a comment

LGTM.
Manually tested and bumped tested.

markblee reviewed

axlearn/cloud/gcp/measurement.py

@@ -4,7 +4,7 @@
                   Example:
-                  # Enable Goodput when launching an AXLearn training job
+                  # Enable Goodput when launching an AxLearn training job

Contributor

markblee Jul 1, 2025

Unintended change?

axlearn/cloud/gcp/measurement.py

Comment on lines +57 to +61

+                      enable_gcp_goodput_metrics: bool = True
+                      enable_pathways_goodput: bool = False
+                      include_badput_breakdown: bool = True
+                      enable_rolling_window_goodput_monitoring: bool = False
+                      rolling_window_size: Sequence[int] = ()

Contributor

markblee Jul 1, 2025

We try to avoid adding these kinds of bool feature flags in the API because they quickly become unmaintainable as we need to account for the cross product of all bool interactions.

Can enable_rolling_window_goodput_monitoring be inferred from whether len(rolling_window_size) > 0?
Can enable_pathways_goodput be inferred automatically if we detect proxy backend?
When would we want include_badput_breakdown to be False?
When would we want enable_gcp_goodput_metrics to be False? Should the user simply not configure this recorder if goodput metrics are not desired?
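
One way to read the first suggestion, sketched under assumptions: drop the separate bool and derive it from `rolling_window_size`. The class name `GoodputRecorderConfig` and the property name are hypothetical. Only `rolling_window_size` and its default come from the diff above.

```python
import dataclasses
from typing import Sequence


@dataclasses.dataclass(frozen=True)
class GoodputRecorderConfig:
    """Hypothetical config holding only the window sizes, with no enable flag."""

    # Field name and default mirror the diff; units are not stated in the source.
    rolling_window_size: Sequence[int] = ()

    @property
    def rolling_window_enabled(self) -> bool:
        # Rolling window monitoring is on exactly when at least one window is configured.
        return len(self.rolling_window_size) > 0


default_cfg = GoodputRecorderConfig()
windowed_cfg = GoodputRecorderConfig(rolling_window_size=(1, 3, 5))
```

With a derived property there is no way to set the flag and the window sizes inconsistently, which removes one axis from the cross product of bool interactions mentioned above.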

