Skip to content

Workload Hang & Rolling Window Goodput Monitoring Support #1278

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
wants to merge 20 commits into
base: main
Choose a base branch
from

Conversation

dipannita08
Copy link
Contributor

Overview

This PR introduces significant enhancements to our Goodput measurement system, focusing on improved monitoring capabilities and robustness for long-running AxLearn training jobs.

Specifically, it adds:

  1. Workload Hang Monitoring Support: Detects and flags instances where the training workload might be stuck or unresponsive.
  2. Rolling Window Goodput Monitoring Support: Provides insights into goodput over dynamic, recent timeframes, complementing cumulative goodput metrics.
  3. Final Upload & Safe Exit: Ensures that all outstanding metrics are flushed and monitoring processes gracefully terminate upon job completion or shutdown.
  4. Updated Documents & Dashboard Templates: Documentation is updated with new metrics and points to custom dashboard templates for GCM.
  5. More Disruption, Checkpointing & GCS Badput Breakdown: More Badput Buckets

We also switch to context managers to demarcate start and end of recording and monitoring events. These features aim to provide more granular, real-time insights into training efficiency and promptly identify potential issues, leading to more stable and performant training workflows.

Testing

These monitoring features have been rigorously tested on long-running AxLearn training jobs to ensure their stability and accuracy under realistic conditions.

Specifically, validation runs on Fuji 7B & test models, example results:

  • Rolling Window Goodput [1d, 3d, 5d]:
    image
  • Workload Hang Monitoring:
    image
  • Cumulative Goodput:
    image
  • Cumulative Badput:
    image

@dipannita08 dipannita08 requested review from a team, ruomingp and markblee as code owners June 27, 2025 23:47
@findmyway findmyway self-assigned this Jun 30, 2025
Copy link
Contributor

@amcw7777 amcw7777 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Left some nits comment. Overall LGTM. I am going to make e2e testing before approval.

record_event_start(*args, **kwargs)
except (TypeError, ValueError, RuntimeError) as e:
logging.warning(
"Failed to record start of event %s. Error: %s", event.name, e, exc_info=True
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

event.name should be event.value.

record_event_end(*args, **kwargs)
except (TypeError, ValueError, RuntimeError) as e:
logging.warning(
"Failed to record end of event %s. Error: %s", event.name, e, exc_info=True
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ditto, should use event.value instead of event.name

try:
if record_event_end:
record_event_end(*args, **kwargs)
except (TypeError, ValueError, RuntimeError) as e:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: use RuntimeError?

Copy link
Contributor

@amcw7777 amcw7777 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM.
Manually tested and bumped tested.

@@ -4,7 +4,7 @@

Example:

# Enable Goodput when launching an AXLearn training job
# Enable Goodput when launching an AxLearn training job
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Unintended change?

Comment on lines +57 to +61
enable_gcp_goodput_metrics: bool = True
enable_pathways_goodput: bool = False
include_badput_breakdown: bool = True
enable_rolling_window_goodput_monitoring: bool = False
rolling_window_size: Sequence[int] = ()
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We try to avoid adding these kinds of bool feature flags in the API because they quickly become unmaintainable as we need to account for the cross product of all bool interactions.

Can enable_rolling_window_goodput_monitoring be inferred from whether len(rolling_window_size) > 0?
Can enable_pathways_goodput be inferred automatically if we detect proxy backend?
When would we want include_badput_breakdown to be False?
When would we want enable_gcp_goodput_metrics to be False? Should the user simply not configure this recorder if goodput metrics are not desired?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants