Add logging for dataset input and checkpoint saving #1190
linxiulei wants to merge 2 commits into apple:main from
Conversation
axlearn/common/trainer.py (Outdated)
```python
with LoggingContext("save_checkpoint", timeout=180, threshold=120):
    # Checkpointer policy will decide if we should save.
    self.save_checkpoint(evaler_summaries=evaler_summaries)
```
---
Can we make this configurable?
---
Sure, but I am wondering which way is better to make it configurable: an env var or flags?
---
The typical convention we use is that if we want to make the child configurable, then the parent config should contain a config for the child.
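As a sketch of that convention (hypothetical names, plain dataclasses rather than AXLearn's actual config classes), the parent config could embed a config for the child:

```python
from dataclasses import dataclass, field

# Hypothetical sketch: the parent's config embeds the child's config, so
# callers override child settings through the parent rather than via env
# vars or flags. These are illustrative dataclasses, not AXLearn's config API.
@dataclass
class LoggingContextConfig:
    timeout: float = 180.0    # seconds before a still-running op logs a warning
    threshold: float = 120.0  # total seconds of runtime before a warning

@dataclass
class TrainerConfig:
    # Child config nested inside the parent config.
    save_checkpoint_logging: LoggingContextConfig = field(
        default_factory=LoggingContextConfig
    )

cfg = TrainerConfig()
cfg.save_checkpoint_logging.threshold = 60.0  # override through the parent
```

Per-job overrides then go through the trainer config rather than a hard-coded value at the call site.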
```python
import time

# ...

class LoggingContext:
```
---
Could you explain how this differs from the other monitor functionality that already exists, like GPUMonitor? If they serve overlapping purposes, should they be unified?
---
Can you please clarify which other monitor functionality we already have? If they overlap, we can unify them.
---
TfSummaryMonitor, GPUMonitor, GoodputMonitor. I might be forgetting some additional ones.
---
Thanks, I checked those and believe they don't overlap: this PR is mostly for logging the execution time of specific operations (the utility function lets you monitor any operation you want), while the functionality you mentioned monitors for specific purposes (goodput, GPU hardware status, etc.).
```python
with LoggingContext("save_checkpoint", timeout=180, threshold=120):
    # Checkpointer policy will decide if we should save.
    self.save_checkpoint(evaler_summaries=evaler_summaries)
```
---
Since checkpoint syncing is asynchronous, do we expect to see timeouts inside this block?
---
Yes, IIUC checkpoint saving is asynchronous when the previous save has completed; otherwise it may get stuck in https://github.com/apple/axlearn/blob/main/axlearn/common/array_serialization.py#L554
axlearn/common/trainer.py (Outdated)
```python
stop_trace_step = None

input_iterator = self.input.batches(self._input_iter)
input_iterator = LoggingIterator(
```
---
Since this is already recorded in the Goodput measurement, I wonder if we need separate logging here.
If I am reading it correctly, this LoggingIterator creates one timer thread every time next() is called, which seems to add a lot of thread creation/cancellation overhead to the main process.
---
Assuming you mean this code:

```python
self._maybe_record_event(measurement.Event.START_DATA_LOADING)
try:
    input_batch = next(input_iterator)
    self._maybe_record_event(measurement.Event.END_DATA_LOADING)
```
If next() somehow hangs and the whole process then aborts, would the Goodput measurement help catch that?
My understanding is that this iterator invokes next() every few seconds, depending on the training scale, so creating and cleaning up a thread every few seconds should add only very minimal overhead (empirically, one thread takes 10 to 100 µs to create).
However, if you still have concerns about the overhead, we could create a reusable thread for the timer.
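A sketch of that reusable-thread idea (hypothetical, not the PR's implementation): a single long-lived watchdog thread that is armed before each operation and disarmed when the operation finishes, instead of creating and cancelling a timer thread per next() call:

```python
import threading
import time

class ReusableTimer:
    """One long-lived watchdog thread reused across operations, avoiding
    per-call thread creation/cancellation. Hypothetical sketch only."""

    def __init__(self):
        self._cond = threading.Condition()
        self._deadline = None  # monotonic time at which to fire, or None
        self._message = None
        self._fired = []       # stand-in for logging.warning calls
        self._stopped = False
        self._thread = threading.Thread(target=self._run, daemon=True)
        self._thread.start()

    def _run(self):
        with self._cond:
            while not self._stopped:
                if self._deadline is None:
                    self._cond.wait()  # nothing armed; sleep until notified
                else:
                    remaining = self._deadline - time.monotonic()
                    if remaining > 0:
                        self._cond.wait(timeout=remaining)
                    else:
                        # Deadline passed without a cancel: report the slow op.
                        self._fired.append(self._message)
                        self._deadline = None

    def start(self, message, timeout):
        """Arm the watchdog before an operation begins."""
        with self._cond:
            self._message = message
            self._deadline = time.monotonic() + timeout
            self._cond.notify()

    def cancel(self):
        """Disarm when the operation finishes in time."""
        with self._cond:
            self._deadline = None
            self._cond.notify()

    def stop(self):
        with self._cond:
            self._stopped = True
            self._cond.notify()
        self._thread.join()

timer = ReusableTimer()
timer.start("op-a timed out", timeout=0.05)
time.sleep(0.2)                 # op "hangs" past the deadline -> warning fires
timer.start("op-b timed out", timeout=10)
timer.cancel()                  # op finished in time -> no warning
timer.stop()
```

The trade-off is one condition-variable round trip per operation instead of one thread per operation.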
```python
"""A context manager for monitoring the execution time of operations and
logging warnings based on defined timeouts and thresholds.

This context manager can be used to:
- Log a warning if an operation exceeds a specified timeout while still
  running.
- Log a warning if an operation's total execution time exceeds a specified
  threshold.
"""
```
---
It's not clear that we need logging-upon-timeout, given the threading concerns raised by @Ethanlm. I wonder if a simpler semantic would be sufficient:
- LoggingContext always waits until the op finishes and logs a warning if the execution time exceeds the threshold;
- We rely on the watchdog (axlearn/common/trainer.py, line 190 in 2c4ec6e) to tell us where the threads are stuck.
WDYT?
---
I agree with starting with something simpler, so I changed this PR to only implement logging upon op completion.
However, I feel logging-upon-timeout is important because program hangs are difficult to debug by looking through the tracebacks that the watchdog prints for so many workers on timeout. A clear log message may help identify the workers that cause the hangs.
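Under the simplified semantic, a threshold-only LoggingContext might look like the sketch below (hypothetical body; only the parameter names follow the PR's usage):

```python
import logging
import time

class LoggingContext:
    """Wait for the wrapped op to finish, then warn if it ran longer than
    `threshold` seconds. No timer thread is involved."""

    def __init__(self, name: str, threshold: float):
        self.name = name
        self.threshold = threshold
        self.elapsed = None

    def __enter__(self):
        self._start = time.monotonic()
        return self

    def __exit__(self, exc_type, exc, tb):
        self.elapsed = time.monotonic() - self._start
        if self.elapsed > self.threshold:
            logging.warning(
                "%s took %.1fs (threshold %.1fs)",
                self.name, self.elapsed, self.threshold,
            )
        return False  # never suppress exceptions
```

It would be used as in the PR, e.g. `with LoggingContext("save_checkpoint", threshold=120): ...`, with no extra threads on the hot path.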
Introduce logging utilities designed to proactively identify and flag operations exceeding performance expectations.
```python
self._maybe_record_event(measurement.Event.START_DATA_LOADING)
try:
    with LoggingContext("input_iterator", threshold=60):
        input_batch = next(input_iterator)
```
---
Adding "step" information to the log can be useful when we try to debug.
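For example (a hypothetical sketch; `step` stands in for the trainer's `self.step`), the step could be embedded in the context name:

```python
# Hypothetical: include the current step in the LoggingContext name so that
# a slow-input warning identifies which step it happened at.
step = 42  # stand-in for self.step in the trainer
context_name = f"input_iterator step={step}"
# The call site would then read:
#   with LoggingContext(context_name, threshold=60):
#       input_batch = next(input_iterator)
```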
```python
with LoggingContext("save_checkpoint", threshold=120):
    # Checkpointer policy will decide if we should save.
    self.save_checkpoint(evaler_summaries=evaler_summaries)
```
---
Can we make it configurable?
We should be able to patch this ad hoc in jobs we want to debug and monitor closely, but the hard-coded 120s here makes it less flexible to apply in general jobs, and hence not a good candidate for merging as-is.
This pull request has been automatically marked as stale because it has been inactive for 60 days. It will be closed in 7 days if no further activity occurs. If you would like to continue working on this, please remove the
This pull request was closed because it has been inactive for more than 7 days since being marked as stale. Please feel free to reopen it if you would like to continue.
This PR adds: