Commit 99c1254

gitttt-1234claude authored and committed
Fix DDP synchronization in training callbacks
Fix multiple callbacks that caused training to freeze/deadlock after each epoch when using multi-GPU DDP training. The issue was that some processes reached trainer.strategy.barrier() while others returned early, causing infinite waits.

Callbacks fixed:
- UnifiedVizCallback: Changed early return pattern to use if block
- EpochEndEvaluationCallback: Added missing barrier, restructured to avoid early returns
- WandBVizCallback: Replaced early return with conditional block
- WandBVizCallbackWithPAFs: Same fix as WandBVizCallback

The correct pattern (used by working callbacks like CSVLoggerCallback, MatplotlibSaver, TrainingControllerZMQ) is:

    if trainer.is_global_zero:
        ...  # rank-0 only work (no early returns!)
    trainer.strategy.barrier()  # ALL ranks must reach this

Co-Authored-By: Claude Opus 4.5 <[email protected]>
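The deadlock mechanics described in the commit message can be reproduced without GPUs. The sketch below is a minimal simulation, not the real PyTorch Lightning API: the `Trainer` and `Strategy` classes are hypothetical stand-ins, and `threading.Barrier` plays the role of `trainer.strategy.barrier()` across ranks. It shows why an early return on non-zero ranks breaks the collective barrier, and why the if-block pattern does not.

```python
import threading

WORLD_SIZE = 2  # simulated number of DDP ranks


class Strategy:
    """Stand-in for a DDP strategy; barrier() blocks until all ranks arrive."""

    def __init__(self, barrier: threading.Barrier):
        self._barrier = barrier

    def barrier(self):
        self._barrier.wait()


class Trainer:
    """Stand-in for a Lightning-style trainer with rank info."""

    def __init__(self, rank: int, strategy: Strategy):
        self.is_global_zero = (rank == 0)
        self.strategy = strategy


def broken_epoch_end(trainer, results):
    # BAD: non-zero ranks return early and never reach the barrier,
    # so rank 0 blocks forever (here, the barrier times out instead).
    if not trainer.is_global_zero:
        return
    results.append("rank0 work")
    trainer.strategy.barrier()


def fixed_epoch_end(trainer, results):
    # GOOD: rank-0-only work lives inside an if block (no early return),
    # and every rank falls through to the same barrier call.
    if trainer.is_global_zero:
        results.append("rank0 work")
    trainer.strategy.barrier()


def run(callback):
    """Run the callback on all simulated ranks; report which ranks deadlocked."""
    barrier = threading.Barrier(WORLD_SIZE, timeout=1)  # timeout stands in for a hang
    strategy = Strategy(barrier)
    results, deadlocked = [], []

    def worker(rank):
        try:
            callback(Trainer(rank, strategy), results)
        except threading.BrokenBarrierError:
            deadlocked.append(rank)

    threads = [threading.Thread(target=worker, args=(r,)) for r in range(WORLD_SIZE)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return results, deadlocked
```

With `fixed_epoch_end`, both simulated ranks reach the barrier and training proceeds; with `broken_epoch_end`, rank 1 returns early and rank 0 is left stranded at the barrier, which is exactly the per-epoch freeze the commit fixes.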
1 parent aa32c1a commit 99c1254

File tree

1 file changed (+200, −185 lines)


0 commit comments
