Commit 99c1254
Fix DDP synchronization in training callbacks
Fix multiple callbacks that caused training to deadlock after each epoch
during multi-GPU DDP training. The issue was that some ranks reached
trainer.strategy.barrier() while others returned early, leaving the waiting
ranks blocked indefinitely.
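For illustration, a minimal sketch of the failure mode, assuming a PyTorch
Lightning 2.x-style callback (the class name, hook choice, and the
save_visualizations helper are hypothetical, not the repository's code):

    from lightning.pytorch.callbacks import Callback

    class BrokenVizCallback(Callback):
        def on_train_epoch_end(self, trainer, pl_module):
            # Non-zero ranks bail out here and never reach the barrier below.
            if not trainer.is_global_zero:
                return
            save_visualizations(trainer)  # hypothetical rank-0-only work
            trainer.strategy.barrier()    # only rank 0 reaches this -> deadlock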
Callbacks fixed:
- UnifiedVizCallback: Changed the early-return pattern to a conditional block
- EpochEndEvaluationCallback: Added the missing barrier and restructured to
  avoid early returns
- WandBVizCallback: Replaced the early return with a conditional block
- WandBVizCallbackWithPAFs: Same fix as WandBVizCallback
The correct pattern (used by working callbacks like CSVLoggerCallback,
MatplotlibSaver, TrainingControllerZMQ) is:
    if trainer.is_global_zero:
        ...  # rank-0 only work (no early returns!)
    trainer.strategy.barrier()  # ALL ranks must reach this
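Applied to a callback, the fixed shape looks roughly like this (again a
sketch; the class name and save_visualizations helper are illustrative):

    from lightning.pytorch.callbacks import Callback

    class FixedVizCallback(Callback):
        def on_train_epoch_end(self, trainer, pl_module):
            if trainer.is_global_zero:
                # Rank-0-only work stays inside the conditional; no returns here.
                save_visualizations(trainer)  # hypothetical helper
            # Every rank, including rank 0, falls through to the same barrier.
            trainer.strategy.barrier()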
Co-Authored-By: Claude Opus 4.5 <[email protected]>
1 file changed: +200, -185 lines