Commit 99c1254
Fix DDP synchronization in training callbacks
Fix multiple callbacks that caused training to deadlock after each epoch
during multi-GPU DDP training. The issue was that some ranks reached
trainer.strategy.barrier() while others returned early, leaving the waiting
ranks blocked indefinitely.
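For illustration, a minimal sketch of the failure mode, assuming a PyTorch
Lightning 2.x-style callback (the class name, hook choice, and the
save_visualizations helper are hypothetical, not the repository's code):

    from lightning.pytorch.callbacks import Callback

    class BrokenVizCallback(Callback):
        def on_train_epoch_end(self, trainer, pl_module):
            # Non-zero ranks bail out here and never reach the barrier below.
            if not trainer.is_global_zero:
                return
            save_visualizations(trainer)  # hypothetical rank-0-only work
            trainer.strategy.barrier()    # only rank 0 reaches this -> deadlock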
Callbacks fixed:
- UnifiedVizCallback: Changed the early-return pattern to a conditional block
- EpochEndEvaluationCallback: Added the missing barrier and restructured to
  avoid early returns
- WandBVizCallback: Replaced the early return with a conditional block
- WandBVizCallbackWithPAFs: Same fix as WandBVizCallback
The correct pattern (used by working callbacks like CSVLoggerCallback,
MatplotlibSaver, TrainingControllerZMQ) is:
    if trainer.is_global_zero:
        ...  # rank-0 only work (no early returns!)
    trainer.strategy.barrier()  # ALL ranks must reach this
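Applied to a callback, the fixed shape looks roughly like this (again a
sketch; the class name and save_visualizations helper are illustrative):

    from lightning.pytorch.callbacks import Callback

    class FixedVizCallback(Callback):
        def on_train_epoch_end(self, trainer, pl_module):
            if trainer.is_global_zero:
                # Rank-0-only work stays inside the conditional; no returns here.
                save_visualizations(trainer)  # hypothetical helper
            # Every rank, including rank 0, falls through to the same barrier.
            trainer.strategy.barrier()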
Co-Authored-By: Claude Opus 4.5 <[email protected]>
1 file changed: +200, -185 lines