Commit 699a8e2

gitttt-1234 and claude authored
Fix DDP synchronization in training callbacks (#437)
## Summary

- Fix multiple callbacks that caused training to freeze/deadlock after each epoch when using multi-GPU DDP training (`trainer_devices > 1`)
- The issue was that some processes reached `trainer.strategy.barrier()` while others returned early, causing infinite waits

## Problem

When running multi-GPU DDP training, training would appear to freeze after each epoch. This was caused by DDP synchronization bugs in several callbacks where:

1. **Non-rank-0 processes** returned early without reaching the barrier
2. **Rank-0** returned early (e.g., when wandb logger was None) without reaching the barrier
3. **Missing barriers** in some callbacks entirely

This caused deadlocks because `trainer.strategy.barrier()` requires ALL ranks to reach the same point.

## Callbacks Fixed

| Callback | Issue | Fix |
|----------|-------|-----|
| `UnifiedVizCallback` | Early return for non-rank-0 skipped barrier | Changed to `if trainer.is_global_zero:` block pattern |
| `EpochEndEvaluationCallback` | No barrier at all, multiple early returns | Added barrier, restructured to avoid early returns |
| `WandBVizCallback` | Early return inside rank-0 check skipped barrier | Replaced early return with conditional block |
| `WandBVizCallbackWithPAFs` | Same as above | Same fix |

## Correct Pattern

The correct pattern (used by working callbacks like `CSVLoggerCallback`, `MatplotlibSaver`, `TrainingControllerZMQ`):

```python
def on_train_epoch_end(self, trainer, pl_module):
    if trainer.is_global_zero:
        # rank-0 only work (no early returns!)
        ...

    # ALL ranks ALWAYS reach this barrier
    trainer.strategy.barrier()
```

## Test plan

- [ ] Manually tested on 4x NVIDIA A40 system with `trainer_devices=2` and `log_viz=true`
- [ ] Verified epochs complete without freezing
- [ ] No automated multi-GPU test (CI lacks multi-GPU environment)

## Related

- Follows up on #436 which fixed the duplicate GPU detection error
- Investigation documented in `scratch/2026-01-23-multi-gpu-ddp-investigation/`

🤖 Generated with [Claude Code](https://claude.ai/claude-code)

---------

Co-authored-by: Claude Opus 4.5 <[email protected]>
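
## Illustration (not part of the change)

Since CI has no multi-GPU environment, the failure mode described under Problem can be approximated on a single machine with threads. The sketch below is a simulation only: `FakeStrategy`, `FakeTrainer`, `buggy_epoch_end`, `fixed_epoch_end`, and `run` are hypothetical names invented for this example, not code from the repository or from Lightning. It assumes a `threading.Barrier` with a timeout is an acceptable stand-in for `trainer.strategy.barrier()`, so a rank that would hang forever in real DDP reports `"deadlock"` here instead.

```python
import threading


class FakeStrategy:
    """Hypothetical stand-in for trainer.strategy (assumption, not Lightning code)."""

    def __init__(self, world_size, timeout):
        self._barrier = threading.Barrier(world_size)
        self._timeout = timeout

    def barrier(self):
        # A real DDP barrier blocks indefinitely; the timeout makes the hang observable.
        self._barrier.wait(timeout=self._timeout)


class FakeTrainer:
    """Hypothetical stand-in exposing only the attributes the hooks below use."""

    def __init__(self, rank, strategy, logger=None):
        self.global_rank = rank
        self.strategy = strategy
        self.logger = logger

    @property
    def is_global_zero(self):
        return self.global_rank == 0


def buggy_epoch_end(trainer):
    if not trainer.is_global_zero:
        return  # non-rank-0 leaves early and never reaches the barrier
    trainer.strategy.barrier()


def fixed_epoch_end(trainer):
    if trainer.is_global_zero:
        if trainer.logger is not None:
            pass  # rank-0 only work goes here, guarded by a block, not a return
    # every rank reaches the barrier on every code path
    trainer.strategy.barrier()


def run(hook, world_size=2, timeout=0.5):
    """Run `hook` once per simulated rank and report which ranks got stuck."""
    strategy = FakeStrategy(world_size, timeout)
    results = {}

    def worker(rank):
        try:
            hook(FakeTrainer(rank, strategy))
            results[rank] = "ok"
        except threading.BrokenBarrierError:
            results[rank] = "deadlock"

    threads = [threading.Thread(target=worker, args=(r,)) for r in range(world_size)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return results


if __name__ == "__main__":
    print("buggy:", run(buggy_epoch_end))
    print("fixed:", run(fixed_epoch_end))
```

With the buggy hook, rank 0 waits at the barrier for a peer that already returned. With the fixed hook, both ranks pass the barrier even when the logger is `None`, which corresponds to case 2 in the Problem list.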
1 parent aa32c1a commit 699a8e2

1 file changed

+199
-185
lines changed
