Commit 699a8e2
Fix DDP synchronization in training callbacks (#437)
## Summary
- Fix multiple callbacks that caused training to freeze/deadlock after
each epoch when using multi-GPU DDP training (`trainer_devices > 1`)
- The issue was that some processes reached `trainer.strategy.barrier()`
while others returned early, causing infinite waits
## Problem
When running multi-GPU DDP training, training would appear to freeze
after each epoch. This was caused by DDP synchronization bugs in several
callbacks where:
1. **Non-rank-0 processes** returned early without reaching the barrier
2. **Rank 0** returned early (e.g., when the wandb logger was `None`) without
   reaching the barrier
3. Some callbacks were **missing barriers** entirely
This caused deadlocks because `trainer.strategy.barrier()` requires ALL
ranks to reach the same point.
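To make the failure mode concrete, here is a minimal, framework-free sketch of the deadlock, simulating DDP ranks with threads and `threading.Barrier` in place of `trainer.strategy.barrier()` (all names are illustrative; a real DDP run would hang rather than raise, so the sketch uses a barrier timeout to surface the bug):

```python
import threading

def run_ranks(epoch_end, world_size=2, timeout=1.0):
    """Run `epoch_end(rank, barrier)` on each simulated rank; collect barrier errors."""
    barrier = threading.Barrier(world_size, timeout=timeout)
    errors = [None] * world_size

    def worker(rank):
        try:
            epoch_end(rank, barrier)
        except threading.BrokenBarrierError as e:
            errors[rank] = e  # in real DDP this would be an infinite wait

    threads = [threading.Thread(target=worker, args=(r,)) for r in range(world_size)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return errors

def buggy_epoch_end(rank, barrier):
    if rank != 0:
        return        # early return: non-rank-0 never reaches the barrier
    barrier.wait()    # rank 0 waits alone -> times out (deadlock in real DDP)

def fixed_epoch_end(rank, barrier):
    if rank == 0:
        pass          # rank-0-only work goes here (no early returns)
    barrier.wait()    # ALL ranks always reach the barrier

print(any(run_ranks(buggy_epoch_end)))  # True: the barrier broke
print(any(run_ranks(fixed_epoch_end)))  # False: all ranks synchronized
```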
## Callbacks Fixed
| Callback | Issue | Fix |
|----------|-------|-----|
| `UnifiedVizCallback` | Early return for non-rank-0 skipped barrier | Changed to `if trainer.is_global_zero:` block pattern |
| `EpochEndEvaluationCallback` | No barrier at all, multiple early returns | Added barrier, restructured to avoid early returns |
| `WandBVizCallback` | Early return inside rank-0 check skipped barrier | Replaced early return with conditional block |
| `WandBVizCallbackWithPAFs` | Same as above | Same fix |
## Correct Pattern
The correct pattern, already used by working callbacks such as
`CSVLoggerCallback`, `MatplotlibSaver`, and `TrainingControllerZMQ`:
```python
def on_train_epoch_end(self, trainer, pl_module):
if trainer.is_global_zero:
# rank-0 only work (no early returns!)
...
# ALL ranks ALWAYS reach this barrier
trainer.strategy.barrier()
```
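A hedged before/after sketch of the `WandBVizCallback` fix described in the table above, stripped down to plain functions so it can be run standalone. `logger` stands in for the wandb logger, `arrivals` for ranks reaching `trainer.strategy.barrier()`; all names are illustrative, not the actual implementation:

```python
def epoch_end_before(rank, logger, arrivals):
    """Buggy shape: an early return inside the rank-0 check skips the barrier."""
    if rank == 0:
        if logger is None:
            return           # rank 0 bails out and never reaches the barrier
        pass                 # ... logging work ...
    arrivals.append(rank)    # stands in for trainer.strategy.barrier()

def epoch_end_after(rank, logger, arrivals):
    """Fixed shape: the early return becomes a conditional block."""
    if rank == 0 and logger is not None:
        pass                 # ... logging work ...
    arrivals.append(rank)    # ALL ranks always reach the barrier

before, after = [], []
for rank in (0, 1):          # simulate a 2-GPU run with no logger configured
    epoch_end_before(rank, None, before)
    epoch_end_after(rank, None, after)

print(before)  # [1] -- rank 0 missing: a real run would deadlock here
print(after)   # [0, 1] -- every rank reaches the barrier
```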
## Test plan
- [ ] Manually tested on a 4x NVIDIA A40 system with `trainer_devices=2`
  and `log_viz=true`
- [ ] Verified epochs complete without freezing
- [ ] No automated multi-GPU test (CI lacks multi-GPU environment)
## Related
- Follows up on #436, which fixed the duplicate GPU detection error
- Investigation documented in
`scratch/2026-01-23-multi-gpu-ddp-investigation/`
🤖 Generated with [Claude Code](https://claude.ai/claude-code)
---------
Co-authored-by: Claude Opus 4.5 <[email protected]>