Fix DDP synchronization in training callbacks #437
Merged
## Summary
- Fixes DDP synchronization deadlocks in training callbacks when running on more than one GPU (`trainer_devices > 1`)
- Some ranks called `trainer.strategy.barrier()` while others returned early, causing infinite waits

## Problem
When running multi-GPU DDP training, training would appear to freeze after each epoch. This was caused by DDP synchronization bugs in several callbacks where some ranks reached `trainer.strategy.barrier()` while other ranks returned early from the hook.
This caused deadlocks because `trainer.strategy.barrier()` is a collective call: it requires ALL ranks to reach the same point. A minimal sketch of the broken shape is shown below.
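The sketch assumes a Lightning-style callback; the class name and hook body are hypothetical, not the repo's actual code:

```python
import lightning.pytorch as pl  # assumes the `lightning` >= 2.0 import path


class BrokenVizCallback(pl.Callback):
    """Hypothetical callback reproducing the hang described above."""

    def on_train_epoch_end(self, trainer: pl.Trainer, pl_module: pl.LightningModule) -> None:
        if not trainer.is_global_zero:
            # Ranks 1..N-1 leave the hook here and move on.
            return

        # ... rank-0-only work (e.g. rendering and logging visualizations) ...

        # Rank 0 then waits for the other ranks, but they already returned
        # above, so this collective never completes and training freezes.
        trainer.strategy.barrier()
```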
## Callbacks Fixed
- `UnifiedVizCallback`: `if trainer.is_global_zero:` block pattern
- `EpochEndEvaluationCallback`
- `WandBVizCallback`
- `WandBVizCallbackWithPAFs`

## Correct Pattern
The correct pattern (used by working callbacks like `CSVLoggerCallback`, `MatplotlibSaver`, and `TrainingControllerZMQ`) is sketched below.
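As a hedged sketch (again a hypothetical callback, not the repo's actual code): rank-0-only work stays inside the `if trainer.is_global_zero:` guard, and any collective call such as `trainer.strategy.barrier()` sits where every rank reaches it.

```python
import lightning.pytorch as pl


class FixedVizCallback(pl.Callback):
    """Hypothetical sketch of the safe pattern: no early returns before collectives."""

    def on_train_epoch_end(self, trainer: pl.Trainer, pl_module: pl.LightningModule) -> None:
        # Every rank runs the hook top to bottom; only the work itself is guarded.
        if trainer.is_global_zero:
            ...  # rank-0-only work (e.g. rendering and logging visualizations)

        # All ranks reach the barrier, so it completes and the next epoch starts.
        trainer.strategy.barrier()
```

Whether the trailing barrier is needed at all depends on the callback; the key property is that any collective call is reached by every rank (or by none).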
## Test plan
- Ran multi-GPU training with `trainer_devices=2` and `log_viz=true`
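For context, `trainer_devices=2` is assumed here to map onto a Lightning `Trainer` configured for 2-GPU DDP, roughly like the following (the entry point and `log_viz` handling are repo-specific and not shown):

```python
from lightning.pytorch import Trainer

# Assumed equivalent of `trainer_devices=2`: two GPUs under the DDP strategy.
trainer = Trainer(accelerator="gpu", devices=2, strategy="ddp")
# trainer.fit(model, datamodule=datamodule)  # hypothetical model / datamodule
```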
See `scratch/2026-01-23-multi-gpu-ddp-investigation/` for the investigation notes.

🤖 Generated with Claude Code