
Conversation

@gitttt-1234
Collaborator

Summary

  • Fix multiple callbacks that caused training to freeze/deadlock after each epoch when using multi-GPU DDP training (trainer_devices > 1)
  • The issue was that some processes reached trainer.strategy.barrier() while others returned early, causing infinite waits

Problem

When running multi-GPU DDP training, training would appear to freeze after each epoch. This was caused by DDP synchronization bugs in several callbacks where:

  1. Non-rank-0 processes returned early without reaching the barrier
  2. Rank-0 returned early (e.g., when wandb logger was None) without reaching the barrier
  3. Some callbacks were missing a barrier entirely

This caused deadlocks because trainer.strategy.barrier() requires ALL ranks to reach the same point.
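For illustration, here is a minimal sketch of the buggy shape. The callback and the `StubTrainer` are hypothetical stand-ins (so the snippet runs standalone); the real callbacks use Lightning's `Trainer`, whose `is_global_zero` and `strategy.barrier()` attributes this stub mimics:

```python
class StubTrainer:
    """Hypothetical stand-in for a Lightning Trainer on a non-rank-0 process."""
    is_global_zero = False

    class strategy:
        reached_barrier = False

        @classmethod
        def barrier(cls):
            cls.reached_barrier = True

def buggy_on_train_epoch_end(trainer, pl_module):
    if not trainer.is_global_zero:
        return  # BUG: non-rank-0 ranks never reach the barrier below
    # ... rank-0-only logging/visualization work ...
    trainer.strategy.barrier()  # only rank 0 arrives here -> deadlock

trainer = StubTrainer()
buggy_on_train_epoch_end(trainer, pl_module=None)
print(trainer.strategy.reached_barrier)  # → False: this rank skipped the barrier
```

In real DDP training, rank 0 would then block forever inside `barrier()`, waiting for ranks that already returned.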

Callbacks Fixed

| Callback | Issue | Fix |
| --- | --- | --- |
| `UnifiedVizCallback` | Early return for non-rank-0 skipped the barrier | Changed to `if trainer.is_global_zero:` block pattern |
| `EpochEndEvaluationCallback` | No barrier at all; multiple early returns | Added barrier; restructured to avoid early returns |
| `WandBVizCallback` | Early return inside rank-0 check skipped the barrier | Replaced early return with conditional block |
| `WandBVizCallbackWithPAFs` | Same as `WandBVizCallback` | Same fix |

Correct Pattern

The correct pattern (used by working callbacks like CSVLoggerCallback, MatplotlibSaver, TrainingControllerZMQ):

def on_train_epoch_end(self, trainer, pl_module):
    if trainer.is_global_zero:
        # rank-0 only work (no early returns!)
        ...
    # ALL ranks ALWAYS reach this barrier
    trainer.strategy.barrier()
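The deadlock mechanism can also be reproduced outside DDP. This is an illustrative sketch (not the project's code) that simulates ranks with threads and `trainer.strategy.barrier()` with `threading.Barrier`, using a timeout so the "hang" is observable:

```python
import threading

def run_epoch_end(barrier, n_ranks, early_return_ranks):
    """Simulate n_ranks processes hitting a barrier; some return early."""
    outcomes = {}

    def worker(rank):
        if rank in early_return_ranks:
            return  # mimics a callback's early return before the barrier
        try:
            barrier.wait(timeout=0.5)  # stand-in for trainer.strategy.barrier()
            outcomes[rank] = "passed"
        except threading.BrokenBarrierError:
            outcomes[rank] = "stuck"  # in real DDP this wait has no timeout

    threads = [threading.Thread(target=worker, args=(r,)) for r in range(n_ranks)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return outcomes

# Buggy pattern: rank 1 returns early, so rank 0 is left waiting.
print(run_epoch_end(threading.Barrier(2), 2, early_return_ranks={1}))  # → {0: 'stuck'}
# Correct pattern: every rank reaches the barrier and everyone proceeds.
print(run_epoch_end(threading.Barrier(2), 2, early_return_ranks=set()))  # both ranks 'passed'
```

The only difference between the two runs is whether every rank reaches the barrier, which is exactly what the restructured callbacks guarantee.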

Test plan

  • Manually tested on a 4x NVIDIA A40 system with `trainer_devices=2` and `log_viz=true`
  • Verified epochs complete without freezing
  • No automated multi-GPU test (CI lacks multi-GPU environment)

🤖 Generated with Claude Code

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
@codecov

codecov bot commented Jan 23, 2026

Codecov Report

❌ Patch coverage is 82.35294% with 15 lines in your changes missing coverage. Please review.
✅ Project coverage is 84.56%. Comparing base (ff91433) to head (ae30518).
⚠️ Report is 124 commits behind head on main.

| Files with missing lines | Patch % | Lines |
| --- | --- | --- |
| sleap_nn/training/callbacks.py | 82.35% | 15 Missing ⚠️ |

❗ There is a different number of reports uploaded between BASE (ff91433) and HEAD (ae30518).

HEAD has 1 fewer upload than BASE.

| Flag | BASE (ff91433) | HEAD (ae30518) |
| --- | --- | --- |
|  | 4 | 3 |
@@             Coverage Diff             @@
##             main     #437       +/-   ##
===========================================
- Coverage   95.28%   84.56%   -10.72%     
===========================================
  Files          49       74       +25     
  Lines        6765    10770     +4005     
===========================================
+ Hits         6446     9108     +2662     
- Misses        319     1662     +1343     

@gitttt-1234 gitttt-1234 merged commit 699a8e2 into main Jan 23, 2026
7 checks passed
@gitttt-1234 gitttt-1234 deleted the fix/ddp-callback-synchronization branch January 23, 2026 18:55
