
Conversation

@gitttt-1234
Collaborator

Summary

  • Fix multiple callbacks that caused training to freeze/deadlock after each epoch when using multi-GPU DDP training (trainer_devices > 1)
  • The issue was that some processes reached trainer.strategy.barrier() while others returned early, causing infinite waits

Problem

When running multi-GPU DDP training, training would appear to freeze after each epoch. This was caused by DDP synchronization bugs in several callbacks where:

  1. Non-rank-0 processes returned early without reaching the barrier
  2. Rank-0 returned early (e.g., when wandb logger was None) without reaching the barrier
  3. Some callbacks were missing a barrier entirely

This caused deadlocks because trainer.strategy.barrier() requires ALL ranks to reach the same point.
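For illustration, here is a minimal sketch of the buggy shape. The callback and the `StubTrainer` are hypothetical stand-ins (so the snippet runs standalone); the real callbacks use Lightning's `Trainer`, whose `is_global_zero` and `strategy.barrier()` attributes this stub mimics:

```python
class StubTrainer:
    """Hypothetical stand-in for a Lightning Trainer on a non-rank-0 process."""
    is_global_zero = False

    class strategy:
        reached_barrier = False

        @classmethod
        def barrier(cls):
            cls.reached_barrier = True

def buggy_on_train_epoch_end(trainer, pl_module):
    if not trainer.is_global_zero:
        return  # BUG: non-rank-0 ranks never reach the barrier below
    # ... rank-0-only logging/visualization work ...
    trainer.strategy.barrier()  # only rank 0 arrives here -> deadlock

trainer = StubTrainer()
buggy_on_train_epoch_end(trainer, pl_module=None)
print(trainer.strategy.reached_barrier)  # → False: this rank skipped the barrier
```

In real DDP training, rank 0 would then block forever inside `barrier()`, waiting for ranks that already returned.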

Callbacks Fixed

| Callback | Issue | Fix |
| --- | --- | --- |
| `UnifiedVizCallback` | Early return for non-rank-0 skipped the barrier | Changed to `if trainer.is_global_zero:` block pattern |
| `EpochEndEvaluationCallback` | No barrier at all; multiple early returns | Added barrier; restructured to avoid early returns |
| `WandBVizCallback` | Early return inside rank-0 check skipped the barrier | Replaced early return with conditional block |
| `WandBVizCallbackWithPAFs` | Same as `WandBVizCallback` | Same fix |

Correct Pattern

The correct pattern (used by working callbacks like CSVLoggerCallback, MatplotlibSaver, TrainingControllerZMQ):

def on_train_epoch_end(self, trainer, pl_module):
    if trainer.is_global_zero:
        # rank-0 only work (no early returns!)
        ...
    # ALL ranks ALWAYS reach this barrier
    trainer.strategy.barrier()
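The deadlock mechanism can also be reproduced outside DDP. This is an illustrative sketch (not the project's code) that simulates ranks with threads and `trainer.strategy.barrier()` with `threading.Barrier`, using a timeout so the "hang" is observable:

```python
import threading

def run_epoch_end(barrier, n_ranks, early_return_ranks):
    """Simulate n_ranks processes hitting a barrier; some return early."""
    outcomes = {}

    def worker(rank):
        if rank in early_return_ranks:
            return  # mimics a callback's early return before the barrier
        try:
            barrier.wait(timeout=0.5)  # stand-in for trainer.strategy.barrier()
            outcomes[rank] = "passed"
        except threading.BrokenBarrierError:
            outcomes[rank] = "stuck"  # in real DDP this wait has no timeout

    threads = [threading.Thread(target=worker, args=(r,)) for r in range(n_ranks)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return outcomes

# Buggy pattern: rank 1 returns early, so rank 0 is left waiting.
print(run_epoch_end(threading.Barrier(2), 2, early_return_ranks={1}))  # → {0: 'stuck'}
# Correct pattern: every rank reaches the barrier and everyone proceeds.
print(run_epoch_end(threading.Barrier(2), 2, early_return_ranks=set()))  # both ranks 'passed'
```

The only difference between the two runs is whether every rank reaches the barrier, which is exactly what the restructured callbacks guarantee.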

Test plan

  • Manually tested on a 4x NVIDIA A40 system with `trainer_devices=2` and `log_viz=true`
  • Verified epochs complete without freezing
  • No automated multi-GPU test (CI lacks multi-GPU environment)

🤖 Generated with Claude Code

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
@codecov

codecov bot commented Jan 23, 2026

Codecov Report

❌ Patch coverage is 82.35294% with 15 lines in your changes missing coverage. Please review.
✅ Project coverage is 84.56%. Comparing base (ff91433) to head (ae30518).
⚠️ Report is 124 commits behind head on main.

| Files with missing lines | Patch % | Lines |
| --- | --- | --- |
| sleap_nn/training/callbacks.py | 82.35% | 15 Missing ⚠️ |

❗ There is a different number of reports uploaded between BASE (ff91433) and HEAD (ae30518).

HEAD has 1 fewer upload than BASE.

| Flag | BASE (ff91433) | HEAD (ae30518) |
| --- | --- | --- |
|  | 4 | 3 |
@@             Coverage Diff             @@
##             main     #437       +/-   ##
===========================================
- Coverage   95.28%   84.56%   -10.72%     
===========================================
  Files          49       74       +25     
  Lines        6765    10770     +4005     
===========================================
+ Hits         6446     9108     +2662     
- Misses        319     1662     +1343     

@gitttt-1234 gitttt-1234 merged commit 699a8e2 into main Jan 23, 2026
7 checks passed
@gitttt-1234 gitttt-1234 deleted the fix/ddp-callback-synchronization branch January 23, 2026 18:55
