Evaluation error RuntimeError: ProcessGroupWrapper: Monitored Barrier encountered error running collective #1121

Description

@samirchar

I am training OpenCLIP in a distributed environment with 4 nodes and 8 GPUs per node. I set epochs to 40, and after training finishes the last epoch (epoch 39) I get the runtime error below. Note that it trains fine up to and including the last epoch, and the evaluation after every earlier epoch also works, but it fails when it enters the final evaluation. The traceback is below. Shouldn't the model be unwrapped from DDP in evaluate if eval only runs on the master rank? I think that while the last evaluation is running on master, the other ranks are already finishing, so DDP tries to sync and fails; see the sketch after the traceback.

I am using PyTorch 2.7.1 with Python 3.10 and CUDA 12.6.

2025-10-30,14:14:29 | INFO | Start epoch 39
2025-10-30,14:14:32 | INFO | Train Epoch: 39 [ 8192/2916352 (0%)] Data (t): 2.483 Batch (t): 3.019, 2713.79/s, 84.8059/s/gpu LR: 0.000002 Logit Scale: 88.598 Contrastive_loss: 0.046415 (0.046415) Loss: 0.046415 (0.046415)
2025-10-30,14:15:42 | INFO | Train Epoch: 39 [ 827392/2916352 (28%)] Data (t): 0.085 Batch (t): 0.699, 11850.7/s, 370.335/s/gpu LR: 0.000001 Logit Scale: 88.607 Contrastive_loss: 0.059184 (0.052800) Loss: 0.059184 (0.052800)
2025-10-30,14:16:51 | INFO | Train Epoch: 39 [1646592/2916352 (56%)] Data (t): 0.085 Batch (t): 0.699, 11751.0/s, 367.218/s/gpu LR: 0.000000 Logit Scale: 88.611 Contrastive_loss: 0.048979 (0.051526) Loss: 0.048979 (0.051526)
2025-10-30,14:18:01 | INFO | Train Epoch: 39 [2465792/2916352 (85%)] Data (t): 0.083 Batch (t): 0.697, 12237.6/s, 382.423/s/gpu LR: 0.000000 Logit Scale: 88.612 Contrastive_loss: 0.049657 (0.051059) Loss: 0.049657 (0.051059)
2025-10-30,14:18:39 | INFO | Train Epoch: 39 [2916352/2916352 (100%)] Data (t): 0.082 Batch (t): 0.685, 13228.0/s, 413.375/s/gpu LR: 0.000000 Logit Scale: 88.612 Contrastive_loss: 0.046689 (0.050185) Loss: 0.046689 (0.050185)
Traceback (most recent call last):
File "/scratch/amlt_code/bin/train.py", line 1081, in
main(sys.argv[1:])
File "/scratch/amlt_code/bin/train.py", line 1013, in main
evaluate(model, data, completed_epoch, args, tb_writer=writer, tokenizer=tokenizer)
File "/home/aiscuser/.local/lib/python3.10/site-packages/open_clip_train/train.py", line 281, in evaluate
model_out = model(images, texts)
File "/opt/conda/envs/ptca/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1751, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/opt/conda/envs/ptca/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1762, in _call_impl
return forward_call(*args, **kwargs)
File "/opt/conda/envs/ptca/lib/python3.10/site-packages/torch/nn/parallel/distributed.py", line 1633, in forward
inputs, kwargs = self._pre_forward(*inputs, **kwargs)
File "/opt/conda/envs/ptca/lib/python3.10/site-packages/torch/nn/parallel/distributed.py", line 1529, in _pre_forward
self._sync_buffers()
File "/opt/conda/envs/ptca/lib/python3.10/site-packages/torch/nn/parallel/distributed.py", line 2166, in _sync_buffers
self._sync_module_buffers(authoritative_rank)
File "/opt/conda/envs/ptca/lib/python3.10/site-packages/torch/nn/parallel/distributed.py", line 2170, in _sync_module_buffers
self._default_broadcast_coalesced(authoritative_rank=authoritative_rank)
File "/opt/conda/envs/ptca/lib/python3.10/site-packages/torch/nn/parallel/distributed.py", line 2192, in _default_broadcast_coalesced
self._distributed_broadcast_coalesced(bufs, bucket_size, authoritative_rank)
File "/opt/conda/envs/ptca/lib/python3.10/site-packages/torch/nn/parallel/distributed.py", line 2107, in _distributed_broadcast_coalesced
dist._broadcast_coalesced(
RuntimeError: ProcessGroupWrapper: Monitored Barrier encountered error running collective: CollectiveFingerPrint(SequenceNumber=327510, OpType=BROADCAST, TensorShape=[5929], TensorDtypes=Float, TensorDeviceTypes=TensorOptions(dtype=float (default), device=cuda, layout=Strided (default), requires_grad=false (default), pinned_memory=false (default), memory_format=(nullopt))). Error:
[/pytorch/third_party/gloo/gloo/transport/tcp/pair.cc:534] Connection closed by peer [172.25.1.192]:55498
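For reference, here is roughly the kind of change I have in mind. This is only a minimal sketch, assuming the usual open_clip_train setup where `model` is the DistributedDataParallel wrapper, `is_master(args)` gates evaluation, and `args.distributed` is set; the helper name `run_final_evaluation` is mine, not part of the library:

```python
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

from open_clip_train.distributed import is_master
from open_clip_train.train import evaluate


def run_final_evaluation(model, data, epoch, args, tb_writer=None, tokenizer=None):
    # Evaluate the underlying module so forward() does not go through
    # DDP's _pre_forward -> _sync_buffers broadcast on a single rank.
    eval_model = model.module if isinstance(model, DDP) else model

    if is_master(args):
        evaluate(eval_model, data, epoch, args, tb_writer=tb_writer, tokenizer=tokenizer)

    # Keep the non-master ranks parked here until rank 0 finishes, so no
    # peer closes its connection while a collective could still be issued.
    if args.distributed:
        dist.barrier()
```

With something like this, the final evaluation would run on the plain module on rank 0 only, and the barrier would stop the other ranks from exiting (and closing their Gloo connections) while rank 0 is still working.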
