Description
I am training OpenCLIP in a distributed environment with 4 nodes and 8 GPUs per node. I set epochs to 40, and after training finishes the last epoch (epoch 39) I get a CUDA/distributed error. Note that it trains fine through the last epoch, including the evaluation after each epoch, but it fails when it enters the final evaluation. The traceback is below. Shouldn't the model be unwrapped from DDP for evaluate() when eval only runs on the master rank? I think that while the last evaluation is still running on the master, the other ranks finish and exit, so DDP tries to sync buffers and fails.
I am using PyTorch 2.7.1 with Python 3.10 and CUDA 12.6.
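For reference, here is a minimal sketch of the change I have in mind, assuming evaluate() is only called on the master rank as in my run. The unwrap_model helper and the call site shown are illustrative, not the actual open_clip_train code:

```python
# Minimal sketch (not the actual open_clip_train code): unwrap DDP before evaluation
# so the master rank's forward pass does not trigger a cross-rank buffer broadcast
# while the other ranks may already have exited.
import torch.nn as nn

def unwrap_model(model: nn.Module) -> nn.Module:
    # DistributedDataParallel exposes the wrapped model as .module.
    return model.module if hasattr(model, "module") else model

# Hypothetical call site, mirroring the evaluate() call from the traceback:
# evaluate(unwrap_model(model), data, completed_epoch, args,
#          tb_writer=writer, tokenizer=tokenizer)
```

With the bare module, no collective is issued during eval, so it should not matter whether the other ranks are still alive at that point.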
2025-10-30,14:14:29 | INFO | Start epoch 39
2025-10-30,14:14:32 | INFO | Train Epoch: 39 [ 8192/2916352 (0%)] Data (t): 2.483 Batch (t): 3.019, 2713.79/s, 84.8059/s/gpu LR: 0.000002 Logit Scale: 88.598 Contrastive_loss: 0.046415 (0.046415) Loss: 0.046415 (0.046415)
2025-10-30,14:15:42 | INFO | Train Epoch: 39 [ 827392/2916352 (28%)] Data (t): 0.085 Batch (t): 0.699, 11850.7/s, 370.335/s/gpu LR: 0.000001 Logit Scale: 88.607 Contrastive_loss: 0.059184 (0.052800) Loss: 0.059184 (0.052800)
2025-10-30,14:16:51 | INFO | Train Epoch: 39 [1646592/2916352 (56%)] Data (t): 0.085 Batch (t): 0.699, 11751.0/s, 367.218/s/gpu LR: 0.000000 Logit Scale: 88.611 Contrastive_loss: 0.048979 (0.051526) Loss: 0.048979 (0.051526)
2025-10-30,14:18:01 | INFO | Train Epoch: 39 [2465792/2916352 (85%)] Data (t): 0.083 Batch (t): 0.697, 12237.6/s, 382.423/s/gpu LR: 0.000000 Logit Scale: 88.612 Contrastive_loss: 0.049657 (0.051059) Loss: 0.049657 (0.051059)
2025-10-30,14:18:39 | INFO | Train Epoch: 39 [2916352/2916352 (100%)] Data (t): 0.082 Batch (t): 0.685, 13228.0/s, 413.375/s/gpu LR: 0.000000 Logit Scale: 88.612 Contrastive_loss: 0.046689 (0.050185) Loss: 0.046689 (0.050185)
Traceback (most recent call last):
  File "/scratch/amlt_code/bin/train.py", line 1081, in <module>
    main(sys.argv[1:])
  File "/scratch/amlt_code/bin/train.py", line 1013, in main
    evaluate(model, data, completed_epoch, args, tb_writer=writer, tokenizer=tokenizer)
  File "/home/aiscuser/.local/lib/python3.10/site-packages/open_clip_train/train.py", line 281, in evaluate
    model_out = model(images, texts)
  File "/opt/conda/envs/ptca/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1751, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/opt/conda/envs/ptca/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1762, in _call_impl
    return forward_call(*args, **kwargs)
  File "/opt/conda/envs/ptca/lib/python3.10/site-packages/torch/nn/parallel/distributed.py", line 1633, in forward
    inputs, kwargs = self._pre_forward(*inputs, **kwargs)
  File "/opt/conda/envs/ptca/lib/python3.10/site-packages/torch/nn/parallel/distributed.py", line 1529, in _pre_forward
    self._sync_buffers()
  File "/opt/conda/envs/ptca/lib/python3.10/site-packages/torch/nn/parallel/distributed.py", line 2166, in _sync_buffers
    self._sync_module_buffers(authoritative_rank)
  File "/opt/conda/envs/ptca/lib/python3.10/site-packages/torch/nn/parallel/distributed.py", line 2170, in _sync_module_buffers
    self._default_broadcast_coalesced(authoritative_rank=authoritative_rank)
  File "/opt/conda/envs/ptca/lib/python3.10/site-packages/torch/nn/parallel/distributed.py", line 2192, in _default_broadcast_coalesced
    self._distributed_broadcast_coalesced(bufs, bucket_size, authoritative_rank)
  File "/opt/conda/envs/ptca/lib/python3.10/site-packages/torch/nn/parallel/distributed.py", line 2107, in _distributed_broadcast_coalesced
    dist._broadcast_coalesced(
RuntimeError: ProcessGroupWrapper: Monitored Barrier encountered error running collective: CollectiveFingerPrint(SequenceNumber=327510, OpType=BROADCAST, TensorShape=[5929], TensorDtypes=Float, TensorDeviceTypes=TensorOptions(dtype=float (default), device=cuda, layout=Strided (default), requires_grad=false (default), pinned_memory=false (default), memory_format=(nullopt))). Error:
[/pytorch/third_party/gloo/gloo/transport/tcp/pair.cc:534] Connection closed by peer [172.25.1.192]:55498