Description
I am training OpenCLIP in a distributed environment with 4 nodes and 8 GPUs per node. I set epochs to 40, and after training finishes the last epoch (epoch 39) I get a CUDA/distributed error. Note that it trains fine through the last epoch, including the evaluation after each epoch, but it fails when it enters the final evaluation. The traceback is below. Shouldn't the model be unwrapped from DDP for evaluate() when eval only runs on the master rank? I think that while the last evaluation is still running on the master, the other ranks finish and exit, so DDP tries to sync buffers and fails.
I am using PyTorch 2.7.1 with Python 3.10 and CUDA 12.6.
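For reference, here is a minimal sketch of the change I have in mind, assuming evaluate() is only called on the master rank as in my run. The unwrap_model helper and the call site shown are illustrative, not the actual open_clip_train code:

```python
# Minimal sketch (not the actual open_clip_train code): unwrap DDP before evaluation
# so the master rank's forward pass does not trigger a cross-rank buffer broadcast
# while the other ranks may already have exited.
import torch.nn as nn

def unwrap_model(model: nn.Module) -> nn.Module:
    # DistributedDataParallel exposes the wrapped model as .module.
    return model.module if hasattr(model, "module") else model

# Hypothetical call site, mirroring the evaluate() call from the traceback:
# evaluate(unwrap_model(model), data, completed_epoch, args,
#          tb_writer=writer, tokenizer=tokenizer)
```

With the bare module, no collective is issued during eval, so it should not matter whether the other ranks are still alive at that point.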
2025-10-30,14:14:29 | INFO | Start epoch 39
2025-10-30,14:14:32 | INFO | Train Epoch: 39 [ 8192/2916352 (0%)] Data (t): 2.483 Batch (t): 3.019, 2713.79/s, 84.8059/s/gpu LR: 0.000002 Logit Scale: 88.598 Contrastive_loss: 0.046415 (0.046415) Loss: 0.046415 (0.046415)
2025-10-30,14:15:42 | INFO | Train Epoch: 39 [ 827392/2916352 (28%)] Data (t): 0.085 Batch (t): 0.699, 11850.7/s, 370.335/s/gpu LR: 0.000001 Logit Scale: 88.607 Contrastive_loss: 0.059184 (0.052800) Loss: 0.059184 (0.052800)
2025-10-30,14:16:51 | INFO | Train Epoch: 39 [1646592/2916352 (56%)] Data (t): 0.085 Batch (t): 0.699, 11751.0/s, 367.218/s/gpu LR: 0.000000 Logit Scale: 88.611 Contrastive_loss: 0.048979 (0.051526) Loss: 0.048979 (0.051526)
2025-10-30,14:18:01 | INFO | Train Epoch: 39 [2465792/2916352 (85%)] Data (t): 0.083 Batch (t): 0.697, 12237.6/s, 382.423/s/gpu LR: 0.000000 Logit Scale: 88.612 Contrastive_loss: 0.049657 (0.051059) Loss: 0.049657 (0.051059)
2025-10-30,14:18:39 | INFO | Train Epoch: 39 [2916352/2916352 (100%)] Data (t): 0.082 Batch (t): 0.685, 13228.0/s, 413.375/s/gpu LR: 0.000000 Logit Scale: 88.612 Contrastive_loss: 0.046689 (0.050185) Loss: 0.046689 (0.050185)
Traceback (most recent call last):
  File "/scratch/amlt_code/bin/train.py", line 1081, in <module>
    main(sys.argv[1:])
  File "/scratch/amlt_code/bin/train.py", line 1013, in main
    evaluate(model, data, completed_epoch, args, tb_writer=writer, tokenizer=tokenizer)
  File "/home/aiscuser/.local/lib/python3.10/site-packages/open_clip_train/train.py", line 281, in evaluate
    model_out = model(images, texts)
  File "/opt/conda/envs/ptca/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1751, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/opt/conda/envs/ptca/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1762, in _call_impl
    return forward_call(*args, **kwargs)
  File "/opt/conda/envs/ptca/lib/python3.10/site-packages/torch/nn/parallel/distributed.py", line 1633, in forward
    inputs, kwargs = self._pre_forward(*inputs, **kwargs)
  File "/opt/conda/envs/ptca/lib/python3.10/site-packages/torch/nn/parallel/distributed.py", line 1529, in _pre_forward
    self._sync_buffers()
  File "/opt/conda/envs/ptca/lib/python3.10/site-packages/torch/nn/parallel/distributed.py", line 2166, in _sync_buffers
    self._sync_module_buffers(authoritative_rank)
  File "/opt/conda/envs/ptca/lib/python3.10/site-packages/torch/nn/parallel/distributed.py", line 2170, in _sync_module_buffers
    self._default_broadcast_coalesced(authoritative_rank=authoritative_rank)
  File "/opt/conda/envs/ptca/lib/python3.10/site-packages/torch/nn/parallel/distributed.py", line 2192, in _default_broadcast_coalesced
    self._distributed_broadcast_coalesced(bufs, bucket_size, authoritative_rank)
  File "/opt/conda/envs/ptca/lib/python3.10/site-packages/torch/nn/parallel/distributed.py", line 2107, in _distributed_broadcast_coalesced
    dist._broadcast_coalesced(
RuntimeError: ProcessGroupWrapper: Monitored Barrier encountered error running collective: CollectiveFingerPrint(SequenceNumber=327510, OpType=BROADCAST, TensorShape=[5929], TensorDtypes=Float, TensorDeviceTypes=TensorOptions(dtype=float (default), device=cuda, layout=Strided (default), requires_grad=false (default), pinned_memory=false (default), memory_format=(nullopt))). Error:
[/pytorch/third_party/gloo/gloo/transport/tcp/pair.cc:534] Connection closed by peer [172.25.1.192]:55498