I'm on the main branch, checked out at the latest commit, c67bbab.
My launch command is:
python multigpu.py 30 5
The error is below:
[W socket.cpp:663] [c10d] The client socket has failed to connect to [localhost]:12355 (errno: 99 - Cannot assign requested address).
Traceback (most recent call last):
File "/workspace/distributed/ddp-tutorial-series/multigpu.py", line 104, in<module>
mp.spawn(main, args=(world_size, args.save_every, args.total_epochs, args.batch_size), nprocs=world_size)
File "/root/miniconda3/lib/python3.11/site-packages/torch/multiprocessing/spawn.py", line 246, in spawn
return start_processes(fn, args, nprocs, join, daemon, start_method="spawn")
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/root/miniconda3/lib/python3.11/site-packages/torch/multiprocessing/spawn.py", line 202, in start_processes
while not context.join():
^^^^^^^^^^^^^^
File "/root/miniconda3/lib/python3.11/site-packages/torch/multiprocessing/spawn.py", line 163, in join
raise ProcessRaisedException(msg, error_index, failed_process.pid)
torch.multiprocessing.spawn.ProcessRaisedException:
-- Process 1 terminated with the following error:
Traceback (most recent call last):
File "/root/miniconda3/lib/python3.11/site-packages/torch/multiprocessing/spawn.py", line 74, in _wrap
fn(i, *args)
File "/workspace/distributed/ddp-tutorial-series/multigpu.py", line 90, in main
trainer = Trainer(model, train_data, optimizer, rank, save_every)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/workspace/distributed/ddp-tutorial-series/multigpu.py", line 38, in __init__
self.model = DDP(model, device_ids=[gpu_id])
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/root/miniconda3/lib/python3.11/site-packages/torch/nn/parallel/distributed.py", line 795, in __init__
_verify_param_shape_across_processes(self.process_group, parameters)
File "/root/miniconda3/lib/python3.11/site-packages/torch/distributed/utils.py", line 265, in _verify_param_shape_across_processes
return dist._verify_params_across_processes(process_group, tensors, logger)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
torch.distributed.DistBackendError: NCCL error in: /opt/conda/conda-bld/pytorch_1695392026823/work/torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:1331, unhandled cuda error (run with NCCL_DEBUG=INFO for details), NCCL version 2.18.5
ncclUnhandledCudaError: Call to CUDA function failed.
Last error:
Cuda failure 'invalid argument'
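For context, the localhost:12355 address in the first warning comes from the tutorial's process-group setup. As far as I understand it, the relevant part of multigpu.py looks roughly like this (paraphrased from memory, so details may differ from commit c67bbab):

import os
import torch
from torch.distributed import init_process_group

def ddp_setup(rank: int, world_size: int):
    # Rendezvous address/port that the socket warning above refers to.
    os.environ["MASTER_ADDR"] = "localhost"
    os.environ["MASTER_PORT"] = "12355"
    init_process_group(backend="nccl", rank=rank, world_size=world_size)
    torch.cuda.set_device(rank)

The crash itself happens right after this, when the Trainer wraps the model with DDP(model, device_ids=[gpu_id]) (line 38 of multigpu.py in the traceback above). As the error message suggests, the run can be repeated with NCCL debug logging enabled for more detail:

NCCL_DEBUG=INFO python multigpu.py 30 5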
My environment:
python 3.11.5 h955ad1f_0
pytorch 2.1.0 py3.11_cuda11.8_cudnn8.7.0_0 pytorch
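In case it helps narrow things down, here is a small sanity check (standard torch calls only) that can be run inside the same container to confirm CUDA is available, how many GPUs PyTorch sees, and which NCCL version it was built against:

import torch

# Basic visibility check for the environment the error occurs in.
print("cuda available:", torch.cuda.is_available())
print("gpu count:", torch.cuda.device_count())
print("nccl version:", torch.cuda.nccl.version())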
Thanks in advance :)