Cannot launch DDP training using distributed/ddp-tutorial-series/multigpu.py #1199

@480284856

Description

I'm on the main branch at the latest commit (c67bbab).
My launch command is:

python multigpu.py 30 5

The error is below:

[W socket.cpp:663] [c10d] The client socket has failed to connect to [localhost]:12355 (errno: 99 - Cannot assign requested address).
Traceback (most recent call last):
  File "/workspace/distributed/ddp-tutorial-series/multigpu.py", line 104, in <module>
    mp.spawn(main, args=(world_size, args.save_every, args.total_epochs, args.batch_size), nprocs=world_size)
  File "/root/miniconda3/lib/python3.11/site-packages/torch/multiprocessing/spawn.py", line 246, in spawn
    return start_processes(fn, args, nprocs, join, daemon, start_method="spawn")
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/root/miniconda3/lib/python3.11/site-packages/torch/multiprocessing/spawn.py", line 202, in start_processes
    while not context.join():
              ^^^^^^^^^^^^^^
  File "/root/miniconda3/lib/python3.11/site-packages/torch/multiprocessing/spawn.py", line 163, in join
    raise ProcessRaisedException(msg, error_index, failed_process.pid)
torch.multiprocessing.spawn.ProcessRaisedException: 

-- Process 1 terminated with the following error:
Traceback (most recent call last):
  File "/root/miniconda3/lib/python3.11/site-packages/torch/multiprocessing/spawn.py", line 74, in _wrap
    fn(i, *args)
  File "/workspace/distributed/ddp-tutorial-series/multigpu.py", line 90, in main
    trainer = Trainer(model, train_data, optimizer, rank, save_every)
              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/workspace/distributed/ddp-tutorial-series/multigpu.py", line 38, in __init__
    self.model = DDP(model, device_ids=[gpu_id])
                 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/root/miniconda3/lib/python3.11/site-packages/torch/nn/parallel/distributed.py", line 795, in __init__
    _verify_param_shape_across_processes(self.process_group, parameters)
  File "/root/miniconda3/lib/python3.11/site-packages/torch/distributed/utils.py", line 265, in _verify_param_shape_across_processes
    return dist._verify_params_across_processes(process_group, tensors, logger)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
torch.distributed.DistBackendError: NCCL error in: /opt/conda/conda-bld/pytorch_1695392026823/work/torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:1331, unhandled cuda error (run with NCCL_DEBUG=INFO for details), NCCL version 2.18.5
ncclUnhandledCudaError: Call to CUDA function failed.
Last error:
Cuda failure 'invalid argument'
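The error message itself suggests enabling NCCL's verbose logging for more detail. A minimal way to do that (same launch command as above, just with the environment variable set for that one run):

```shell
# Enable NCCL debug logging, as the error message suggests:
NCCL_DEBUG=INFO python multigpu.py 30 5
```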

My environment:
python 3.11.5 h955ad1f_0
pytorch 2.1.0 py3.11_cuda11.8_cudnn8.7.0_0 pytorch
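One quick check for the earlier `[c10d]` warning about `[localhost]:12355` (errno 99, `Cannot assign requested address`): that warning can appear when `localhost` resolves to an address the process cannot use inside the container. This is only a guess at one possible contributing cause, not a confirmed diagnosis; the snippet just prints what `localhost` resolves to in the current environment:

```python
import socket

def resolve(host="localhost"):
    # Collect the unique addresses getaddrinfo returns for this host;
    # a healthy setup typically includes 127.0.0.1 (and often ::1).
    return sorted({info[4][0] for info in socket.getaddrinfo(host, None)})

print(resolve())
```

If `127.0.0.1` is missing from the output inside the container, the rendezvous warning would be worth investigating separately from the NCCL failure.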

Container launch command:

docker run -itd \
        --ipc=host --ulimit memlock=-1 --ulimit stack=67108864 --gpus=all \
        --shm-size '32g' \
        --volume ${PWD}:/workspace \
        --workdir /workspace \
        --name pytorch_examples \
        nvidia/cuda:11.8.0-devel-ubuntu22.04
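Since the base image is `nvidia/cuda:11.8.0-devel-ubuntu22.04` and the PyTorch build is `cu118`, a first sanity check inside the container (hypothetical session; these are standard CUDA/PyTorch commands, not taken from the issue) would be to confirm the GPUs and CUDA runtime are actually visible to PyTorch:

```shell
# Inside the running container:
nvidia-smi                # are the GPUs and driver visible at all?
python -c "import torch; print(torch.cuda.is_available(), torch.cuda.device_count())"
python -c "import torch; print(torch.version.cuda)"   # should report 11.8 for this build
```

A mismatch here (e.g. `torch.cuda.is_available()` returning `False`, or a device count lower than expected) would point at the container/driver setup rather than the tutorial script.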

Thanks in advance :)
