multi-node Tensor Parallel #1257

Open
PieterZanders opened this issue May 12, 2024 · 1 comment

PieterZanders commented May 12, 2024

Hello, could you add a new example of tensor parallel + FSDP using a multi-node setup?
Is it possible to do multi-node tensor parallelism with PyTorch 2.3? I am trying to use 2 nodes with 4 GPUs each.
05/12/2024 04:32:52 PM Device Mesh created: device_mesh=DeviceMesh([[0, 1, 2, 3], [4, 5, 6, 7]], mesh_dim_names=('dp', 'tp'))

When I run the existing example on multiple nodes, I get the following errors.

Thank you.
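For reference, a minimal sketch of what the 2-node x 4-GPU setup might look like, assuming a torchrun launch; the script name and rendezvous endpoint are placeholders:

```python
# Hypothetical launch, one invocation per node, e.g.:
#   torchrun --nnodes=2 --nproc_per_node=4 \
#            --rdzv_backend=c10d --rdzv_endpoint=<master-host>:29500 tests.py
import os
import torch
from torch.distributed.device_mesh import init_device_mesh

# Pin each process to its own GPU before any collectives run.
local_rank = int(os.environ["LOCAL_RANK"])
torch.cuda.set_device(local_rank)

# 2 data-parallel replicas (one per node) x 4 tensor-parallel ranks (GPUs per node),
# matching DeviceMesh([[0, 1, 2, 3], [4, 5, 6, 7]], mesh_dim_names=('dp', 'tp')) above.
device_mesh = init_device_mesh("cuda", (2, 4), mesh_dim_names=("dp", "tp"))
```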


as07r1b31:3011779:3012101 [0] init.cc:871 NCCL WARN Duplicate GPU detected : rank 0 and rank 1 both on CUDA device 1b000
as07r1b31:3011783:3012102 [0] init.cc:871 NCCL WARN Duplicate GPU detected : rank 1 and rank 0 both on CUDA device 1b000
as07r1b31:3011782:3012104 [3] init.cc:871 NCCL WARN Duplicate GPU detected : rank 0 and rank 1 both on CUDA device ad000
as07r1b31:3011786:3012107 [3] init.cc:871 NCCL WARN Duplicate GPU detected : rank 1 and rank 0 both on CUDA device ad000
as07r1b31:3011780:3012106 [1] init.cc:871 NCCL WARN Duplicate GPU detected : rank 0 and rank 1 both on CUDA device 2c000
as07r1b31:3011784:3012108 [1] init.cc:871 NCCL WARN Duplicate GPU detected : rank 1 and rank 0 both on CUDA device 2c000
as07r1b31:3011781:3012110 [2] init.cc:871 NCCL WARN Duplicate GPU detected : rank 0 and rank 1 both on CUDA device 9d000
as07r1b31:3011785:3012111 [2] init.cc:871 NCCL WARN Duplicate GPU detected : rank 1 and rank 0 both on CUDA device 9d000

[rank0]: Traceback (most recent call last):
[rank0]:   File "/gpfs/mn4/AE_tp/tests.py", line 91, in <module>
[rank0]:     _, output = sharded_model(inp)
[rank0]:                 ^^^^^^^^^^^^^^^^^^
[rank0]:   File "/home/mn4/AE_tp/mdae2.3/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
[rank0]:     return self._call_impl(*args, **kwargs)
[rank0]:            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]:   File "/home/mn4/AE_tp/mdae2.3/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl
[rank0]:     return forward_call(*args, **kwargs)
[rank0]:            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]:   File "/home/mn4/AE_tp/mdae2.3/lib/python3.12/site-packages/torch/distributed/fsdp/fully_sharded_data_parallel.py", line 843, in forward
[rank0]:     args, kwargs = _pre_forward(
[rank0]:                    ^^^^^^^^^^^^^
[rank0]:   File "/home/mn4/AE_tp/mdae2.3/lib/python3.12/site-packages/torch/distributed/fsdp/_runtime_utils.py", line 380, in _pre_forward
[rank0]:     unshard_fn(state, handle)
[rank0]:   File "/home/mn4/AE_tp/mdae2.3/lib/python3.12/site-packages/torch/distributed/fsdp/_runtime_utils.py", line 415, in _pre_forward_unshard
[rank0]:     _unshard(state, handle, state._unshard_stream, state._pre_unshard_stream)
[rank0]:   File "/home/mn4/AE_tp/mdae2.3/lib/python3.12/site-packages/torch/distributed/fsdp/_runtime_utils.py", line 299, in _unshard
[rank0]:     handle.unshard()
[rank0]:   File "/home/mn4/AE_tp/mdae2.3/lib/python3.12/site-packages/torch/distributed/fsdp/_flat_param.py", line 1308, in unshard
[rank0]:     padded_unsharded_flat_param = self._all_gather_flat_param(unsharded_flat_param)
[rank0]:                                   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]:   File "/home/mn4/AE_tp/mdae2.3/lib/python3.12/site-packages/torch/distributed/fsdp/_flat_param.py", line 1399, in _all_gather_flat_param
[rank0]:     dist.all_gather_into_tensor(
[rank0]:   File "/home/mn4/AE_tp/mdae2.3/lib/python3.12/site-packages/torch/distributed/c10d_logger.py", line 75, in wrapper
[rank0]:     return func(*args, **kwargs)
[rank0]:            ^^^^^^^^^^^^^^^^^^^^^
[rank0]:   File "/home/mn4/AE_tp/mdae2.3/lib/python3.12/site-packages/torch/distributed/distributed_c10d.py", line 2948, in all_gather_into_tensor
[rank0]:     work = group._allgather_base(output_tensor, input_tensor, opts)
[rank0]:            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]: torch.distributed.DistBackendError: NCCL error in: /opt/conda/conda-bld/pytorch_1712608847532/work/torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:1970, invalid usage (run with NCCL_DEBUG=WARN for details), NCCL version 2.20.5
[rank0]: ncclInvalidUsage: This usually reflects invalid usage of NCCL library.
[rank0]: Last error:
[rank0]: Duplicate GPU detected : rank 0 and rank 1 both on CUDA device 1b000
[same on other ranks]

Traceback (most recent call last):
  File "/home/mn4/AE_tp/mdae2.3/bin/torchrun", line 33, in <module>
    sys.exit(load_entry_point('torch==2.3.0', 'console_scripts', 'torchrun')())
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/mn4/AE_tp/mdae2.3/lib/python3.12/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 347, in wrapper
    return f(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^
  File "/home/mn4/AE_tp/mdae2.3/lib/python3.12/site-packages/torch/distributed/run.py", line 879, in main
    run(args)
  File "/home/mn4/AE_tp/mdae2.3/lib/python3.12/site-packages/torch/distributed/run.py", line 870, in run
    elastic_launch(
  File "/home/mn4/AE_tp/mdae2.3/lib/python3.12/site-packages/torch/distributed/launcher/api.py", line 132, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/mn4/AE_tp/mdae2.3/lib/python3.12/site-packages/torch/distributed/launcher/api.py", line 263, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError: 
============================================================
tests.py FAILED
------------------------------------------------------------
Failures:
[1]:
  time      : 2024-05-12_16:33:02
  host      : as07r1b31.bsc.mn
  rank      : 1 (local_rank: 1)
  exitcode  : 1 (pid: 3011780)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
[2]:
  time      : 2024-05-12_16:33:02
  host      : as07r1b31.bsc.mn
  rank      : 2 (local_rank: 2)
  exitcode  : 1 (pid: 3011781)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
[3]:
  time      : 2024-05-12_16:33:02
  host      : as07r1b31.bsc.mn
  rank      : 3 (local_rank: 3)
  exitcode  : 1 (pid: 3011782)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
[4]:
  time      : 2024-05-12_16:33:02
  host      : as07r1b31.bsc.mn
  rank      : 4 (local_rank: 4)
  exitcode  : 1 (pid: 3011783)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
[5]:
  time      : 2024-05-12_16:33:02
  host      : as07r1b31.bsc.mn
  rank      : 5 (local_rank: 5)
  exitcode  : 1 (pid: 3011784)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
[6]:
  time      : 2024-05-12_16:33:02
  host      : as07r1b31.bsc.mn
  rank      : 6 (local_rank: 6)
  exitcode  : 1 (pid: 3011785)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
[7]:
  time      : 2024-05-12_16:33:02
  host      : as07r1b31.bsc.mn
  rank      : 7 (local_rank: 7)
  exitcode  : 1 (pid: 3011786)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
------------------------------------------------------------
Root Cause (first observed failure):
[0]:
  time      : 2024-05-12_16:33:02
  host      : as07r1b31.bsc.mn
  rank      : 0 (local_rank: 0)
  exitcode  : 1 (pid: 3011779)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
============================================================

Lucazzn commented Nov 5, 2024

Are you sure you passed tp_mesh = device_mesh['tp'] to parallelize_module? I hit the same error earlier; parallelize_module only accepts a 1-D device mesh.
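A minimal sketch of that pattern, assuming the 2-D ('dp', 'tp') mesh from above; `model` and the submodule names in the parallelize plan ("net1", "net2") are placeholders:

```python
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP
from torch.distributed.tensor.parallel import (
    ColwiseParallel,
    RowwiseParallel,
    parallelize_module,
)

tp_mesh = device_mesh["tp"]  # 1-D slice: the 4 GPUs within a node
dp_mesh = device_mesh["dp"]  # 1-D slice: one rank per node

# parallelize_module expects a 1-D mesh, so pass the 'tp' slice, not the full 2-D mesh.
model = parallelize_module(
    model,
    tp_mesh,
    {"net1": ColwiseParallel(), "net2": RowwiseParallel()},
)

# FSDP then shards across the 'dp' dimension only.
sharded_model = FSDP(model, device_mesh=dp_mesh, use_orig_params=True)
```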
