
Add Multi-Gpu Support to Direct Python Bindings#4689

Merged
rdspring1 merged 3 commits into main from direct_tp on Jul 8, 2025
Conversation

@rdspring1 rdspring1 added Python API Issues related to the Python API Direct Bindings Python extension with direct mapping to NvFuser CPP objects. Thunder-Inference-Demo labels Jun 27, 2025
@rdspring1 rdspring1 requested a review from wujingyue June 27, 2025 05:12
@github-actions

github-actions bot commented Jun 27, 2025

Review updated until commit fcfc80d

Description

  • Added Multi-GPU support to Direct Python Bindings

  • Introduced new files for multi-device features

  • Updated bindings to include multi-device functionalities

  • Removed redundant code from frontend module


Changes walkthrough 📝

Relevant files

Enhancement (11 files)
- distributed_tensor.cpp: Added Sharding and getOutputShardings functions (+70/-0)
- bindings.cpp: Included multidevice/communicator and bindMultiDevice (+5/-0)
- enum.cpp: Added ParallelType and CommunicatorBackend enums (+20/-0)
- ir.cpp: Added parallelize, get_loop_domain, split, set_allocation_domain, set_device_mesh methods (+83/-0)
- multidevice.cpp: Added bindings for Communicator, DeviceMesh, and Sharding (+128/-0)
- runtime.cpp: Added inputs, outputs, and get_output_shardings methods (+51/-0)
- multidevice_bindings.cpp: Updated DeviceMesh and Sharding bindings to module_local (+3/-2)
- __init__.py: Added execute_with_dtensors function (+36/-0)
- executor.h: Included runtime/fusion_kernel_runtime.h (+1/-0)
- distributed_tensor.h: Updated namespace and included additional headers (+9/-2)
- bindings.h: Added bindMultiDevice declaration (+3/-0)

Cleanup (3 files)
- distributed_tensor.cpp: Removed redundant Sharding functions (+0/-35)
- fusion_definition.cpp: Removed redundant getOutputShardings function (+1/-40)
- fusion_definition.h: Updated includes and removed redundant header (+1/-1)

Configuration changes (1 file)
- CMakeLists.txt: Updated source files list for multi-device support (+2/-1)

PR Reviewer Guide 🔍

Here are some key observations to aid the review process:

🧪 No relevant tests
⚡ Recommended focus areas for review

Possible Issue
The execute_with_dtensors function uses self.execute, which is not defined in the FusionDefinition class. It should likely use fd.execute instead.
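A minimal sketch of the reported fix. The function signature and the DTensor unwrapping here are assumptions based on the review comment, not the PR's actual implementation: the point is only that a free function receives the FusionDefinition as `fd` and must call `fd.execute`, since no `self` is in scope.

```python
# Hedged sketch: `execute_with_dtensors` is a free function, so it calls
# fd.execute rather than self.execute. The to_local() unwrapping is
# illustrative; the real binding also handles output resharding.
def execute_with_dtensors(fd, in_dtensors):
    # Unwrap each DTensor to its local shard if it supports to_local();
    # plain tensors pass through unchanged.
    local_inputs = [
        t.to_local() if hasattr(t, "to_local") else t for t in in_dtensors
    ]
    return fd.execute(local_inputs)
```

Any object exposing an `execute(list) -> list` method works as `fd` here, which also makes the helper easy to unit-test with a stub.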

Performance Concern

The get_output_shardings method in runtime.cpp assumes that the number of output shardings matches the number of outputs. This assumption should be validated and documented, especially in cases where outputs might not be sharded.

str
    The scheduled intermediate representation (IR) as a string.

Notes
-----
- Returns None if execution has not occurred yet.
)")
      .def(
          "get_output_shardings",
          [](FusionExecutorCache& self) {
            Fusion* fusion = self.getMostRecentKernelRuntime()
                                 ->fusionSegments()
                                 ->completeFusion();
            std::vector<Sharding> output_shardings = getOutputShardings(fusion);
            NVF_ERROR(
                output_shardings.empty() ||
                    std::ssize(output_shardings) ==
                        (int64_t)fusion->outputs().size(),
                "Found ",
                std::ssize(output_shardings),
                " output shardings but expected ",
                fusion->outputs().size(),
                " or 0.");
            return output_shardings;
          },
Code Duplication

The getOutputShardings function is duplicated in both distributed_tensor.cpp and fusion_definition.cpp. This should be refactored to avoid code duplication.

}

std::vector<Sharding> getOutputShardings(Fusion* fusion) {
  std::vector<TensorView*> all_tvs = fusion->allTvs();
  if (std::none_of(
          all_tvs.begin(),
          all_tvs.end(),
          std::mem_fn(&TensorView::hasDeviceMesh))) {
    return {};
  }

  std::vector<Sharding> output_shardings;

@rdspring1 rdspring1 changed the title from "Add MultiGpu Support to Direct Python Bindings" to "Add Multi-Gpu Support to Direct Python Bindings" Jun 27, 2025
# All rights reserved.
# SPDX-License-Identifier: BSD-3-Clause

import nvfuser_direct as nvfd
Collaborator

Now I have to maintain two conftests.py and two sets of tests that are mostly identical. What can we do to make my life easier?

(I guess this is one of the main problems with the flipping-a-big-switch approach. I've no idea why we didn't take a more incremental approach for Python direct, but I believe you have thought about ways to mitigate the negative impact.)

Collaborator Author

I've no idea why we didn't take a more incremental approach for Python direct

😆 I didn't either, but I don't want to integrate the lru_cache workaround into Thunder. It is probably sufficient for the llama4 demo.

Option 1: Create FusionRecord for multi-device primitives so it exists in FusionCache
Option 2: Use direct bindings

Option 2 is more forward looking.

What can we do to make my life easier?

The foundation for direct bindings is already merged.
Adding bindings except reshape is easy.
I was going to translate the tests in tests/python/multidevice to direct bindings.

Collaborator

Option 1: Create FusionRecord for multi-device primitives so it exists in FusionCache
Option 2: Use direct bindings

I believe there's also an option 3 -- remove FusionCache so different FusionDefinition instances never conflict. That was the initial solution I had in mind for #4507. I forgot why it wasn't considered. Was it that nobody bothered to delete legacy code? Instead, direct bindings came as a more expensive but more "forward looking" solution to replace the legacy FusionRecord.

Collaborator

I was going to translate the tests in tests/python/multidevice to direct bindings.

Why can't direct bindings tests stay in tests/python/multidevice? This way, at least, we don't have to reinvent multidevice fixtures for testing. Actual tests can probably be largely reused. IIUC, they differ in multidevice schedules.

Collaborator Author

option 3 -- remove FusionCache so different FusionDefinition instances never conflict.

I see the value of option 3, e.g., you don't have to rebind all the operations.
IIRC, @kevinstephano wanted to keep the FusionRecords.

Collaborator Author

Why can't direct bindings tests stay in tests/python/multidevice?

So, we want to parameterize multidevice_test to use existing python_frontend and direct_bindings?
Then, on the tests that direct_bindings can run correctly, test both configurations.

Collaborator Author

It is kind of painful to combine them. Maybe do this refactor after #4701.
I'm seeing this exception because nvfuser and nvfuser_direct both define cleanup.

terminate called after throwing an instance of 'c10d::SocketError'
  what():  The server socket has failed to listen on any local network address. port: 29542, useIpv6: false, code: -98, name: EADDRINUSE, message: address already in use
Exception raised from makeWithPort at /opt/pytorch/pytorch/torch/csrc/distributed/c10d/TCPStoreLibUvBackend.cpp:307 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >) + 0xbc (0xf906fefedaec in /usr/local/lib/python3.12/dist-packages/torch/lib/libc10.so)
frame #1: <unknown function> + 0x5a2e5b0 (0xf9072c2ce5b0 in /usr/local/lib/python3.12/dist-packages/torch/lib/libtorch_cpu.so)
frame #2: <unknown function> + 0x5a4729c (0xf9072c2e729c in /usr/local/lib/python3.12/dist-packages/torch/lib/libtorch_cpu.so)
frame #3: <unknown function> + 0x5a4c1dc (0xf9072c2ec1dc in /usr/local/lib/python3.12/dist-packages/torch/lib/libtorch_cpu.so)
frame #4: <unknown function> + 0x5a4e65c (0xf9072c2ee65c in /usr/local/lib/python3.12/dist-packages/torch/lib/libtorch_cpu.so)
frame #5: <unknown function> + 0x5a31ca4 (0xf9072c2d1ca4 in /usr/local/lib/python3.12/dist-packages/torch/lib/libtorch_cpu.so)
frame #6: c10d::TCPStore::TCPStore(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, c10d::TCPStoreOptions const&) + 0x128 (0xf9072c2d5868 in /usr/local/lib/python3.12/dist-packages/torch/lib/libtorch_cpu.so)
frame #7: <unknown function> + 0x7ea34c (0xf904ff50a34c in /opt/pytorch/nvfuser/python/nvfuser/_C.cpython-312-aarch64-linux-gnu.so)
frame #8: <unknown function> + 0x7eabbc (0xf904ff50abbc in /opt/pytorch/nvfuser/python/nvfuser/_C.cpython-312-aarch64-linux-gnu.so)
frame #9: nvfuser::python_frontend::cleanup() + 0xc (0xf904fee6a3ec in 
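The EADDRINUSE failure above happens because both runtimes try to bind a TCPStore on the same port. One common mitigation (an assumption on my part, not what this PR does) is to let the OS pick an unused ephemeral port before constructing the store:

```python
import socket

def find_free_port() -> int:
    # Bind to port 0 so the OS assigns an unused ephemeral port, then
    # release the socket and return the chosen port number.
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
        s.bind(("127.0.0.1", 0))
        return s.getsockname()[1]
```

Note this only narrows the race; the PR's actual resolution (below) was to keep only one of the modules importable at a time.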

Collaborator

So, we want to parameterize multidevice_test to use existing python_frontend and direct_bindings?

Yes, a pytest.fixture can take params: https://docs.pytest.org/en/stable/how-to/fixtures.html#fixture-parametrize
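A minimal sketch of what a parameterized fixture could look like. The backend names and fixture name are assumptions; only the pytest mechanics (fixture `params` plus `request.param`) are standard.

```python
# Hedged sketch: run each multidevice test once per Python frontend.
import pytest

BACKENDS = ["python_frontend", "direct_bindings"]  # assumed names

@pytest.fixture(params=BACKENDS)
def multidevice_backend(request):
    # Each test requesting this fixture runs once per entry in BACKENDS.
    return request.param

def test_define_fusion(multidevice_backend):
    assert multidevice_backend in BACKENDS
```

pytest would then collect `test_define_fusion[python_frontend]` and `test_define_fusion[direct_bindings]` as separate test items.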

I noticed some minor differences between legacy and direct, e.g., nvfuser.Communicator.instance() vs nvfd.multidevice.Communicator.instance(). If too annoying, I'd put Communicator in nvfd top-level for now and refactor after the giant switch.

Then, on the tests that direct_bindings can run correctly, test both configurations.

I'm less worried about duplicating the tests. There's a balance between DRY and DAMP: https://testing.googleblog.com/2019/12/testing-on-toilet-tests-too-dry-make.html. I have certainly seen tests too DRY to even understand, e.g.,

mesh0, mesh1, is_stage0_sharded, is_stage1_sharded, do_reduction, sharded_dim, ...

For multi-GPU tests, I think we can parameterize the tests without making them terrible to understand. For example,

def define_foo_math(fd: FusionDefinition):
    ...

def define_foo_multidevice_schedule(fd: FusionDefinition):
    ...

# For legacy bindings
class Model(FusionDefinition):
    def definition(self):
        define_foo_math(self)

    def multidevice_schedule(self):
        define_foo_multidevice_schedule(self)

model = Model()
model.execute(...)

# For direct bindings
with FusionDefinition() as fd:
    define_foo_math(fd)
    define_foo_multidevice_schedule(fd)

fd.execute(...)

Either way, I've no idea how you plan to migrate the tests (multi-GPU or single-GPU). That's more concerning. Do you plan to duplicate/parameterize all of them, or only until you've gained enough confidence in direct bindings? When do we flip the switch to make nvfd the default? When will nvFuser engineers start to write nvfd tests? When will we have to maintain both nvfuser tests and nvfd tests?

Collaborator

Is it crazy to add just the bindings in the PR and keep the tests in a separate branch? This way, we can unblock Kshiteej without addressing my concerns on the tests. After the crunch, we can certainly discuss the legacy-to-direct migration plan.

@kshitij12345
Contributor

Not for this PR but it would be nice to have repro_script_for similar to the existing one -

def repro_script_for(self, inputs: list | None = None) -> str:
    msg = "# CUDA devices:\n"

@rdspring1
Collaborator Author

It should be straightforward to add repro_script_for for single-GPU. Multi-GPU support is less explored.
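For illustration, here is a hypothetical sketch of the device-listing header such a `repro_script_for` could emit. `device_names` stands in for what the real method would query from torch.cuda; the format mirrors the `"# CUDA devices:\n"` prefix in the snippet above.

```python
# Hypothetical sketch: build the "# CUDA devices:" header of a repro script
# from an already-collected list of device names.
def repro_header(device_names: list[str]) -> str:
    msg = "# CUDA devices:\n"
    for i, name in enumerate(device_names):
        msg += f"#  {i}: {name}\n"
    return msg
```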

@rdspring1
Collaborator Author

!test

@rdspring1 rdspring1 marked this pull request as ready for review July 7, 2025 16:49
@rdspring1
Collaborator Author

@wujingyue I pushed the multi-device python tests further down the stack, so they are not included in this PR.

Collaborator

@wujingyue wujingyue left a comment

LGTM otherwise

assert (
    "nvfuser" not in sys.modules
), "Cannot import nvfuser_direct if nvfuser module is already imported."
if "nvfuser" in sys.modules:
Collaborator

Are you sure about this change? Importing both triggers a non-deterministic cleanup order (http://nv/eMu). Other singletons (e.g. FusionProfiler) will probably fail in the same way. I'd rather we enforce the one-nvfuser-at-a-time rule for the main branch. (I don't care about that for the inference demo though, which will likely use a separate branch).
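The one-nvfuser-at-a-time rule from the snippet above can be sketched as a small guard. The function name is an assumption for illustration; the diff itself inlines the check with an `assert`.

```python
import sys

# Hedged sketch of the import guard: refuse to load one extension module
# when the conflicting one is already present, since both register
# singletons (e.g. Communicator, FusionProfiler) whose cleanup order
# would otherwise be non-deterministic.
def ensure_not_imported(conflicting: str, importer: str) -> None:
    if conflicting in sys.modules:
        raise ImportError(
            f"Cannot import {importer} if {conflicting} module is already imported."
        )
```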

Collaborator Author

You asked me to translate the existing python tests from legacy to direct bindings without duplicating or separating the tests.

I can either duplicate the tests and keep the modules separate OR import both modules.

Collaborator

I've never tried this so I could be terribly wrong: Can you import nvfuser/nvfuser_direct before importing opinfo? Then opinfo can check which backend was imported via sys.modules.

That said, I don't think it's too bad to import both for the tests. Single-GPU tests don't create Communicators, so they won't trigger the port-in-use error. Multi-GPU tests use very little testing infra -- no opinfo or NVFuserTest. So it should be easy to make sure only one gets imported. How does that sound?

Collaborator Author

I removed these changes. I can easily cherry-pick #4722 if necessary later on.

@rdspring1
Collaborator Author

!test

@rdspring1 rdspring1 merged commit 43e586f into main Jul 8, 2025
49 of 52 checks passed
@rdspring1 rdspring1 deleted the direct_tp branch July 8, 2025 18:08
jacobhinkle added a commit that referenced this pull request Jul 21, 2025
I think this was just part of a refactor from a method to a separate
function in #4689.

Fixes #4806
rdspring1 added a commit that referenced this pull request Jul 22, 2025
This PR adds support for cast operations to Direct Python Bindings.

PR Stack:
- #4689
- #4697  **<<< This PR.**
- #4698
- #4704
- #4701
- #4809
rdspring1 added a commit that referenced this pull request Jul 22, 2025
This PR adds support for matmul and linear ops to Direct Python
Bindings.

PR Stack:
- #4689
- #4697
- #4698 **<<< This PR.**
- #4704
- #4701
- #4809
rdspring1 added a commit that referenced this pull request Jul 22, 2025
…ings (#4704)

This PR adds size, shape, define_vector, and reshape ops to direct
bindings.

PR Stack:
- #4689
- #4697
- #4698
- #4704 **<<< This PR.**
- #4701 
- #4809
rdspring1 added a commit that referenced this pull request Jul 22, 2025
This PR adds support for replacing linear layers with TensorParallel
NvFuser layer in deepseek model using Direct Python Bindings.

PR Stack:
- #4689
- #4697
- #4698
- #4704
- #4701 **<<< This PR.**
- #4809
nsarka pushed a commit to nsarka/Fuser that referenced this pull request Jul 28, 2025
This PR adds Multi-GPU support to Direct Python Bindings.

PR Stack:

- NVIDIA#4689 **<<< This PR.**
- NVIDIA#4697
- NVIDIA#4698
- NVIDIA#4704 
- NVIDIA#4701 

cc: @kshitij12345
nsarka pushed a commit to nsarka/Fuser that referenced this pull request Jul 28, 2025
I think this was just part of a refactor from a method to a separate
function in NVIDIA#4689.

Fixes NVIDIA#4806
nsarka pushed a commit to nsarka/Fuser that referenced this pull request Jul 28, 2025
This PR adds support for cast operations to Direct Python Bindings.

PR Stack:
- NVIDIA#4689
- NVIDIA#4697  **<<< This PR.**
- NVIDIA#4698
- NVIDIA#4704
- NVIDIA#4701
- NVIDIA#4809
nsarka pushed a commit to nsarka/Fuser that referenced this pull request Jul 28, 2025
…IA#4698)

This PR adds support for matmul and linear ops to Direct Python
Bindings.

PR Stack:
- NVIDIA#4689
- NVIDIA#4697
- NVIDIA#4698 **<<< This PR.**
- NVIDIA#4704
- NVIDIA#4701
- NVIDIA#4809
nsarka pushed a commit to nsarka/Fuser that referenced this pull request Jul 28, 2025
…ings (NVIDIA#4704)

This PR adds size, shape, define_vector, and reshape ops to direct
bindings.

PR Stack:
- NVIDIA#4689
- NVIDIA#4697
- NVIDIA#4698
- NVIDIA#4704 **<<< This PR.**
- NVIDIA#4701 
- NVIDIA#4809
nsarka pushed a commit to nsarka/Fuser that referenced this pull request Jul 28, 2025
)

This PR adds support for replacing linear layers with TensorParallel
NvFuser layer in deepseek model using Direct Python Bindings.

PR Stack:
- NVIDIA#4689
- NVIDIA#4697
- NVIDIA#4698
- NVIDIA#4704
- NVIDIA#4701 **<<< This PR.**
- NVIDIA#4809