
Add Multi-Gpu Support to Direct Python Bindings#4689

Merged
rdspring1 merged 3 commits into main from direct_tp on Jul 8, 2025
Conversation

@rdspring1 rdspring1 added Python API Issues related to the Python API Direct Bindings Python extension with direct mapping to NvFuser CPP objects. Thunder-Inference-Demo labels Jun 27, 2025
@rdspring1 rdspring1 requested a review from wujingyue June 27, 2025 05:12
@github-actions

github-actions bot commented Jun 27, 2025

Review updated until commit fcfc80d

Description

  • Added Multi-GPU support to Direct Python Bindings

  • Introduced new files for multi-device features

  • Updated bindings to include multi-device functionalities

  • Removed redundant code from frontend module


Changes walkthrough 📝

Relevant files

Enhancement (11 files)
- distributed_tensor.cpp: Added Sharding and getOutputShardings functions (+70/-0)
- bindings.cpp: Included multidevice/communicator and bindMultiDevice (+5/-0)
- enum.cpp: Added ParallelType and CommunicatorBackend enums (+20/-0)
- ir.cpp: Added parallelize, get_loop_domain, split, set_allocation_domain, set_device_mesh methods (+83/-0)
- multidevice.cpp: Added bindings for Communicator, DeviceMesh, and Sharding (+128/-0)
- runtime.cpp: Added inputs, outputs, and get_output_shardings methods (+51/-0)
- multidevice_bindings.cpp: Updated DeviceMesh and Sharding bindings to module_local (+3/-2)
- __init__.py: Added execute_with_dtensors function (+36/-0)
- executor.h: Included runtime/fusion_kernel_runtime.h (+1/-0)
- distributed_tensor.h: Updated namespace and included additional headers (+9/-2)
- bindings.h: Added bindMultiDevice declaration (+3/-0)

Cleanup (3 files)
- distributed_tensor.cpp: Removed redundant Sharding functions (+0/-35)
- fusion_definition.cpp: Removed redundant getOutputShardings function (+1/-40)
- fusion_definition.h: Updated includes and removed redundant header (+1/-1)

Configuration changes (1 file)
- CMakeLists.txt: Updated source files list for multi-device support (+2/-1)

PR Reviewer Guide 🔍

Here are some key observations to aid the review process:

🧪 No relevant tests
⚡ Recommended focus areas for review

Possible Issue
The execute_with_dtensors function uses self.execute, which is not defined in the FusionDefinition class. It should likely use fd.execute instead.
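A minimal sketch of the reported fix. The function signature and the DTensor unwrapping here are assumptions based on the review comment, not the PR's actual implementation: the point is only that a free function receives the FusionDefinition as `fd` and must call `fd.execute`, since no `self` is in scope.

```python
# Hedged sketch: `execute_with_dtensors` is a free function, so it calls
# fd.execute rather than self.execute. The to_local() unwrapping is
# illustrative; the real binding also handles output resharding.
def execute_with_dtensors(fd, in_dtensors):
    # Unwrap each DTensor to its local shard if it supports to_local();
    # plain tensors pass through unchanged.
    local_inputs = [
        t.to_local() if hasattr(t, "to_local") else t for t in in_dtensors
    ]
    return fd.execute(local_inputs)
```

Any object exposing an `execute(list) -> list` method works as `fd` here, which also makes the helper easy to unit-test with a stub.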

Performance Concern

The get_output_shardings method in runtime.cpp assumes that the number of output shardings matches the number of outputs. This assumption should be validated and documented, especially in cases where outputs might not be sharded.

str
    The scheduled intermediate representation (IR) as a string.

Notes
-----
- Returns None if execution has not occurred yet.
)")
      .def(
          "get_output_shardings",
          [](FusionExecutorCache& self) {
            Fusion* fusion = self.getMostRecentKernelRuntime()
                                 ->fusionSegments()
                                 ->completeFusion();
            std::vector<Sharding> output_shardings = getOutputShardings(fusion);
            NVF_ERROR(
                output_shardings.empty() ||
                    std::ssize(output_shardings) ==
                        (int64_t)fusion->outputs().size(),
                "Found ",
                std::ssize(output_shardings),
                " output shardings but expected ",
                fusion->outputs().size(),
                " or 0.");
            return output_shardings;
          },
Code Duplication

The getOutputShardings function is duplicated in both distributed_tensor.cpp and fusion_definition.cpp. This should be refactored to avoid code duplication.

}

std::vector<Sharding> getOutputShardings(Fusion* fusion) {
  std::vector<TensorView*> all_tvs = fusion->allTvs();
  if (std::none_of(
          all_tvs.begin(),
          all_tvs.end(),
          std::mem_fn(&TensorView::hasDeviceMesh))) {
    return {};
  }

  std::vector<Sharding> output_shardings;

@rdspring1 rdspring1 changed the title from "Add MultiGpu Support to Direct Python Bindings" to "Add Multi-Gpu Support to Direct Python Bindings" Jun 27, 2025
# All rights reserved.
# SPDX-License-Identifier: BSD-3-Clause

import nvfuser_direct as nvfd
Collaborator

Now I have to maintain two conftests.py and two sets of tests that are mostly identical. What can we do to make my life easier?

(I guess this is one of the main problems with the flipping-a-big-switch approach. I've no idea why we didn't take a more incremental approach for Python direct, but I believe you have thought about ways to mitigate the negative impact.)

Collaborator Author

I've no idea why we didn't take a more incremental approach for Python direct

😆 I didn't either, but I don't want to integrate the lru_cache workaround into Thunder. It is probably sufficient for the llama4 demo.

Option 1: Create FusionRecord for multi-device primitives so it exists in FusionCache
Option 2: Use direct bindings

Option 2 is more forward looking.

What can we do to make my life easier?

The foundation for direct bindings is already merged.
Adding bindings except reshape is easy.
I was going to translate the tests in tests/python/multidevice to direct bindings.

Collaborator

Option 1: Create FusionRecord for multi-device primitives so it exists in FusionCache
Option 2: Use direct bindings

I believe there's also an option 3 -- remove FusionCache so different FusionDefinition instances never conflict. That was the initial solution I had in mind for #4507. I forgot why it wasn't considered. Was it that nobody bothered to delete legacy code? Instead, direct bindings came as a more expensive but more "forward looking" solution to replace the legacy FusionRecord.

Collaborator

I was going to translate the tests in tests/python/multidevice to direct bindings.

Why can't direct bindings tests stay in tests/python/multidevice? This way, at least, we don't have to reinvent multidevice fixtures for testing. Actual tests can probably be largely reused. IIUC, they differ in multidevice schedules.

Collaborator Author

option 3 -- remove FusionCache so different FusionDefinition instances never conflict.

I see the value of option 3, e.g., you don't have to rebind all the operations.
IIRC, @kevinstephano wanted to keep the FusionRecords.

Collaborator Author

Why can't direct bindings tests stay in tests/python/multidevice?

So, we want to parameterize multidevice_test to use existing python_frontend and direct_bindings?
Then, on the tests that direct_bindings can run correctly, test both configurations.

Collaborator Author

It is kind of painful to combine them. Maybe do this refactor after #4701.
I'm seeing this exception because nvfuser and nvfuser_direct both define cleanup.

terminate called after throwing an instance of 'c10d::SocketError'
  what():  The server socket has failed to listen on any local network address. port: 29542, useIpv6: false, code: -98, name: EADDRINUSE, message: address already in use
Exception raised from makeWithPort at /opt/pytorch/pytorch/torch/csrc/distributed/c10d/TCPStoreLibUvBackend.cpp:307 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >) + 0xbc (0xf906fefedaec in /usr/local/lib/python3.12/dist-packages/torch/lib/libc10.so)
frame #1: <unknown function> + 0x5a2e5b0 (0xf9072c2ce5b0 in /usr/local/lib/python3.12/dist-packages/torch/lib/libtorch_cpu.so)
frame #2: <unknown function> + 0x5a4729c (0xf9072c2e729c in /usr/local/lib/python3.12/dist-packages/torch/lib/libtorch_cpu.so)
frame #3: <unknown function> + 0x5a4c1dc (0xf9072c2ec1dc in /usr/local/lib/python3.12/dist-packages/torch/lib/libtorch_cpu.so)
frame #4: <unknown function> + 0x5a4e65c (0xf9072c2ee65c in /usr/local/lib/python3.12/dist-packages/torch/lib/libtorch_cpu.so)
frame #5: <unknown function> + 0x5a31ca4 (0xf9072c2d1ca4 in /usr/local/lib/python3.12/dist-packages/torch/lib/libtorch_cpu.so)
frame #6: c10d::TCPStore::TCPStore(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, c10d::TCPStoreOptions const&) + 0x128 (0xf9072c2d5868 in /usr/local/lib/python3.12/dist-packages/torch/lib/libtorch_cpu.so)
frame #7: <unknown function> + 0x7ea34c (0xf904ff50a34c in /opt/pytorch/nvfuser/python/nvfuser/_C.cpython-312-aarch64-linux-gnu.so)
frame #8: <unknown function> + 0x7eabbc (0xf904ff50abbc in /opt/pytorch/nvfuser/python/nvfuser/_C.cpython-312-aarch64-linux-gnu.so)
frame #9: nvfuser::python_frontend::cleanup() + 0xc (0xf904fee6a3ec in 
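The EADDRINUSE failure above happens because both runtimes try to bind a TCPStore on the same port. One common mitigation (an assumption on my part, not what this PR does) is to let the OS pick an unused ephemeral port before constructing the store:

```python
import socket

def find_free_port() -> int:
    # Bind to port 0 so the OS assigns an unused ephemeral port, then
    # release the socket and return the chosen port number.
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
        s.bind(("127.0.0.1", 0))
        return s.getsockname()[1]
```

Note this only narrows the race; the PR's actual resolution (below) was to keep only one of the modules importable at a time.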

Collaborator

So, we want to parameterize multidevice_test to use existing python_frontend and direct_bindings?

Yes, a pytest.fixture can take params: https://docs.pytest.org/en/stable/how-to/fixtures.html#fixture-parametrize
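A minimal sketch of what a parameterized fixture could look like. The backend names and fixture name are assumptions; only the pytest mechanics (fixture `params` plus `request.param`) are standard.

```python
# Hedged sketch: run each multidevice test once per Python frontend.
import pytest

BACKENDS = ["python_frontend", "direct_bindings"]  # assumed names

@pytest.fixture(params=BACKENDS)
def multidevice_backend(request):
    # Each test requesting this fixture runs once per entry in BACKENDS.
    return request.param

def test_define_fusion(multidevice_backend):
    assert multidevice_backend in BACKENDS
```

pytest would then collect `test_define_fusion[python_frontend]` and `test_define_fusion[direct_bindings]` as separate test items.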

I noticed some minor differences between legacy and direct, e.g., nvfuser.Communicator.instance() vs nvfd.multidevice.Communicator.instance(). If too annoying, I'd put Communicator in nvfd top-level for now and refactor after the giant switch.

Then, on the tests that direct_bindings can run correctly, test both configurations.

I'm less worried about duplicating the tests. There's a balance between DRY and DAMP: https://testing.googleblog.com/2019/12/testing-on-toilet-tests-too-dry-make.html. I have certainly seen tests too DRY to even understand, e.g.,

mesh0, mesh1, is_stage0_sharded, is_stage1_sharded, do_reduction, sharded_dim, ...

For multi-GPU tests, I think we can parameterize the tests without making them terrible to understand. For example,

def define_foo_math(fd: FusionDefinition):
    ...

def define_foo_multidevice_schedule(fd: FusionDefinition):
    ...

# For legacy bindings
class Model(FusionDefinition):
    def definition(self):
        define_foo_math(self)

    def multidevice_schedule(self):
        define_foo_multidevice_schedule(self)

model = Model()
model.execute(...)

# For direct bindings
with FusionDefinition() as fd:
    define_foo_math(fd)
    define_foo_multidevice_schedule(fd)

fd.execute(...)

Either way, I've no idea how you plan to migrate the tests (multi-GPU or single-GPU). That's more concerning. Do you plan to duplicate/parameterize all of them, or only until you've gained enough confidence in direct bindings? When do we flip the switch to make nvfd the default? When will nvFuser engineers start to write nvfd tests? When will we have to maintain both nvfuser tests and nvfd tests?

Collaborator

Is it crazy to add just the bindings in the PR and keep the tests in a separate branch? This way, we can unblock Kshiteej without addressing my concerns on the tests. After the crunch, we can certainly discuss the legacy-to-direct migration plan.

@kshitij12345
Contributor

Not for this PR but it would be nice to have repro_script_for similar to the existing one -

def repro_script_for(self, inputs: list | None = None) -> str:
    msg = "# CUDA devices:\n"

@rdspring1
Collaborator Author

It should be straightforward to add repro_script_for for single-GPU. Multi-GPU support is less explored.
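For illustration, here is a hypothetical sketch of the device-listing header such a `repro_script_for` could emit. `device_names` stands in for what the real method would query from torch.cuda; the format mirrors the `"# CUDA devices:\n"` prefix in the snippet above.

```python
# Hypothetical sketch: build the "# CUDA devices:" header of a repro script
# from an already-collected list of device names.
def repro_header(device_names: list[str]) -> str:
    msg = "# CUDA devices:\n"
    for i, name in enumerate(device_names):
        msg += f"#  {i}: {name}\n"
    return msg
```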

@rdspring1
Collaborator Author

!test

@rdspring1 rdspring1 marked this pull request as ready for review July 7, 2025 16:49
@rdspring1
Collaborator Author

@wujingyue I pushed the multi-device python tests further down the stack, so they are not included in this PR.

Collaborator

@wujingyue wujingyue left a comment

LGTM otherwise

assert (
    "nvfuser" not in sys.modules
), "Cannot import nvfuser_direct if nvfuser module is already imported."
if "nvfuser" in sys.modules:
Collaborator

Are you sure about this change? Importing both triggers a non-deterministic cleanup order (http://nv/eMu). Other singletons (e.g. FusionProfiler) will probably fail in the same way. I'd rather we enforce the one-nvfuser-at-a-time rule for the main branch. (I don't care about that for the inference demo though, which will likely use a separate branch).
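The one-nvfuser-at-a-time rule from the snippet above can be sketched as a small guard. The function name is an assumption for illustration; the diff itself inlines the check with an `assert`.

```python
import sys

# Hedged sketch of the import guard: refuse to load one extension module
# when the conflicting one is already present, since both register
# singletons (e.g. Communicator, FusionProfiler) whose cleanup order
# would otherwise be non-deterministic.
def ensure_not_imported(conflicting: str, importer: str) -> None:
    if conflicting in sys.modules:
        raise ImportError(
            f"Cannot import {importer} if {conflicting} module is already imported."
        )
```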

Collaborator Author

You asked me to translate the existing python tests from legacy to direct bindings without duplicating or separating the tests.

I can either duplicate the tests and keep the modules separate OR import both modules.

Collaborator

I've never tried this so I could be terribly wrong: Can you import nvfuser/nvfuser_direct before importing opinfo? Then opinfo can check which backend was imported via sys.modules.

That said, I don't think it's too bad to import both for the tests. Single-GPU tests don't create Communicators, so they won't trigger the port-in-use error. Multi-GPU tests use very little testing infra -- no opinfo or NVFuserTest. So it should be easy to make sure only one gets imported. How does that sound?

Collaborator Author

I removed these changes. I can easily cherry-pick #4722 if necessary later on.

@rdspring1
Collaborator Author

!test

@rdspring1 rdspring1 merged commit 43e586f into main Jul 8, 2025
49 of 52 checks passed
@rdspring1 rdspring1 deleted the direct_tp branch July 8, 2025 18:08
jacobhinkle added a commit that referenced this pull request Jul 21, 2025
I think this was just part of a refactor from a method to a separate
function in #4689.

Fixes #4806
rdspring1 added a commit that referenced this pull request Jul 22, 2025
This PR adds support for cast operations to Direct Python Bindings.

PR Stack:
- #4689
- #4697  **<<< This PR.**
- #4698
- #4704
- #4701
- #4809
rdspring1 added a commit that referenced this pull request Jul 22, 2025
This PR adds support for matmul and linear ops to Direct Python
Bindings.

PR Stack:
- #4689
- #4697
- #4698 **<<< This PR.**
- #4704
- #4701
- #4809
rdspring1 added a commit that referenced this pull request Jul 22, 2025
…ings (#4704)

This PR adds size, shape, define_vector, and reshape ops to direct
bindings.

PR Stack:
- #4689
- #4697
- #4698
- #4704 **<<< This PR.**
- #4701 
- #4809
rdspring1 added a commit that referenced this pull request Jul 22, 2025
This PR adds support for replacing linear layers with TensorParallel
NvFuser layer in deepseek model using Direct Python Bindings.

PR Stack:
- #4689
- #4697
- #4698
- #4704
- #4701 **<<< This PR.**
- #4809
nsarka pushed a commit to nsarka/Fuser that referenced this pull request Jul 28, 2025
This PR adds Multi-GPU support to Direct Python Bindings.

PR Stack:

- NVIDIA#4689 **<<< This PR.**
- NVIDIA#4697
- NVIDIA#4698
- NVIDIA#4704 
- NVIDIA#4701 

cc: @kshitij12345
nsarka pushed a commit to nsarka/Fuser that referenced this pull request Jul 28, 2025
I think this was just part of a refactor from a method to a separate
function in NVIDIA#4689.

Fixes NVIDIA#4806
nsarka pushed a commit to nsarka/Fuser that referenced this pull request Jul 28, 2025
This PR adds support for cast operations to Direct Python Bindings.

PR Stack:
- NVIDIA#4689
- NVIDIA#4697  **<<< This PR.**
- NVIDIA#4698
- NVIDIA#4704
- NVIDIA#4701
- NVIDIA#4809
nsarka pushed a commit to nsarka/Fuser that referenced this pull request Jul 28, 2025
…IA#4698)

This PR adds support for matmul and linear ops to Direct Python
Bindings.

PR Stack:
- NVIDIA#4689
- NVIDIA#4697
- NVIDIA#4698 **<<< This PR.**
- NVIDIA#4704
- NVIDIA#4701
- NVIDIA#4809
nsarka pushed a commit to nsarka/Fuser that referenced this pull request Jul 28, 2025
…ings (NVIDIA#4704)

This PR adds size, shape, define_vector, and reshape ops to direct
bindings.

PR Stack:
- NVIDIA#4689
- NVIDIA#4697
- NVIDIA#4698
- NVIDIA#4704 **<<< This PR.**
- NVIDIA#4701 
- NVIDIA#4809
nsarka pushed a commit to nsarka/Fuser that referenced this pull request Jul 28, 2025
)

This PR adds support for replacing linear layers with TensorParallel
NvFuser layer in deepseek model using Direct Python Bindings.

PR Stack:
- NVIDIA#4689
- NVIDIA#4697
- NVIDIA#4698
- NVIDIA#4704
- NVIDIA#4701 **<<< This PR.**
- NVIDIA#4809