Review updated until commit fcfc80d
# All rights reserved.
# SPDX-License-Identifier: BSD-3-Clause

import nvfuser_direct as nvfd
Now I have to maintain two conftest.py files and two sets of mostly identical tests. What can we do to make my life easier?
(I guess this is one of the main problems with the flipping-a-big-switch approach. I have no idea why we didn't take a more incremental approach for Python direct, but I believe you have thought about ways to mitigate the negative impact.)
> I have no idea why we didn't take a more incremental approach for Python direct
😆 I didn't either, but I don't want to integrate the lru_cache workaround into Thunder. It is probably sufficient for the llama4 demo.
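For context, the kind of `lru_cache` workaround referred to above can be sketched like this (the function name and signature are illustrative, not Thunder's or nvFuser's actual code): memoize construction per input signature so repeated calls reuse one cached object instead of re-tracing the definition.

```python
from functools import lru_cache

# Hypothetical sketch of an lru_cache-style workaround: cache an expensive
# definition step per (shape, dtype) signature so repeated calls return
# the same cached object instead of rebuilding it.
@lru_cache(maxsize=None)
def build_definition(shape: tuple, dtype: str) -> str:
    # Stand-in for constructing a FusionDefinition for this signature.
    return f"fusion[{shape}, {dtype}]"
```

Because the result is memoized, two calls with the same signature return the very same object.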
Option 1: Create FusionRecord for multi-device primitives so it exists in FusionCache
Option 2: Use direct bindings
Option 2 is more forward looking.
> What can we do to make my life easier?
The foundation for direct bindings is already merged.
Adding the bindings, except for reshape, is easy.
I was going to translate the tests in tests/python/multidevice to direct bindings.
> Option 1: Create FusionRecord for multi-device primitives so it exists in FusionCache
> Option 2: Use direct bindings
I believe there's also an option 3: remove FusionCache so different FusionDefinition instances never conflict. That was the initial solution I had in mind for #4507. I forget why it wasn't considered; was it just that nobody bothered to delete the legacy code? Instead, direct bindings came along as a more expensive but more "forward looking" solution to replace the legacy FusionRecord.
> I was going to translate the tests in tests/python/multidevice to direct bindings.
Why can't the direct-bindings tests stay in tests/python/multidevice? That way, at least, we don't have to reinvent the multidevice fixtures for testing. The actual tests can probably be largely reused; IIUC, they differ only in their multidevice schedules.
> option 3 -- remove FusionCache so different FusionDefinition instances never conflict.
I see the value of option 3, e.g., you don't have to rebind all the operations.
IIRC, though, @kevinstephano didn't want to keep the FusionRecords at all.
> Why can't direct bindings tests stay in tests/python/multidevice?
So, we want to parameterize multidevice_test to use either the existing python_frontend or direct_bindings?
Then, for the tests that direct_bindings can run correctly, run both configurations.
It is kind of painful to combine them. Maybe do this refactor after #4701.
I'm seeing this exception because nvfuser and nvfuser_direct both define cleanup:
terminate called after throwing an instance of 'c10d::SocketError'
what(): The server socket has failed to listen on any local network address. port: 29542, useIpv6: false, code: -98, name: EADDRINUSE, message: address already in use
Exception raised from makeWithPort at /opt/pytorch/pytorch/torch/csrc/distributed/c10d/TCPStoreLibUvBackend.cpp:307 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >) + 0xbc (0xf906fefedaec in /usr/local/lib/python3.12/dist-packages/torch/lib/libc10.so)
frame #1: <unknown function> + 0x5a2e5b0 (0xf9072c2ce5b0 in /usr/local/lib/python3.12/dist-packages/torch/lib/libtorch_cpu.so)
frame #2: <unknown function> + 0x5a4729c (0xf9072c2e729c in /usr/local/lib/python3.12/dist-packages/torch/lib/libtorch_cpu.so)
frame #3: <unknown function> + 0x5a4c1dc (0xf9072c2ec1dc in /usr/local/lib/python3.12/dist-packages/torch/lib/libtorch_cpu.so)
frame #4: <unknown function> + 0x5a4e65c (0xf9072c2ee65c in /usr/local/lib/python3.12/dist-packages/torch/lib/libtorch_cpu.so)
frame #5: <unknown function> + 0x5a31ca4 (0xf9072c2d1ca4 in /usr/local/lib/python3.12/dist-packages/torch/lib/libtorch_cpu.so)
frame #6: c10d::TCPStore::TCPStore(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, c10d::TCPStoreOptions const&) + 0x128 (0xf9072c2d5868 in /usr/local/lib/python3.12/dist-packages/torch/lib/libtorch_cpu.so)
frame #7: <unknown function> + 0x7ea34c (0xf904ff50a34c in /opt/pytorch/nvfuser/python/nvfuser/_C.cpython-312-aarch64-linux-gnu.so)
frame #8: <unknown function> + 0x7eabbc (0xf904ff50abbc in /opt/pytorch/nvfuser/python/nvfuser/_C.cpython-312-aarch64-linux-gnu.so)
frame #9: nvfuser::python_frontend::cleanup() + 0xc (0xf904fee6a3ec in
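The EADDRINUSE failure above is the generic two-listeners-on-one-port error; a minimal reproduction outside nvFuser (pure sockets, no TCPStore) shows the same symptom of two servers binding the same local port:

```python
import errno
import socket

# Minimal reproduction of the EADDRINUSE symptom in the trace above:
# two server sockets trying to listen on the same local port, which is
# what happens when both modules spin up their own TCPStore server.
first = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
first.bind(("127.0.0.1", 0))  # let the OS pick a free port
first.listen()
port = first.getsockname()[1]

second = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
err = None
try:
    second.bind(("127.0.0.1", port))  # same port -> "address already in use"
    second.listen()
except OSError as e:
    err = e.errno  # EADDRINUSE; "code: -98" in the trace is -EADDRINUSE
finally:
    second.close()
    first.close()
```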
> So, we want to parameterize multidevice_test to use existing python_frontend and direct_bindings?
Yes, a pytest.fixture can take params: https://docs.pytest.org/en/stable/how-to/fixtures.html#fixture-parametrize
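Concretely, a parametrized fixture along these lines (the names are illustrative stand-ins, not the actual multidevice_test fixture) would run each test once per backend:

```python
import pytest

# Hypothetical sketch: parameterize a fixture over both backends so each
# test using it runs twice, once per backend. The string names stand in
# for setting up nvfuser vs. nvfuser_direct.
@pytest.fixture(params=["legacy", "direct"])
def backend(request):
    # In the real fixture, this would import/configure the requested bindings.
    return request.param

def test_pointwise_add(backend):
    # Collected twice: test_pointwise_add[legacy] and test_pointwise_add[direct].
    assert backend in ("legacy", "direct")
```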
I noticed some minor differences between legacy and direct, e.g., nvfuser.Communicator.instance() vs nvfd.multidevice.Communicator.instance(). If too annoying, I'd put Communicator at the nvfd top level for now and refactor after the giant switch.
> Then, on the tests that direct_bindings can run correctly, test both configurations.
I'm less worried about duplicating the tests. There's a balance between DRY and DAMP: https://testing.googleblog.com/2019/12/testing-on-toilet-tests-too-dry-make.html. I have certainly seen tests too DRY to even understand, e.g., Fuser/tests/cpp/test_multidevice_pipeline.cpp, lines 268 to 273 at c6aab66.
For multi-GPU tests, I think we can parameterize the tests without making them terrible to understand. For example,
def define_foo_math(fd: FusionDefinition):
    ...


def define_foo_multidevice_schedule(fd: FusionDefinition):
    ...


# For legacy bindings
class Model(FusionDefinition):
    def definition(self):
        define_foo_math(self)

    def multidevice_schedule(self):
        define_foo_multidevice_schedule(self)


model = Model()
model.execute(...)

# For direct bindings
with FusionDefinition() as fd:
    define_foo_math(fd)
    define_foo_multidevice_schedule(fd)
fd.execute(...)

Either way, I've no idea how you plan to migrate the tests (multi-GPU or single-GPU). That's more concerning. Do you plan to duplicate/parameterize all of them, or wait until you've gained enough confidence in direct bindings? When will we flip the switch to make nvfd the default? When will nvFuser engineers start to write nvfd tests? When will we have to maintain both nvfuser tests and nvfd tests?
Is it crazy to add just the bindings in this PR and keep the tests in a separate branch? This way, we can unblock Kshiteej without addressing my concerns about the tests. After the crunch, we can certainly discuss the legacy-to-direct migration plan.
8fe2f16 to d9dc8e5
Not for this PR, but it would be nice to have Fuser/python/nvfuser/__init__.py, lines 508 to 509 at 54c60d6.
It should be straightforward to add
!test
@wujingyue I pushed the multi-device python tests further down the stack, so they are not included in this PR.
python/nvfuser_direct/__init__.py (outdated)

assert (
    "nvfuser" not in sys.modules
), "Cannot import nvfuser_direct if nvfuser module is already imported."
if "nvfuser" in sys.modules:
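As a runnable sketch of the guard being discussed (guard_import is a hypothetical name, not nvFuser's API): strict mode models the original hard assert, while non-strict mode models the softer check from the outdated diff.

```python
import sys

# Sketch of the import guard under discussion; guard_import is a
# hypothetical helper. strict=True models the original assert, and
# strict=False models the softer "if" check that merely reports whether
# the legacy module is already loaded.
def guard_import(strict: bool = True) -> bool:
    legacy_loaded = "nvfuser" in sys.modules
    if strict:
        assert not legacy_loaded, (
            "Cannot import nvfuser_direct if nvfuser module is already imported."
        )
    return legacy_loaded
```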
Are you sure about this change? Importing both triggers a non-deterministic cleanup order (http://nv/eMu). Other singletons (e.g. FusionProfiler) will probably fail in the same way. I'd rather we enforce the one-nvfuser-at-a-time rule for the main branch. (I don't care about that for the inference demo though, which will likely use a separate branch).
You asked me to translate the existing python tests from legacy to direct bindings without duplicating or separating the tests.
I can either duplicate the tests and keep the modules separate, or import both modules.
I've never tried this so I could be terribly wrong: Can you import nvfuser/nvfuser_direct before importing opinfo? Then opinfo can check which backend was imported via sys.modules.
That said, I don't think it's too bad to import both for the tests. Single-GPU tests don't create Communicators, so they won't trigger the port-in-use error. Multi-GPU tests use very little testing infra -- no opinfo or NVFuserTest -- so it should be easy to make sure only one gets imported. How does that sound?
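The sys.modules check suggested above could look roughly like this (detect_backend is a hypothetical helper name, not existing opinfo code):

```python
import sys

# Hypothetical helper: opinfo could infer which bindings the test session
# imported by inspecting sys.modules, rather than importing both itself.
def detect_backend():
    if "nvfuser_direct" in sys.modules:
        return "direct"
    if "nvfuser" in sys.modules:
        return "legacy"
    return None  # neither backend has been imported yet
```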
I removed these changes. I can easily cherry-pick #4722 if necessary later on.
!test
This PR adds multi-GPU support to the Direct Python Bindings.

PR Stack:
- NVIDIA#4689 **<<< This PR.**
- NVIDIA#4697
- NVIDIA#4698
- NVIDIA#4704
- NVIDIA#4701

cc: @kshitij12345
I think this was just part of a refactor from a method to a separate function in NVIDIA#4689. Fixes NVIDIA#4806
This PR adds support for cast operations to the Direct Python Bindings.

PR Stack:
- NVIDIA#4689
- NVIDIA#4697 **<<< This PR.**
- NVIDIA#4698
- NVIDIA#4704
- NVIDIA#4701
- NVIDIA#4809
(NVIDIA#4698) This PR adds support for matmul and linear ops to the Direct Python Bindings.

PR Stack:
- NVIDIA#4689
- NVIDIA#4697
- NVIDIA#4698 **<<< This PR.**
- NVIDIA#4704
- NVIDIA#4701
- NVIDIA#4809
(NVIDIA#4704) This PR adds size, shape, define_vector, and reshape ops to the direct bindings.

PR Stack:
- NVIDIA#4689
- NVIDIA#4697
- NVIDIA#4698
- NVIDIA#4704 **<<< This PR.**
- NVIDIA#4701
- NVIDIA#4809
This PR adds support for replacing linear layers with a TensorParallel nvFuser layer in the DeepSeek model, using the Direct Python Bindings.

PR Stack:
- NVIDIA#4689
- NVIDIA#4697
- NVIDIA#4698
- NVIDIA#4704
- NVIDIA#4701 **<<< This PR.**
- NVIDIA#4809