Add python-nvmath executor for prims.matmul and prims.linear #1917

IvanYashchuk · 2025-03-31T07:29:54Z

Adds an executor using https://github.com/NVIDIA/nvmath-python for prims.matmul and prims.linear. This executor is expected to work with any (matmul-compatible) shape, stride, and dtype combination on CUDA devices.

The executor has an autotuning phase to select a better cuBLASLt configuration for given shapes and strides, and on top of that, there's an additional autotuning phase for selecting nvmath-cuBLASLt or torch.matmul for execution, comparing the median of kernel times. The result of autotuning is cached for one program run.

Tests take about a minute to run:

pytest thunder/tests/test_nvmath_executor.py

thunder/tests/test_nvmath_executor.py::test_matmul_nvmath_cuda_thunder::dtypes::bfloat16 <- thunder/tests/framework.py PASSED                                                                                  [  7%]
thunder/tests/test_nvmath_executor.py::test_matmul_nvmath_cuda_thunder::dtypes::complex128 <- thunder/tests/framework.py PASSED                                                                                [ 14%]
thunder/tests/test_nvmath_executor.py::test_matmul_nvmath_cuda_thunder::dtypes::complex32 <- thunder/tests/framework.py XFAIL                                                                                  [ 21%]
thunder/tests/test_nvmath_executor.py::test_matmul_nvmath_cuda_thunder::dtypes::complex64 <- thunder/tests/framework.py PASSED                                                                                 [ 28%]
thunder/tests/test_nvmath_executor.py::test_matmul_nvmath_cuda_thunder::dtypes::float16 <- thunder/tests/framework.py PASSED                                                                                   [ 35%]
thunder/tests/test_nvmath_executor.py::test_matmul_nvmath_cuda_thunder::dtypes::float32 <- thunder/tests/framework.py PASSED                                                                                   [ 42%]
thunder/tests/test_nvmath_executor.py::test_matmul_nvmath_cuda_thunder::dtypes::float64 <- thunder/tests/framework.py PASSED                                                                                   [ 50%]
thunder/tests/test_nvmath_executor.py::test_linear_nvmath_cuda_thunder::dtypes::bfloat16 <- thunder/tests/framework.py PASSED                                                                                  [ 57%]
thunder/tests/test_nvmath_executor.py::test_linear_nvmath_cuda_thunder::dtypes::complex128 <- thunder/tests/framework.py PASSED                                                                                [ 64%]
thunder/tests/test_nvmath_executor.py::test_linear_nvmath_cuda_thunder::dtypes::complex32 <- thunder/tests/framework.py XFAIL                                                                                  [ 71%]
thunder/tests/test_nvmath_executor.py::test_linear_nvmath_cuda_thunder::dtypes::complex64 <- thunder/tests/framework.py PASSED                                                                                 [ 78%]
thunder/tests/test_nvmath_executor.py::test_linear_nvmath_cuda_thunder::dtypes::float16 <- thunder/tests/framework.py PASSED                                                                                   [ 85%]
thunder/tests/test_nvmath_executor.py::test_linear_nvmath_cuda_thunder::dtypes::float32 <- thunder/tests/framework.py PASSED                                                                                   [ 92%]
thunder/tests/test_nvmath_executor.py::test_linear_nvmath_cuda_thunder::dtypes::float64 <- thunder/tests/framework.py PASSED                                                                                   [100%]

=============================================================================== 12 passed, 2 xfailed, 149 warnings in 79.34s (0:01:19) ===============================================================================

Currently, there's a bug in OperatorExecutors that prevents its transformation from decomposing operations to find supported prims in the multilevel representation.

cc @Borda @mruberry

Copilot

Pull Request Overview

This PR adds an executor leveraging the nvmath-python library for prims.matmul and prims.linear, including an autotuning phase to select an optimal cuBLASLt configuration and a subsequent comparison between nvmath and torch.matmul.

Introduces a new test suite for verifying the executor behavior.
Implements the executor logic with caching and autotuning in nvmathex.py.

Reviewed Changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated no comments.

File	Description
thunder/tests/test_nvmath_executor.py	New tests ensuring matmul and linear operators are executed correctly with the nvmath executor.
thunder/executors/nvmathex.py	Executor implementation with caching, autotuning, and registration for matmul and linear ops.

mruberry · 2025-03-31T13:04:35Z

Would you create an issue for the OperatorExecutor bug you discovered (and link it here)?

mruberry · 2025-03-31T13:06:20Z

CI issues are real dependency issues:

==================================== ERRORS ====================================
____________ ERROR collecting thunder/tests/test_nvmath_executor.py ____________
ImportError while importing test module '/home/runner/work/lightning-thunder/lightning-thunder/thunder/tests/test_nvmath_executor.py'.
Hint: make sure your test modules/packages have valid Python names.
Traceback:
/opt/hostedtoolcache/Python/3.10.16/x64/lib/python3.10/importlib/__init__.py:126: in import_module
    return _bootstrap._gcd_import(name[level:], package, level)
thunder/tests/test_nvmath_executor.py:5: in <module>
    from thunder.executors.nvmathex import nvmath_matmul_ex
thunder/executors/nvmathex.py:7: in <module>
    import nvmath
E   ModuleNotFoundError: No module named 'nvmath'

Maybe a pytest.importorskip is needed? Then we can file a follow-up issue to add nvmath as a dependency to one or more CI jobs?
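
A minimal sketch of what that suggestion could look like, assuming pytest is installed. `import_or_skip` is a hypothetical wrapper added here only so the behaviour can be exercised outside a test run. In a test module the usual form is the single module-level call shown in the trailing comment.

```python
import pytest


def import_or_skip(module_name: str):
    """Return the imported module, or raise pytest's skip exception if it is missing."""
    return pytest.importorskip(module_name)


# Typical module-level usage at the top of a test file (assumed placement):
# nvmath = pytest.importorskip("nvmath")
```

Because the skip is raised while the module is being collected, pytest reports the file as skipped instead of failing with an ImportError.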

mruberry · 2025-03-31T13:06:55Z

thunder/executors/nvmathex.py

@@ -0,0 +1,194 @@
+import logging


A comment

""" The nvmath executor. ... """

would be nice

mruberry · 2025-03-31T13:07:55Z

thunder/executors/nvmathex.py

+@dataclass(frozen=True, slots=True)
+class TensorDescriptor:
+    """
+    A dataclass to store the shape, stride, dtype, and device index of a tensor for caching purposes.


Nice comment

mruberry · 2025-03-31T13:08:05Z

thunder/executors/nvmathex.py

+
+def execute_nvmath_matmul(mm: Matmul, a: torch.Tensor, b: torch.Tensor) -> torch.Tensor:
+    """
+    Executes a matrix multiplication operation using nvmath for the given operands and matmul executor.


Stellar comment

mruberry · 2025-03-31T13:09:08Z

thunder/executors/nvmathex.py

+    return mm.execute()  # This function has about 75 µs overhead
+
+
+def nvmath_or_pytorch_matmul(mm: Matmul) -> Callable:


fyi @kiya00

We should look at generalizing selecting executors by benchmarking

mruberry · 2025-03-31T13:11:14Z

thunder/executors/nvmathex.py

+    return mm(a, b)
+
+
+def matmul_checker(a: TensorProxy, b: TensorProxy) -> bool:


Nice checker

mruberry · 2025-03-31T13:12:18Z

thunder/tests/test_nvmath_executor.py

+        return [nvmath_matmul_ex]
+
+
+@ops((op for op in opinfos if op.name in ("matmul", "linear")), supported_executors=(nvMathTestExecutor(),))


mruberry · 2025-03-31T13:17:52Z

thunder/tests/test_nvmath_executor.py

+def test(op, device, dtype, executor, comparator):
+    for sample in op.sample_inputs(device, dtype, requires_grad=False):
+        # prims do not support broadcasting
+        if sample.args[0].shape[:-2] != sample.args[1].shape[:-2]:


These are detailed manipulations of the sample inputs, and they make complete sense

One extension to the OpInfos that might be interesting (and is certainly not for this PR) is to make it easy to bind the sample inputs to the signatures of functions, so that instead of having to write

if len(sample.args) == 3:

or

if op.name == "linear" and len(sample.args) == 2:

a developer might be able to write something like

ba = sample.bind(op)
if ba['bias'] is not None: ...

Which isn't a huge improvement but it's a little more readable.
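
A sketch of how such binding could work with the standard library, assuming the op exposes a Python callable with named parameters. `linear` and `bind_sample` below are hypothetical stand-ins. `sample.bind(op)` does not exist in the OpInfo code today.

```python
import inspect


def linear(a, w, bias=None):
    """Hypothetical stand-in that only carries the parameter names of a linear op."""


def bind_sample(fn, args, kwargs=None):
    """Map positional and keyword sample inputs onto fn's parameter names."""
    bound = inspect.signature(fn).bind(*args, **(kwargs or {}))
    bound.apply_defaults()  # fill in omitted parameters such as bias=None
    return bound.arguments


ba = bind_sample(linear, ("A", "W"))
has_bias = ba["bias"] is not None
```

With this, a test can ask about `ba["bias"]` by name instead of checking `len(sample.args)`.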

mruberry

Really exciting, but the CI is failing because of the nvmath dependency. Maybe we just want to skip the tests when nvmath isn't installed for now?

t-vi · 2025-03-31T14:26:26Z

Sounds really cool @IvanYashchuk, but I'd be wary of having untested executors.
We lack CI coverage for TransformerEngine due to the hardware we run on, but it's not a great experience.

mruberry · 2025-03-31T14:36:32Z

Sounds really cool @IvanYashchuk, but I'd be wary of having untested executors. We lack CI coverage for TransformerEngine due to the hardware we run on, but it's not a great experience.

We could add nvmath to a CI job, or we could look at testing this in NVIDIA's CI

IvanYashchuk · 2025-04-01T07:47:12Z

thunder/tests/test_nvmath_executor.py

+        return [nvmath_matmul_ex]
+
+
+@ops((op for op in opinfos if op.name in ("matmul", "linear")), supported_executors=(nvMathTestExecutor(),))


The test should be skipped when CUDA devices are not available.

I think the test will be skipped, but because the import is unconditional, the CI fails during test discovery.

riccardofelluga · 2025-04-01T07:55:13Z

thunder/executors/nvmathex.py

+import nvmath
+
+import torch
+from nvmath.linalg.advanced import Matmul, MatmulOptions


Is nvmath included in thunder dependencies already? If not I think it would be good to add an import check 👀

We don't add CUDA-only dependencies to the requirements.txt files. If nvmath is not installed, importing this file will fail with a standard Python import error. What import check would you like to see?

I was thinking of something like what the TransformerEngine executor does, but it is not mandatory of course

lightning-thunder/thunder/executors/transformer_engineex.py

Lines 34 to 63 in 2d18b7a

TE_AVAILABLE: bool = package_available("transformer_engine")

# We rely on internal details of TransformerEngine like `_Linear` autograd.Function.
# As these details are not public, they can change
# Ex. addition of a positional argument for cpu_offloading (not as the last argument)
# between version 1.2 and 1.3.
# Hence, we have these guards based on version.

te: None | Any = None
if TE_AVAILABLE:
    try:
        import transformer_engine.pytorch as te
        from transformer_engine.common import recipe
        from transformer_engine.common.recipe import MXFP8BlockScaling, DelayedScaling
        from transformer_engine.pytorch.constants import MXFP8_BLOCK_SCALING_SIZE
        from transformer_engine.pytorch.module.linear import _Linear
        from transformer_engine.pytorch.module.base import TransformerEngineBaseModule
        from transformer_engine.pytorch.fp8 import FP8GlobalStateManager, get_default_fp8_recipe
        from transformer_engine.pytorch.utils import check_dim_for_fp8_exec
        from transformer_engine.pytorch.cpu_offload import CPUOffloadEnabled
        import transformer_engine_torch as tex
    except Exception as ex:
        warnings.warn(f"transformer_engine failed to import with exception {ex}")
        TE_AVAILABLE = False

    TE_VERSION_2_0_PLUS = LooseVersion(version("transformer_engine")) > LooseVersion("2.0")
    if not TE_VERSION_2_0_PLUS:
        msg = f"Installed version of transformer_engine {version('transformer_engine')} is not supported, please upgrade to version 2.0 from https://github.com/NVIDIA/TransformerEngine/tree/release_v2.0. `transformer_engine_ex` will not be used."
        warnings.warn(msg)
        TE_AVAILABLE = False
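
Applied to nvmath, a comparable guard might look like the sketch below. It uses only the standard library. `optional_import` and `NVMATH_AVAILABLE` are hypothetical names chosen for illustration, not identifiers from this PR or from thunder.

```python
import importlib
import importlib.util
import warnings


def optional_import(name: str):
    """Return (module, True) if the top-level package imports cleanly, else (None, False)."""
    if importlib.util.find_spec(name) is None:
        # Package is not installed: report unavailability without raising.
        return None, False
    try:
        return importlib.import_module(name), True
    except Exception as ex:
        # Package is installed but broken (for example, missing CUDA libraries).
        warnings.warn(f"{name} failed to import with exception {ex}")
        return None, False


nvmath, NVMATH_AVAILABLE = optional_import("nvmath")
```

An executor module could then register its implementations only when `NVMATH_AVAILABLE` is true, so that importing the module on a machine without nvmath does not raise.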

IvanYashchuk added 2 commits March 31, 2025 09:36

Add nvmath_matmul_ex

a782f5e

Add opinfo-based test

305635f

IvanYashchuk added cuda executors labels Mar 31, 2025

IvanYashchuk requested review from mruberry, lantiga and t-vi as code owners March 31, 2025 07:29

IvanYashchuk requested a review from Copilot March 31, 2025 07:30

[pre-commit.ci] auto fixes from pre-commit.com hooks

9c6807e

for more information, see https://pre-commit.ci
