[kernels] Triton kernels (all2allV) do not work on B200 Blackwell "ValueError: Pointer argument (at 0) cannot be accessed from Triton (cpu tensor?)" #1225

lessw2020 · 2025-05-26T18:03:43Z

Bug description

Running DeepSeek inference (which uses the all2allV Triton kernel) fails with an invalid pointer access.
This is likely specific to Blackwell, since the same code works fine on H100.

Full error:

Running inference with deepseek-ai/DeepSeek-V2-Lite-Chat on (1, 4) mesh
Creating model stage 0 of 1
Creating model stage 0 of 1
tokenizer_config.json: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 1.28k/1.28k [00:00<00:00, 11.6MB/s]
tokenizer.json: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 4.61M/4.61M [00:00<00:00, 39.2MB/s]
Generating: NCCL version 2.26.5+cuda12.9
[rank3]: Traceback (most recent call last):
[rank3]:   File "/data/users/less/torchtitan/torchtitan/experiments/deepseek_v3/generate.py", line 377, in <module>
[rank3]:     generate(model, pp_schedule, tokenizer, dist_config, messages)
[rank3]:   File "/data/users/less/torchtitan/torchtitan/experiments/deepseek_v3/generate.py", line 198, in wrapper
[rank3]:     result, tokens_generated = func(*args, **kwargs)
[rank3]:                                ^^^^^^^^^^^^^^^^^^^^^
[rank3]:   File "/home/less/.conda/envs/pycutlass/lib/python3.12/site-packages/torch/utils/_contextlib.py", line 116, in decorate_context
[rank3]:     return func(*args, **kwargs)
[rank3]:            ^^^^^^^^^^^^^^^^^^^^^
[rank3]:   File "/data/users/less/torchtitan/torchtitan/experiments/deepseek_v3/generate.py", line 281, in generate
[rank3]:     preds = model(x)
[rank3]:             ^^^^^^^^
[rank3]:   File "/home/less/.conda/envs/pycutlass/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1767, in _wrapped_call_impl
[rank3]:     return self._call_impl(*args, **kwargs)
[rank3]:            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank3]:   File "/home/less/.conda/envs/pycutlass/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1778, in _call_impl
[rank3]:     return forward_call(*args, **kwargs)
[rank3]:            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank3]:   File "/data/users/less/torchtitan/torchtitan/experiments/deepseek_v3/model.py", line 1322, in forward
[rank3]:     hidden_states = self.model(
[rank3]:                     ^^^^^^^^^^^
[rank3]:   File "/home/less/.conda/envs/pycutlass/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1767, in _wrapped_call_impl
[rank3]:     return self._call_impl(*args, **kwargs)
[rank3]:            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank3]:   File "/home/less/.conda/envs/pycutlass/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1778, in _call_impl
[rank3]:     return forward_call(*args, **kwargs)
[rank3]:            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank3]:   File "/data/users/less/torchtitan/torchtitan/experiments/deepseek_v3/model.py", line 1274, in forward
[rank3]:     hidden_states = decoder_layer(
[rank3]:                     ^^^^^^^^^^^^^^
[rank3]:   File "/home/less/.conda/envs/pycutlass/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1767, in _wrapped_call_impl
[rank3]:     return self._call_impl(*args, **kwargs)
[rank3]:            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank3]:   File "/home/less/.conda/envs/pycutlass/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1778, in _call_impl
[rank3]:     return forward_call(*args, **kwargs)
[rank3]:            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank3]:   File "/data/users/less/torchtitan/torchtitan/experiments/deepseek_v3/model.py", line 1123, in forward
[rank3]:     hidden_states = self.mlp(hidden_states)
[rank3]:                     ^^^^^^^^^^^^^^^^^^^^^^^
[rank3]:   File "/home/less/.conda/envs/pycutlass/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1767, in _wrapped_call_impl
[rank3]:     return self._call_impl(*args, **kwargs)
[rank3]:            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank3]:   File "/home/less/.conda/envs/pycutlass/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1778, in _call_impl
[rank3]:     return forward_call(*args, **kwargs)
[rank3]:            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank3]:   File "/data/users/less/torchtitan/torchtitan/experiments/deepseek_v3/model.py", line 640, in forward
[rank3]:     y = self.moe_on_device(hidden_states, topk_idx, topk_weight)
[rank3]:         ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank3]:   File "/data/users/less/torchtitan/torchtitan/experiments/deepseek_v3/model.py", line 829, in moe_on_device
[rank3]:     token_gather_buf, output_splits = OnDeviceAllToAllV.apply(
[rank3]:                                       ^^^^^^^^^^^^^^^^^^^^^^^^
[rank3]:   File "/home/less/.conda/envs/pycutlass/lib/python3.12/site-packages/torch/autograd/function.py", line 576, in apply
[rank3]:     return super().apply(*args, **kwargs)  # type: ignore[misc]
[rank3]:            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank3]:   File "/data/users/less/torchtitan/torchtitan/experiments/deepseek_v3/symm_mem_recipes/triton_on_device_all_to_all_v.py", line 205, in forward
[rank3]:     _on_device_all_to_all_v(
[rank3]:   File "/data/users/less/torchtitan/torchtitan/experiments/deepseek_v3/symm_mem_recipes/triton_on_device_all_to_all_v.py", line 145, in _on_device_all_to_all_v
[rank3]:     kernel = on_device_all_to_all_v_kernel[(num_blocks, 1, 1)](
[rank3]:              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank3]:   File "/home/less/.conda/envs/pycutlass/lib/python3.12/site-packages/triton/runtime/jit.py", line 347, in <lambda>
[rank3]:     return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs)
[rank3]:                                    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank3]:   File "/home/less/.conda/envs/pycutlass/lib/python3.12/site-packages/triton/runtime/jit.py", line 591, in run
[rank3]:     kernel.run(grid_0, grid_1, grid_2, stream, kernel.function, kernel.packed_metadata,
[rank3]:   File "/home/less/.conda/envs/pycutlass/lib/python3.12/site-packages/triton/backends/nvidia/driver.py", line 529, in __call__
[rank3]:     self.launch(gridX, gridY, gridZ, stream, function, self.launch_cooperative_grid, global_scratch, *args)
[rank3]: ValueError: Pointer argument (at 0) cannot be accessed from Triton (cpu tensor?)
[rank1]: Traceback (most recent call last):
[rank1]:   File "/data/users/less/torchtitan/torchtitan/experiments/deepseek_v3/generate.py", line 377, in <module>
[rank1]:     generate(model, pp_schedule, tokenizer, dist_config, messages)
[rank1]:   File "/data/users/less/torchtitan/torchtitan/experiments/deepseek_v3/generate.py", line 198, in wrapper
[rank1]:     result, tokens_generated = func(*args, **kwargs)
[rank1]:                                ^^^^^^^^^^^^^^^^^^^^^
[rank1]:   File "/home/less/.conda/envs/pycutlass/lib/python3.12/site-packages/torch/utils/_contextlib.py", line 116, in decorate_context
[rank1]:     return func(*args, **kwargs)
[rank1]:            ^^^^^^^^^^^^^^^^^^^^^
[rank1]:   File "/data/users/less/torchtitan/torchtitan/experiments/deepseek_v3/generate.py", line 281, in generate
[rank1]:     preds = model(x)
[rank1]:             ^^^^^^^^
[rank1]:   File "/home/less/.conda/envs/pycutlass/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1767, in _wrapped_call_impl
[rank1]:     return self._call_impl(*args, **kwargs)
[rank1]:            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank1]:   File "/home/less/.conda/envs/pycutlass/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1778, in _call_impl
[rank1]:     return forward_call(*args, **kwargs)
[rank1]:            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank1]:   File "/data/users/less/torchtitan/torchtitan/experiments/deepseek_v3/model.py", line 1322, in forward
[rank1]:     hidden_states = self.model(
[rank1]:                     ^^^^^^^^^^^
[rank1]:   File "/home/less/.conda/envs/pycutlass/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1767, in _wrapped_call_impl
[rank1]:     return self._call_impl(*args, **kwargs)
[rank1]:            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank1]:   File "/home/less/.conda/envs/pycutlass/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1778, in _call_impl
[rank1]:     return forward_call(*args, **kwargs)
[rank1]:            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank1]:   File "/data/users/less/torchtitan/torchtitan/experiments/deepseek_v3/model.py", line 1274, in forward
[rank1]:     hidden_states = decoder_layer(
[rank1]:                     ^^^^^^^^^^^^^^
[rank1]:   File "/home/less/.conda/envs/pycutlass/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1767, in _wrapped_call_impl
[rank1]:     return self._call_impl(*args, **kwargs)
[rank1]:            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank1]:   File "/home/less/.conda/envs/pycutlass/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1778, in _call_impl
[rank1]:     return forward_call(*args, **kwargs)
[rank1]:            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank1]:   File "/data/users/less/torchtitan/torchtitan/experiments/deepseek_v3/model.py", line 1123, in forward
[rank1]:     hidden_states = self.mlp(hidden_states)
[rank1]:                     ^^^^^^^^^^^^^^^^^^^^^^^
[rank1]:   File "/home/less/.conda/envs/pycutlass/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1767, in _wrapped_call_impl
[rank1]:     return self._call_impl(*args, **kwargs)
[rank1]:            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank1]:   File "/home/less/.conda/envs/pycutlass/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1778, in _call_impl
[rank1]:     return forward_call(*args, **kwargs)
[rank1]:            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank1]:   File "/data/users/less/torchtitan/torchtitan/experiments/deepseek_v3/model.py", line 640, in forward
[rank1]:     y = self.moe_on_device(hidden_states, topk_idx, topk_weight)
[rank1]:         ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank1]:   File "/data/users/less/torchtitan/torchtitan/experiments/deepseek_v3/model.py", line 829, in moe_on_device
[rank1]:     token_gather_buf, output_splits = OnDeviceAllToAllV.apply(
[rank1]:                                       ^^^^^^^^^^^^^^^^^^^^^^^^
[rank1]:   File "/home/less/.conda/envs/pycutlass/lib/python3.12/site-packages/torch/autograd/function.py", line 576, in apply
[rank1]:     return super().apply(*args, **kwargs)  # type: ignore[misc]
[rank1]:            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank1]:   File "/data/users/less/torchtitan/torchtitan/experiments/deepseek_v3/symm_mem_recipes/triton_on_device_all_to_all_v.py", line 205, in forward
[rank1]:     _on_device_all_to_all_v(
[rank1]:   File "/data/users/less/torchtitan/torchtitan/experiments/deepseek_v3/symm_mem_recipes/triton_on_device_all_to_all_v.py", line 145, in _on_device_all_to_all_v
[rank1]:     kernel = on_device_all_to_all_v_kernel[(num_blocks, 1, 1)](
[rank1]:              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank1]:   File "/home/less/.conda/envs/pycutlass/lib/python3.12/site-packages/triton/runtime/jit.py", line 347, in <lambda>
[rank1]:     return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs)
[rank1]:                                    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank1]:   File "/home/less/.conda/envs/pycutlass/lib/python3.12/site-packages/triton/runtime/jit.py", line 591, in run
[rank1]:     kernel.run(grid_0, grid_1, grid_2, stream, kernel.function, kernel.packed_metadata,
[rank1]:   File "/home/less/.conda/envs/pycutlass/lib/python3.12/site-packages/triton/backends/nvidia/driver.py", line 529, in __call__
[rank1]:     self.launch(gridX, gridY, gridZ, stream, function, self.launch_cooperative_grid, global_scratch, *args)
[rank1]: ValueError: Pointer argument (at 0) cannot be accessed from Triton (cpu tensor?)
[rank2]: Traceback (most recent call last):
[rank2]:   File "/data/users/less/torchtitan/torchtitan/experiments/deepseek_v3/generate.py", line 377, in <module>
[rank2]:     generate(model, pp_schedule, tokenizer, dist_config, messages)
[rank2]:   File "/data/users/less/torchtitan/torchtitan/experiments/deepseek_v3/generate.py", line 198, in wrapper
[rank2]:     result, tokens_generated = func(*args, **kwargs)
[rank2]:                                ^^^^^^^^^^^^^^^^^^^^^
[rank2]:   File "/home/less/.conda/envs/pycutlass/lib/python3.12/site-packages/torch/utils/_contextlib.py", line 116, in decorate_context
[rank2]:     return func(*args, **kwargs)
[rank2]:            ^^^^^^^^^^^^^^^^^^^^^
[rank2]:   File "/data/users/less/torchtitan/torchtitan/experiments/deepseek_v3/generate.py", line 281, in generate
[rank2]:     preds = model(x)
[rank2]:             ^^^^^^^^
[rank2]:   File "/home/less/.conda/envs/pycutlass/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1767, in _wrapped_call_impl
[rank2]:     return self._call_impl(*args, **kwargs)
[rank2]:            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank2]:   File "/home/less/.conda/envs/pycutlass/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1778, in _call_impl
[rank2]:     return forward_call(*args, **kwargs)
[rank2]:            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank2]:   File "/data/users/less/torchtitan/torchtitan/experiments/deepseek_v3/model.py", line 1322, in forward
[rank2]:     hidden_states = self.model(
[rank2]:                     ^^^^^^^^^^^
[rank2]:   File "/home/less/.conda/envs/pycutlass/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1767, in _wrapped_call_impl
[rank2]:     return self._call_impl(*args, **kwargs)
[rank2]:            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank2]:   File "/home/less/.conda/envs/pycutlass/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1778, in _call_impl
[rank2]:     return forward_call(*args, **kwargs)
[rank2]:            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank2]:   File "/data/users/less/torchtitan/torchtitan/experiments/deepseek_v3/model.py", line 1274, in forward
[rank2]:     hidden_states = decoder_layer(
[rank2]:                     ^^^^^^^^^^^^^^
[rank2]:   File "/home/less/.conda/envs/pycutlass/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1767, in _wrapped_call_impl
[rank2]:     return self._call_impl(*args, **kwargs)
[rank2]:            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank2]:   File "/home/less/.conda/envs/pycutlass/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1778, in _call_impl
[rank2]:     return forward_call(*args, **kwargs)
[rank2]:            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank2]:   File "/data/users/less/torchtitan/torchtitan/experiments/deepseek_v3/model.py", line 1123, in forward
[rank2]:     hidden_states = self.mlp(hidden_states)
[rank2]:                     ^^^^^^^^^^^^^^^^^^^^^^^
[rank2]:   File "/home/less/.conda/envs/pycutlass/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1767, in _wrapped_call_impl
[rank2]:     return self._call_impl(*args, **kwargs)
[rank2]:            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank2]:   File "/home/less/.conda/envs/pycutlass/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1778, in _call_impl
[rank2]:     return forward_call(*args, **kwargs)
[rank2]:            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank2]:   File "/data/users/less/torchtitan/torchtitan/experiments/deepseek_v3/model.py", line 640, in forward
[rank2]:     y = self.moe_on_device(hidden_states, topk_idx, topk_weight)
[rank2]:         ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank2]:   File "/data/users/less/torchtitan/torchtitan/experiments/deepseek_v3/model.py", line 829, in moe_on_device
[rank2]:     token_gather_buf, output_splits = OnDeviceAllToAllV.apply(
[rank2]:                                       ^^^^^^^^^^^^^^^^^^^^^^^^
[rank2]:   File "/home/less/.conda/envs/pycutlass/lib/python3.12/site-packages/torch/autograd/function.py", line 576, in apply
[rank2]:     return super().apply(*args, **kwargs)  # type: ignore[misc]
[rank2]:            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank2]:   File "/data/users/less/torchtitan/torchtitan/experiments/deepseek_v3/symm_mem_recipes/triton_on_device_all_to_all_v.py", line 205, in forward
[rank2]:     _on_device_all_to_all_v(
[rank2]:   File "/data/users/less/torchtitan/torchtitan/experiments/deepseek_v3/symm_mem_recipes/triton_on_device_all_to_all_v.py", line 145, in _on_device_all_to_all_v
[rank2]:     kernel = on_device_all_to_all_v_kernel[(num_blocks, 1, 1)](
[rank2]:              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank2]:   File "/home/less/.conda/envs/pycutlass/lib/python3.12/site-packages/triton/runtime/jit.py", line 347, in <lambda>
[rank2]:     return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs)
[rank2]:                                    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank2]:   File "/home/less/.conda/envs/pycutlass/lib/python3.12/site-packages/triton/runtime/jit.py", line 591, in run
[rank2]:     kernel.run(grid_0, grid_1, grid_2, stream, kernel.function, kernel.packed_metadata,
[rank2]:   File "/home/less/.conda/envs/pycutlass/lib/python3.12/site-packages/triton/backends/nvidia/driver.py", line 529, in __call__
[rank2]:     self.launch(gridX, gridY, gridZ, stream, function, self.launch_cooperative_grid, global_scratch, *args)
[rank2]: ValueError: Pointer argument (at 0) cannot be accessed from Triton (cpu tensor?)
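
The error message itself suggests a CPU tensor as one possible cause. A minimal sketch for ruling that out before the kernel launch is below. The helper name `find_non_cuda_pointer_args` is hypothetical and not part of torchtitan or Triton; it only assumes that tensor-like arguments expose `data_ptr`, `is_cuda`, and `device`, as `torch.Tensor` does. If it reports nothing for the arguments passed to `on_device_all_to_all_v_kernel`, the cause is probably something other than a plain CPU tensor.

```python
def find_non_cuda_pointer_args(args):
    """Return (index, device) pairs for tensor-like args that are not on CUDA.

    Hypothetical debugging helper: anything with a `data_ptr` attribute is
    treated as a pointer argument; plain ints and floats are skipped.
    """
    offenders = []
    for index, arg in enumerate(args):
        if not hasattr(arg, "data_ptr"):
            continue  # scalar or constexpr argument, not a pointer
        if not getattr(arg, "is_cuda", False):
            offenders.append((index, str(getattr(arg, "device", "unknown"))))
    return offenders
```

Calling this on the same argument list right before the launch would show whether argument 0 really lives on the CPU or whether the message is misleading in this case.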

Versions

torchtitan main (May 26, 2025)
PyTorch nightly from May 15
B200 devserver
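
Since the behavior appears to depend on which nightly is installed, it may help to record exact package versions alongside reports. A small sketch follows; `report_versions` is a hypothetical helper, and it assumes only that each package exposes a `__version__` attribute, which `torch` and `triton` do.

```python
import importlib


def report_versions(names=("torch", "triton")):
    """Map each package name to its __version__, or note that it is missing."""
    versions = {}
    for name in names:
        try:
            module = importlib.import_module(name)
        except ImportError:
            versions[name] = "not installed"
        else:
            versions[name] = getattr(module, "__version__", "unknown")
    return versions


if __name__ == "__main__":
    for name, version in report_versions().items():
        print(f"{name}: {version}")
```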


lessw2020 · 2025-06-05T04:19:56Z

Good news: upgrading to PyTorch nightly 527 (from the earlier May 15 nightly) means Triton now compiles and works on Blackwell.

DeepSeek inference is thus able to run on Blackwell now.
Closing.

lessw2020 · 2025-06-08T20:32:50Z

Nuts, this has returned with the latest nightly.

lessw2020 closed this as completed Jun 5, 2025

lessw2020 reopened this Jun 8, 2025
