[kernels] Triton kernels (all2allV) do not work on B200 Blackwell "ValueError: Pointer argument (at 0) cannot be accessed from Triton (cpu tensor?)" #1225


Open
lessw2020 opened this issue May 26, 2025 · 2 comments

Comments

@lessw2020
Contributor

Bug description

Running DeepSeek inference (which uses the all2allV Triton kernel) fails with an invalid pointer access.
This is likely Blackwell-specific, as the same code works fine on H100.
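The launcher error means argument 0 of the kernel was not a valid device pointer - typically a tensor that ended up on CPU, or in an allocation Triton cannot see. A minimal pre-launch check along these lines can pinpoint which argument is at fault; `check_triton_args` is a hypothetical debugging helper, not part of torchtitan or Triton:

```python
def check_triton_args(*args):
    """Flag non-CUDA tensor arguments before a Triton kernel launch.

    Triton's NVIDIA launcher raises "Pointer argument (at N) cannot be
    accessed from Triton (cpu tensor?)" when argument N is not a valid
    device pointer. This sketch duck-types on the `is_cuda` attribute so
    it works with torch.Tensor (or any tensor-like stand-in).
    """
    for i, a in enumerate(args):
        if hasattr(a, "is_cuda") and not a.is_cuda:
            raise ValueError(
                f"Triton kernel argument {i} is on "
                f"{getattr(a, 'device', 'cpu')}; expected a CUDA tensor"
            )
```

Dropping a call like this just before the `on_device_all_to_all_v_kernel[(num_blocks, 1, 1)](...)` launch in `triton_on_device_all_to_all_v.py` would narrow down whether the symmetric-memory input buffer itself is the offending pointer, or whether the failure happens only inside the launcher.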

Full error:

Running inference with deepseek-ai/DeepSeek-V2-Lite-Chat on (1, 4) mesh
Creating model stage 0 of 1
Creating model stage 0 of 1
tokenizer_config.json: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 1.28k/1.28k [00:00<00:00, 11.6MB/s]
tokenizer.json: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 4.61M/4.61M [00:00<00:00, 39.2MB/s]
Generating: NCCL version 2.26.5+cuda12.9
[rank3]: Traceback (most recent call last):
[rank3]:   File "/data/users/less/torchtitan/torchtitan/experiments/deepseek_v3/generate.py", line 377, in <module>
[rank3]:     generate(model, pp_schedule, tokenizer, dist_config, messages)
[rank3]:   File "/data/users/less/torchtitan/torchtitan/experiments/deepseek_v3/generate.py", line 198, in wrapper
[rank3]:     result, tokens_generated = func(*args, **kwargs)
[rank3]:                                ^^^^^^^^^^^^^^^^^^^^^
[rank3]:   File "/home/less/.conda/envs/pycutlass/lib/python3.12/site-packages/torch/utils/_contextlib.py", line 116, in decorate_context
[rank3]:     return func(*args, **kwargs)
[rank3]:            ^^^^^^^^^^^^^^^^^^^^^
[rank3]:   File "/data/users/less/torchtitan/torchtitan/experiments/deepseek_v3/generate.py", line 281, in generate
[rank3]:     preds = model(x)
[rank3]:             ^^^^^^^^
[rank3]:   File "/home/less/.conda/envs/pycutlass/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1767, in _wrapped_call_impl
[rank3]:     return self._call_impl(*args, **kwargs)
[rank3]:            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank3]:   File "/home/less/.conda/envs/pycutlass/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1778, in _call_impl
[rank3]:     return forward_call(*args, **kwargs)
[rank3]:            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank3]:   File "/data/users/less/torchtitan/torchtitan/experiments/deepseek_v3/model.py", line 1322, in forward
[rank3]:     hidden_states = self.model(
[rank3]:                     ^^^^^^^^^^^
[rank3]:   File "/home/less/.conda/envs/pycutlass/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1767, in _wrapped_call_impl
[rank3]:     return self._call_impl(*args, **kwargs)
[rank3]:            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank3]:   File "/home/less/.conda/envs/pycutlass/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1778, in _call_impl
[rank3]:     return forward_call(*args, **kwargs)
[rank3]:            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank3]:   File "/data/users/less/torchtitan/torchtitan/experiments/deepseek_v3/model.py", line 1274, in forward
[rank3]:     hidden_states = decoder_layer(
[rank3]:                     ^^^^^^^^^^^^^^
[rank3]:   File "/home/less/.conda/envs/pycutlass/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1767, in _wrapped_call_impl
[rank3]:     return self._call_impl(*args, **kwargs)
[rank3]:            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank3]:   File "/home/less/.conda/envs/pycutlass/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1778, in _call_impl
[rank3]:     return forward_call(*args, **kwargs)
[rank3]:            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank3]:   File "/data/users/less/torchtitan/torchtitan/experiments/deepseek_v3/model.py", line 1123, in forward
[rank3]:     hidden_states = self.mlp(hidden_states)
[rank3]:                     ^^^^^^^^^^^^^^^^^^^^^^^
[rank3]:   File "/home/less/.conda/envs/pycutlass/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1767, in _wrapped_call_impl
[rank3]:     return self._call_impl(*args, **kwargs)
[rank3]:            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank3]:   File "/home/less/.conda/envs/pycutlass/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1778, in _call_impl
[rank3]:     return forward_call(*args, **kwargs)
[rank3]:            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank3]:   File "/data/users/less/torchtitan/torchtitan/experiments/deepseek_v3/model.py", line 640, in forward
[rank3]:     y = self.moe_on_device(hidden_states, topk_idx, topk_weight)
[rank3]:         ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank3]:   File "/data/users/less/torchtitan/torchtitan/experiments/deepseek_v3/model.py", line 829, in moe_on_device
[rank3]:     token_gather_buf, output_splits = OnDeviceAllToAllV.apply(
[rank3]:                                       ^^^^^^^^^^^^^^^^^^^^^^^^
[rank3]:   File "/home/less/.conda/envs/pycutlass/lib/python3.12/site-packages/torch/autograd/function.py", line 576, in apply
[rank3]:     return super().apply(*args, **kwargs)  # type: ignore[misc]
[rank3]:            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank3]:   File "/data/users/less/torchtitan/torchtitan/experiments/deepseek_v3/symm_mem_recipes/triton_on_device_all_to_all_v.py", line 205, in forward
[rank3]:     _on_device_all_to_all_v(
[rank3]:   File "/data/users/less/torchtitan/torchtitan/experiments/deepseek_v3/symm_mem_recipes/triton_on_device_all_to_all_v.py", line 145, in _on_device_all_to_all_v
[rank3]:     kernel = on_device_all_to_all_v_kernel[(num_blocks, 1, 1)](
[rank3]:              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank3]:   File "/home/less/.conda/envs/pycutlass/lib/python3.12/site-packages/triton/runtime/jit.py", line 347, in <lambda>
[rank3]:     return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs)
[rank3]:                                    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank3]:   File "/home/less/.conda/envs/pycutlass/lib/python3.12/site-packages/triton/runtime/jit.py", line 591, in run
[rank3]:     kernel.run(grid_0, grid_1, grid_2, stream, kernel.function, kernel.packed_metadata,
[rank3]:   File "/home/less/.conda/envs/pycutlass/lib/python3.12/site-packages/triton/backends/nvidia/driver.py", line 529, in __call__
[rank3]:     self.launch(gridX, gridY, gridZ, stream, function, self.launch_cooperative_grid, global_scratch, *args)
[rank3]: ValueError: Pointer argument (at 0) cannot be accessed from Triton (cpu tensor?)
[ranks 1 and 2 fail with identical tracebacks and the same ValueError]

Versions

torchtitan main - May 26, 2025
PyTorch nightly, May 15 build
B200 devserver

@lessw2020
Contributor Author

Good news - upgrading to PyTorch nightly 527 (vs. the earlier May 15 build) means Triton is now compiling and working on Blackwell.

DeepSeek inference is thus able to run on Blackwell.
Closing.

@lessw2020
Contributor Author

Nuts - this has returned with the latest nightly.

@lessw2020 lessw2020 reopened this Jun 8, 2025