[kernels] Triton kernels (all2allV) do not work on B200 Blackwell "ValueError: Pointer argument (at 0) cannot be accessed from Triton (cpu tensor?)"
#1225
Running DeepSeek inference (which uses the all2allV Triton kernel) fails with an invalid pointer access.
This is likely specific to Blackwell, since the same code works fine on H100.
Full error:
Running inference with deepseek-ai/DeepSeek-V2-Lite-Chat on (1, 4) mesh
Creating model stage 0 of 1
Creating model stage 0 of 1
tokenizer_config.json: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 1.28k/1.28k [00:00<00:00, 11.6MB/s]
tokenizer.json: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 4.61M/4.61M [00:00<00:00, 39.2MB/s]
Generating: NCCL version 2.26.5+cuda12.9
[rank3]: Traceback (most recent call last):
[rank3]: File "/data/users/less/torchtitan/torchtitan/experiments/deepseek_v3/generate.py", line 377, in <module>
[rank3]: generate(model, pp_schedule, tokenizer, dist_config, messages)
[rank3]: File "/data/users/less/torchtitan/torchtitan/experiments/deepseek_v3/generate.py", line 198, in wrapper
[rank3]: result, tokens_generated = func(*args, **kwargs)
[rank3]: ^^^^^^^^^^^^^^^^^^^^^
[rank3]: File "/home/less/.conda/envs/pycutlass/lib/python3.12/site-packages/torch/utils/_contextlib.py", line 116, in decorate_context
[rank3]: return func(*args, **kwargs)
[rank3]: ^^^^^^^^^^^^^^^^^^^^^
[rank3]: File "/data/users/less/torchtitan/torchtitan/experiments/deepseek_v3/generate.py", line 281, in generate
[rank3]: preds = model(x)
[rank3]: ^^^^^^^^
[rank3]: File "/home/less/.conda/envs/pycutlass/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1767, in _wrapped_call_impl
[rank3]: return self._call_impl(*args, **kwargs)
[rank3]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank3]: File "/home/less/.conda/envs/pycutlass/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1778, in _call_impl
[rank3]: return forward_call(*args, **kwargs)
[rank3]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank3]: File "/data/users/less/torchtitan/torchtitan/experiments/deepseek_v3/model.py", line 1322, in forward
[rank3]: hidden_states = self.model(
[rank3]: ^^^^^^^^^^^
[rank3]: File "/home/less/.conda/envs/pycutlass/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1767, in _wrapped_call_impl
[rank3]: return self._call_impl(*args, **kwargs)
[rank3]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank3]: File "/home/less/.conda/envs/pycutlass/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1778, in _call_impl
[rank3]: return forward_call(*args, **kwargs)
[rank3]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank3]: File "/data/users/less/torchtitan/torchtitan/experiments/deepseek_v3/model.py", line 1274, in forward
[rank3]: hidden_states = decoder_layer(
[rank3]: ^^^^^^^^^^^^^^
[rank3]: File "/home/less/.conda/envs/pycutlass/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1767, in _wrapped_call_impl
[rank3]: return self._call_impl(*args, **kwargs)
[rank3]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank3]: File "/home/less/.conda/envs/pycutlass/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1778, in _call_impl
[rank3]: return forward_call(*args, **kwargs)
[rank3]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank3]: File "/data/users/less/torchtitan/torchtitan/experiments/deepseek_v3/model.py", line 1123, in forward
[rank3]: hidden_states = self.mlp(hidden_states)
[rank3]: ^^^^^^^^^^^^^^^^^^^^^^^
[rank3]: File "/home/less/.conda/envs/pycutlass/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1767, in _wrapped_call_impl
[rank3]: return self._call_impl(*args, **kwargs)
[rank3]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank3]: File "/home/less/.conda/envs/pycutlass/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1778, in _call_impl
[rank3]: return forward_call(*args, **kwargs)
[rank3]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank3]: File "/data/users/less/torchtitan/torchtitan/experiments/deepseek_v3/model.py", line 640, in forward
[rank3]: y = self.moe_on_device(hidden_states, topk_idx, topk_weight)
[rank3]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank3]: File "/data/users/less/torchtitan/torchtitan/experiments/deepseek_v3/model.py", line 829, in moe_on_device
[rank3]: token_gather_buf, output_splits = OnDeviceAllToAllV.apply(
[rank3]: ^^^^^^^^^^^^^^^^^^^^^^^^
[rank3]: File "/home/less/.conda/envs/pycutlass/lib/python3.12/site-packages/torch/autograd/function.py", line 576, in apply
[rank3]: return super().apply(*args, **kwargs) # type: ignore[misc]
[rank3]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank3]: File "/data/users/less/torchtitan/torchtitan/experiments/deepseek_v3/symm_mem_recipes/triton_on_device_all_to_all_v.py", line 205, in forward
[rank3]: _on_device_all_to_all_v(
[rank3]: File "/data/users/less/torchtitan/torchtitan/experiments/deepseek_v3/symm_mem_recipes/triton_on_device_all_to_all_v.py", line 145, in _on_device_all_to_all_v
[rank3]: kernel = on_device_all_to_all_v_kernel[(num_blocks, 1, 1)](
[rank3]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank3]: File "/home/less/.conda/envs/pycutlass/lib/python3.12/site-packages/triton/runtime/jit.py", line 347, in <lambda>
[rank3]: return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs)
[rank3]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank3]: File "/home/less/.conda/envs/pycutlass/lib/python3.12/site-packages/triton/runtime/jit.py", line 591, in run
[rank3]: kernel.run(grid_0, grid_1, grid_2, stream, kernel.function, kernel.packed_metadata,
[rank3]: File "/home/less/.conda/envs/pycutlass/lib/python3.12/site-packages/triton/backends/nvidia/driver.py", line 529, in __call__
[rank3]: self.launch(gridX, gridY, gridZ, stream, function, self.launch_cooperative_grid, global_scratch, *args)
[rank3]: ValueError: Pointer argument (at 0) cannot be accessed from Triton (cpu tensor?)
[rank1]: Traceback (most recent call last):
[rank1]: File "/data/users/less/torchtitan/torchtitan/experiments/deepseek_v3/generate.py", line 377, in <module>
[rank1]: generate(model, pp_schedule, tokenizer, dist_config, messages)
[rank1]: File "/data/users/less/torchtitan/torchtitan/experiments/deepseek_v3/generate.py", line 198, in wrapper
[rank1]: result, tokens_generated = func(*args, **kwargs)
[rank1]: ^^^^^^^^^^^^^^^^^^^^^
[rank1]: File "/home/less/.conda/envs/pycutlass/lib/python3.12/site-packages/torch/utils/_contextlib.py", line 116, in decorate_context
[rank1]: return func(*args, **kwargs)
[rank1]: ^^^^^^^^^^^^^^^^^^^^^
[rank1]: File "/data/users/less/torchtitan/torchtitan/experiments/deepseek_v3/generate.py", line 281, in generate
[rank1]: preds = model(x)
[rank1]: ^^^^^^^^
[rank1]: File "/home/less/.conda/envs/pycutlass/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1767, in _wrapped_call_impl
[rank1]: return self._call_impl(*args, **kwargs)
[rank1]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank1]: File "/home/less/.conda/envs/pycutlass/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1778, in _call_impl
[rank1]: return forward_call(*args, **kwargs)
[rank1]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank1]: File "/data/users/less/torchtitan/torchtitan/experiments/deepseek_v3/model.py", line 1322, in forward
[rank1]: hidden_states = self.model(
[rank1]: ^^^^^^^^^^^
[rank1]: File "/home/less/.conda/envs/pycutlass/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1767, in _wrapped_call_impl
[rank1]: return self._call_impl(*args, **kwargs)
[rank1]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank1]: File "/home/less/.conda/envs/pycutlass/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1778, in _call_impl
[rank1]: return forward_call(*args, **kwargs)
[rank1]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank1]: File "/data/users/less/torchtitan/torchtitan/experiments/deepseek_v3/model.py", line 1274, in forward
[rank1]: hidden_states = decoder_layer(
[rank1]: ^^^^^^^^^^^^^^
[rank1]: File "/home/less/.conda/envs/pycutlass/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1767, in _wrapped_call_impl
[rank1]: return self._call_impl(*args, **kwargs)
[rank1]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank1]: File "/home/less/.conda/envs/pycutlass/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1778, in _call_impl
[rank1]: return forward_call(*args, **kwargs)
[rank1]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank1]: File "/data/users/less/torchtitan/torchtitan/experiments/deepseek_v3/model.py", line 1123, in forward
[rank1]: hidden_states = self.mlp(hidden_states)
[rank1]: ^^^^^^^^^^^^^^^^^^^^^^^
[rank1]: File "/home/less/.conda/envs/pycutlass/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1767, in _wrapped_call_impl
[rank1]: return self._call_impl(*args, **kwargs)
[rank1]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank1]: File "/home/less/.conda/envs/pycutlass/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1778, in _call_impl
[rank1]: return forward_call(*args, **kwargs)
[rank1]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank1]: File "/data/users/less/torchtitan/torchtitan/experiments/deepseek_v3/model.py", line 640, in forward
[rank1]: y = self.moe_on_device(hidden_states, topk_idx, topk_weight)
[rank1]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank1]: File "/data/users/less/torchtitan/torchtitan/experiments/deepseek_v3/model.py", line 829, in moe_on_device
[rank1]: token_gather_buf, output_splits = OnDeviceAllToAllV.apply(
[rank1]: ^^^^^^^^^^^^^^^^^^^^^^^^
[rank1]: File "/home/less/.conda/envs/pycutlass/lib/python3.12/site-packages/torch/autograd/function.py", line 576, in apply
[rank1]: return super().apply(*args, **kwargs) # type: ignore[misc]
[rank1]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank1]: File "/data/users/less/torchtitan/torchtitan/experiments/deepseek_v3/symm_mem_recipes/triton_on_device_all_to_all_v.py", line 205, in forward
[rank1]: _on_device_all_to_all_v(
[rank1]: File "/data/users/less/torchtitan/torchtitan/experiments/deepseek_v3/symm_mem_recipes/triton_on_device_all_to_all_v.py", line 145, in _on_device_all_to_all_v
[rank1]: kernel = on_device_all_to_all_v_kernel[(num_blocks, 1, 1)](
[rank1]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank1]: File "/home/less/.conda/envs/pycutlass/lib/python3.12/site-packages/triton/runtime/jit.py", line 347, in <lambda>
[rank1]: return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs)
[rank1]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank1]: File "/home/less/.conda/envs/pycutlass/lib/python3.12/site-packages/triton/runtime/jit.py", line 591, in run
[rank1]: kernel.run(grid_0, grid_1, grid_2, stream, kernel.function, kernel.packed_metadata,
[rank1]: File "/home/less/.conda/envs/pycutlass/lib/python3.12/site-packages/triton/backends/nvidia/driver.py", line 529, in __call__
[rank1]: self.launch(gridX, gridY, gridZ, stream, function, self.launch_cooperative_grid, global_scratch, *args)
[rank1]: ValueError: Pointer argument (at 0) cannot be accessed from Triton (cpu tensor?)
[rank2]: Traceback (most recent call last):
[rank2]: File "/data/users/less/torchtitan/torchtitan/experiments/deepseek_v3/generate.py", line 377, in <module>
[rank2]: generate(model, pp_schedule, tokenizer, dist_config, messages)
[rank2]: File "/data/users/less/torchtitan/torchtitan/experiments/deepseek_v3/generate.py", line 198, in wrapper
[rank2]: result, tokens_generated = func(*args, **kwargs)
[rank2]: ^^^^^^^^^^^^^^^^^^^^^
[rank2]: File "/home/less/.conda/envs/pycutlass/lib/python3.12/site-packages/torch/utils/_contextlib.py", line 116, in decorate_context
[rank2]: return func(*args, **kwargs)
[rank2]: ^^^^^^^^^^^^^^^^^^^^^
[rank2]: File "/data/users/less/torchtitan/torchtitan/experiments/deepseek_v3/generate.py", line 281, in generate
[rank2]: preds = model(x)
[rank2]: ^^^^^^^^
[rank2]: File "/home/less/.conda/envs/pycutlass/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1767, in _wrapped_call_impl
[rank2]: return self._call_impl(*args, **kwargs)
[rank2]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank2]: File "/home/less/.conda/envs/pycutlass/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1778, in _call_impl
[rank2]: return forward_call(*args, **kwargs)
[rank2]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank2]: File "/data/users/less/torchtitan/torchtitan/experiments/deepseek_v3/model.py", line 1322, in forward
[rank2]: hidden_states = self.model(
[rank2]: ^^^^^^^^^^^
[rank2]: File "/home/less/.conda/envs/pycutlass/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1767, in _wrapped_call_impl
[rank2]: return self._call_impl(*args, **kwargs)
[rank2]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank2]: File "/home/less/.conda/envs/pycutlass/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1778, in _call_impl
[rank2]: return forward_call(*args, **kwargs)
[rank2]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank2]: File "/data/users/less/torchtitan/torchtitan/experiments/deepseek_v3/model.py", line 1274, in forward
[rank2]: hidden_states = decoder_layer(
[rank2]: ^^^^^^^^^^^^^^
[rank2]: File "/home/less/.conda/envs/pycutlass/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1767, in _wrapped_call_impl
[rank2]: return self._call_impl(*args, **kwargs)
[rank2]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank2]: File "/home/less/.conda/envs/pycutlass/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1778, in _call_impl
[rank2]: return forward_call(*args, **kwargs)
[rank2]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank2]: File "/data/users/less/torchtitan/torchtitan/experiments/deepseek_v3/model.py", line 1123, in forward
[rank2]: hidden_states = self.mlp(hidden_states)
[rank2]: ^^^^^^^^^^^^^^^^^^^^^^^
[rank2]: File "/home/less/.conda/envs/pycutlass/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1767, in _wrapped_call_impl
[rank2]: return self._call_impl(*args, **kwargs)
[rank2]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank2]: File "/home/less/.conda/envs/pycutlass/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1778, in _call_impl
[rank2]: return forward_call(*args, **kwargs)
[rank2]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank2]: File "/data/users/less/torchtitan/torchtitan/experiments/deepseek_v3/model.py", line 640, in forward
[rank2]: y = self.moe_on_device(hidden_states, topk_idx, topk_weight)
[rank2]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank2]: File "/data/users/less/torchtitan/torchtitan/experiments/deepseek_v3/model.py", line 829, in moe_on_device
[rank2]: token_gather_buf, output_splits = OnDeviceAllToAllV.apply(
[rank2]: ^^^^^^^^^^^^^^^^^^^^^^^^
[rank2]: File "/home/less/.conda/envs/pycutlass/lib/python3.12/site-packages/torch/autograd/function.py", line 576, in apply
[rank2]: return super().apply(*args, **kwargs) # type: ignore[misc]
[rank2]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank2]: File "/data/users/less/torchtitan/torchtitan/experiments/deepseek_v3/symm_mem_recipes/triton_on_device_all_to_all_v.py", line 205, in forward
[rank2]: _on_device_all_to_all_v(
[rank2]: File "/data/users/less/torchtitan/torchtitan/experiments/deepseek_v3/symm_mem_recipes/triton_on_device_all_to_all_v.py", line 145, in _on_device_all_to_all_v
[rank2]: kernel = on_device_all_to_all_v_kernel[(num_blocks, 1, 1)](
[rank2]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank2]: File "/home/less/.conda/envs/pycutlass/lib/python3.12/site-packages/triton/runtime/jit.py", line 347, in <lambda>
[rank2]: return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs)
[rank2]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank2]: File "/home/less/.conda/envs/pycutlass/lib/python3.12/site-packages/triton/runtime/jit.py", line 591, in run
[rank2]: kernel.run(grid_0, grid_1, grid_2, stream, kernel.function, kernel.packed_metadata,
[rank2]: File "/home/less/.conda/envs/pycutlass/lib/python3.12/site-packages/triton/backends/nvidia/driver.py", line 529, in __call__
[rank2]: self.launch(gridX, gridY, gridZ, stream, function, self.launch_cooperative_grid, global_scratch, *args)
[rank2]: ValueError: Pointer argument (at 0) cannot be accessed from Triton (cpu tensor?)
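The error text hints at a CPU tensor, so one cheap thing to rule out first is whether any tensor handed to the kernel launch is actually off-device. The following is only a minimal sketch for that check: `find_non_cuda_args` and the argument names in the example are hypothetical and not part of torchtitan or Triton. It relies solely on the `is_cuda` attribute that `torch.Tensor` exposes. If every argument passes, the cause is more likely in how the pointer was allocated or registered on this hardware, which this check cannot diagnose.

```python
from types import SimpleNamespace


def find_non_cuda_args(named_args):
    """Return the names of tensor-like arguments that are not on a CUDA device.

    Anything exposing an `is_cuda` attribute (torch.Tensor does) is checked.
    Plain Python scalars such as ints have no such attribute and are skipped.
    """
    offenders = []
    for name, value in named_args.items():
        is_cuda = getattr(value, "is_cuda", None)
        if is_cuda is None:
            continue  # not tensor-like, e.g. a scalar kernel argument
        if not is_cuda:
            offenders.append(name)
    return offenders


# Stand-ins for tensors so the sketch runs without a GPU.
# The argument names are hypothetical, not the kernel's real signature.
example_args = {
    "output_ptr": SimpleNamespace(is_cuda=False),
    "input_ptr": SimpleNamespace(is_cuda=True),
    "num_blocks": 4,
}
print(find_non_cuda_args(example_args))  # ['output_ptr']
```

In a real run, the dictionary would hold the actual tensors passed at the launch site in `_on_device_all_to_all_v`, and the check would run just before the kernel call.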
Versions
torchtitan main (May 26, 2025)
torch nightly (May 15)
B200 devserver