
RMSNorm failed with error code no kernel image is available for execution on the device #920

Open
nanmi opened this issue Mar 7, 2025 · 2 comments


nanmi commented Mar 7, 2025

[TensorRT-LLM] TensorRT-LLM version: 0.18.0.dev2025022500
[TensorRT-LLM][INFO] Refreshed the MPI local session
Model init total -- 387.22s
2025-03-07 11:03:44,669 - INFO - flashinfer.jit: Loading JIT ops: norm
2025-03-07 11:03:44,689 - INFO - flashinfer.jit: Finished loading JIT ops: norm
CUDA Error: no kernel image is available for execution on the device (209) /usr/local/lib/python3.12/dist-packages/flashinfer/data/include/flashinfer/norm.cuh: line 113 at function cudaLaunchKernel((void*)kernel, nblks, nthrs, args, smem_size, stream)
[03/07/2025-11:03:44] [TRT-LLM] [E] Failed to initialize executor on rank 6: RMSNorm failed with error code no kernel image is available for execution on the device
[ERROR    | TRT-LLM            ]: [TRT-LLM] [E] Failed to initialize executor on rank 6: RMSNorm failed with error code no kernel image is available for execution on the device
[03/07/2025-11:03:44] [TRT-LLM] [E] Traceback (most recent call last):

  File "/usr/local/lib/python3.12/dist-packages/tensorrt_llm/executor/worker.py", line 581, in worker_main
    worker: ExecutorBindingsWorker = worker_cls(
                                     ^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/tensorrt_llm/executor/worker.py", line 126, in __init__
    self.engine = _create_engine()
                  ^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/tensorrt_llm/executor/worker.py", line 122, in _create_engine
    return create_executor(executor_config=executor_config,
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/tensorrt_llm/_torch/pyexecutor/py_executor_creator.py", line 106, in create_py_executor
    kv_cache_max_tokens = estimate_max_kv_cache_tokens(model_engine,
                          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/tensorrt_llm/_torch/pyexecutor/_util.py", line 118, in estimate_max_kv_cache_tokens
    model_engine.forward(req, resource_manager)
  File "/usr/local/lib/python3.12/dist-packages/torch/utils/_contextlib.py", line 116, in decorate_context
    return func(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/tensorrt_llm/_torch/pyexecutor/model_engine.py", line 990, in forward
    return self._forward_step(inputs, gather_ids)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/lib/python3.12/contextlib.py", line 81, in inner
    return func(*args, **kwds)
           ^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/tensorrt_llm/_torch/pyexecutor/model_engine.py", line 1030, in _forward_step
    logits = self.model.forward(**inputs,
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/tensorrt_llm/_torch/models/modeling_utils.py", line 234, in forward
    hidden_states = self.model(
                    ^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py", line 1740, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py", line 1751, in _call_impl
    return forward_call(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/tensorrt_llm/_torch/models/modeling_deepseekv3.py", line 379, in forward
    hidden_states, residual = decoder_layer(position_ids=position_ids,
                              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py", line 1740, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py", line 1751, in _call_impl
    return forward_call(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/tensorrt_llm/_torch/models/modeling_deepseekv3.py", line 319, in forward
    hidden_states = self.input_layernorm(hidden_states)
                    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py", line 1740, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py", line 1751, in _call_impl
    return forward_call(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/tensorrt_llm/_torch/modules/rms_norm.py", line 32, in forward
    return flashinfer_rmsnorm(hidden_states, self.weight,
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/torch/_library/custom_ops.py", line 637, in __call__
    return self._opoverload(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/torch/_ops.py", line 723, in __call__
    return self._op(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/torch/_library/custom_ops.py", line 305, in backend_impl
    result = self._backend_fns[device_type](*args, **kwargs)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/torch/_compile.py", line 32, in inner
    return disable_fn(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/torch/_dynamo/eval_frame.py", line 738, in _fn
    return fn(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/torch/_library/custom_ops.py", line 337, in wrapped_fn
    return fn(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/tensorrt_llm/_torch/custom_op.py", line 237, in flashinfer_rmsnorm
    return rmsnorm(input, weight, eps)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/flashinfer/norm.py", line 73, in rmsnorm
    _rmsnorm(out, input, weight, eps)
  File "/usr/local/lib/python3.12/dist-packages/flashinfer/norm.py", line 82, in _rmsnorm
    get_norm_module().rmsnorm(out, input, weight, eps, get_cuda_stream(device))
  File "/usr/local/lib/python3.12/dist-packages/torch/_ops.py", line 1123, in __call__
    return self._op(*args, **(kwargs or {}))
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/torch/utils/_device.py", line 104, in __torch_function__
    return func(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/torch/_ops.py", line 1123, in __call__
    return self._op(*args, **(kwargs or {}))
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
RuntimeError: RMSNorm failed with error code no kernel image is available for execution on the device

CUDA: 12.8
Pytorch: 2.6.0+cu128
GPU: H20
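
For reference, a quick sanity check of what the device reports versus what the local PyTorch build targets (a minimal diagnostic sketch, assuming a standard CUDA-enabled PyTorch install; the flashinfer norm kernel is JIT-compiled separately, so this only confirms the device side):

```python
import torch

# H20 is a Hopper part and should report compute capability (9, 0), i.e. sm_90.
print("device:", torch.cuda.get_device_name(0))
print("compute capability:", torch.cuda.get_device_capability(0))

# Architectures the installed PyTorch wheel itself was compiled for.
print("torch arch list:", torch.cuda.get_arch_list())
```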

yzh119 commented Mar 8, 2025

It seems the H20's CUDA architecture is not recognized. Can you specify the environment variable:

export TORCH_CUDA_ARCH_LIST=9.0
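
If it helps, here is a minimal sketch of applying the flag programmatically before the norm op is first loaded (an assumption-laden example: it relies on flashinfer's JIT build honoring TORCH_CUDA_ARCH_LIST at compile time, and the variable has to be visible to every MPI rank, not just rank 0):

```python
import os

# Must be set before the flashinfer "norm" JIT op is compiled/loaded.
# H20 is sm_90, hence "9.0".
os.environ["TORCH_CUDA_ARCH_LIST"] = "9.0"

import flashinfer  # noqa: E402 -- imported only after the env var is in place
```

If a previously built kernel is cached (by default somewhere under ~/.cache/flashinfer, if I remember correctly), clearing that directory forces a rebuild with the new architecture flag.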


nanmi commented Mar 10, 2025

> It seems the H20's CUDA architecture is not recognized. Can you specify the environment variable:
>
> export TORCH_CUDA_ARCH_LIST=9.0

That does not work. I have tried setting the environment variable, and additionally tried both the JIT and AOT builds.
