
Cannot serve modelopt quantized nvfp4 model on TensorRT LLM #187

Open
enisaras opened this issue Apr 27, 2025 · 2 comments
Assignees
Labels
bug Something isn't working

Comments

@enisaras

Describe the bug

After quantizing the Llama-3.1-70B-Instruct model with the modelopt hf_ptq script, serving the resulting engine on TensorRT-LLM fails with the following error:

[TensorRT-LLM][INFO] [MemUsageChange] TensorRT-managed allocation in IExecutionContext creation: CPU +0, GPU +0, now: CPU 0, GPU 40794 (MiB)
[TensorRT-LLM][INFO] [MemUsageChange] Allocated 5.49 GB GPU memory for runtime buffers.
[TensorRT-LLM][INFO] [MemUsageChange] Allocated 6.54 GB GPU memory for decoder.
terminate called after throwing an instance of 'tensorrt_llm::common::TllmException'
  what():  Cannot determine size of FP4 data type (/code/tensorrt_llm/cpp/include/tensorrt_llm/common/dataType.h:40)
1       0x7f973fe01d33 /root/.local/lib/python3.12/site-packages/tensorrt_llm/libs/libtensorrt_llm.so(+0xa1dd33) [0x7f973fe01d33]

Steps/Code to reproduce bug

  1. Quantize Llama-3.1-70B-Instruct using the helper script in the modelopt repository:
python hf_ptq.py --pyt_ckpt_path meta-llama/Llama-3.1-70B-Instruct --qformat nvfp4 --batch_size 32 --kv_cache_qformat nvfp4
  2. Build the TensorRT-LLM engine with the trtllm-build command (an explicit output-directory variant is sketched after these steps):
trtllm-build --checkpoint-dir exported_model/
  3. Serve the engine built in step 2:
trtllm-serve serve engine_outputs/ --tokenizer meta-llama/Llama-3.1-70B-Instruct --log_level debug --tp_size 2
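
For reference, step 3 serves engine_outputs/, which is presumably where trtllm-build wrote the engine by default. A sketch of the build command with the output directory spelled out explicitly (assuming the standard --output_dir flag of trtllm-build) would be:

# Assumes trtllm-build's --output_dir flag; engine_outputs/ is then the directory served in step 3.
trtllm-build --checkpoint-dir exported_model/ --output_dir engine_outputs/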

Step 3 fails with the following logs:

2025-04-27 18:23:15,796 - INFO - flashinfer.jit: Prebuilt kernels not found, using JIT backend
[TensorRT-LLM] TensorRT-LLM version: 0.20.0rc1
[04/27/2025-18:23:16] [TRT-LLM] [W] Overriding LlmArgs.max_input_len (annotation=int required=False default=1024 description='The maximum input length.') with build_config.max_input_len (1024).
[04/27/2025-18:23:16] [TRT-LLM] [W] Overriding LlmArgs.max_seq_len (annotation=Union[int, NoneType] required=False default=None description='The maximum sequence length.') with build_config.max_seq_len (None).
[04/27/2025-18:23:16] [TRT-LLM] [W] Overriding LlmArgs.max_beam_width (annotation=int required=False default=1 description='The maximum beam width.') with build_config.max_beam_width (1).
[04/27/2025-18:23:16] [TRT-LLM] [I] Compute capability: (10, 0)
[04/27/2025-18:23:16] [TRT-LLM] [I] SM count: 148
[04/27/2025-18:23:16] [TRT-LLM] [I] SM clock: 1965 MHz
[04/27/2025-18:23:16] [TRT-LLM] [I] int4 TFLOPS: 0
[04/27/2025-18:23:16] [TRT-LLM] [I] int8 TFLOPS: 0
[04/27/2025-18:23:16] [TRT-LLM] [I] fp8 TFLOPS: 0
[04/27/2025-18:23:16] [TRT-LLM] [I] float16 TFLOPS: 0
[04/27/2025-18:23:16] [TRT-LLM] [I] bfloat16 TFLOPS: 0
[04/27/2025-18:23:16] [TRT-LLM] [I] float32 TFLOPS: 0
[04/27/2025-18:23:16] [TRT-LLM] [I] Total Memory: 179 GiB
[04/27/2025-18:23:16] [TRT-LLM] [I] Memory clock: 3996 MHz
[04/27/2025-18:23:16] [TRT-LLM] [I] Memory bus width: 7680
[04/27/2025-18:23:16] [TRT-LLM] [I] Memory bandwidth: 7672 GB/s
[04/27/2025-18:23:16] [TRT-LLM] [I] NVLink is active: True
[04/27/2025-18:23:16] [TRT-LLM] [I] NVLink version: 4
[04/27/2025-18:23:16] [TRT-LLM] [I] NVLink bandwidth: 450 GB/s
[04/27/2025-18:23:16] [TRT-LLM] [W] Implicitly setting LLaMAConfig.fc_after_embed = False
[04/27/2025-18:23:16] [TRT-LLM] [W] Implicitly setting LLaMAConfig.use_input_layernorm_in_first_layer = True
[04/27/2025-18:23:16] [TRT-LLM] [W] Implicitly setting LLaMAConfig.use_last_layernorm = True
[04/27/2025-18:23:16] [TRT-LLM] [W] Implicitly setting LLaMAConfig.layer_idx_offset = 0
[04/27/2025-18:23:16] [TRT-LLM] [W] Implicitly setting LLaMAConfig.has_partial_lora_mask = False
[04/27/2025-18:23:16] [TRT-LLM] [W] Implicitly setting LLaMAConfig.producer = {'name': 'modelopt', 'version': '0.27.1'}
[04/27/2025-18:23:16] [TRT-LLM] [W] Implicitly setting LLaMAConfig.share_embedding_table = False
[04/27/2025-18:23:16] [TRT-LLM] [W] Implicitly setting LLaMAConfig.bias = False
[04/27/2025-18:23:16] [TRT-LLM] [W] Implicitly setting LLaMAConfig.rotary_pct = 1.0
[04/27/2025-18:23:16] [TRT-LLM] [W] Implicitly setting LLaMAConfig.rank = 0
[04/27/2025-18:23:16] [TRT-LLM] [W] Implicitly setting LLaMAConfig.decoder = llama
[04/27/2025-18:23:16] [TRT-LLM] [W] Implicitly setting LLaMAConfig.rmsnorm = True
[04/27/2025-18:23:16] [TRT-LLM] [W] Implicitly setting LLaMAConfig.lm_head_bias = False
[04/27/2025-18:23:16] [TRT-LLM] [W] Implicitly setting LLaMAConfig.tie_word_embeddings = False
[04/27/2025-18:23:16] [TRT-LLM] [I] Set dtype to bfloat16.
[04/27/2025-18:23:16] [TRT-LLM] [I] Set bert_attention_plugin to auto.
[04/27/2025-18:23:16] [TRT-LLM] [I] Set gpt_attention_plugin to auto.
[04/27/2025-18:23:16] [TRT-LLM] [I] Set gemm_plugin to nvfp4.
[04/27/2025-18:23:16] [TRT-LLM] [I] Set explicitly_disable_gemm_plugin to False.
[04/27/2025-18:23:16] [TRT-LLM] [I] Set gemm_swiglu_plugin to None.
[04/27/2025-18:23:16] [TRT-LLM] [I] Set fp8_rowwise_gemm_plugin to None.
[04/27/2025-18:23:16] [TRT-LLM] [I] Set qserve_gemm_plugin to None.
[04/27/2025-18:23:16] [TRT-LLM] [I] Set identity_plugin to None.
[04/27/2025-18:23:16] [TRT-LLM] [I] Set nccl_plugin to None.
[04/27/2025-18:23:16] [TRT-LLM] [I] Set lora_plugin to None.
[04/27/2025-18:23:16] [TRT-LLM] [I] Set dora_plugin to False.
[04/27/2025-18:23:16] [TRT-LLM] [I] Set weight_only_groupwise_quant_matmul_plugin to None.
[04/27/2025-18:23:16] [TRT-LLM] [I] Set weight_only_quant_matmul_plugin to None.
[04/27/2025-18:23:16] [TRT-LLM] [I] Set smooth_quant_plugins to True.
[04/27/2025-18:23:16] [TRT-LLM] [I] Set smooth_quant_gemm_plugin to None.
[04/27/2025-18:23:16] [TRT-LLM] [I] Set layernorm_quantization_plugin to None.
[04/27/2025-18:23:16] [TRT-LLM] [I] Set rmsnorm_quantization_plugin to None.
[04/27/2025-18:23:16] [TRT-LLM] [I] Set quantize_per_token_plugin to False.
[04/27/2025-18:23:16] [TRT-LLM] [I] Set quantize_tensor_plugin to False.
[04/27/2025-18:23:16] [TRT-LLM] [I] Set moe_plugin to auto.
[04/27/2025-18:23:16] [TRT-LLM] [I] Set mamba_conv1d_plugin to auto.
[04/27/2025-18:23:16] [TRT-LLM] [I] Set low_latency_gemm_plugin to None.
[04/27/2025-18:23:16] [TRT-LLM] [I] Set low_latency_gemm_swiglu_plugin to None.
[04/27/2025-18:23:16] [TRT-LLM] [I] Set gemm_allreduce_plugin to None.
[04/27/2025-18:23:16] [TRT-LLM] [I] Set context_fmha to True.
[04/27/2025-18:23:16] [TRT-LLM] [I] Set bert_context_fmha_fp32_acc to False.
[04/27/2025-18:23:16] [TRT-LLM] [I] Set paged_kv_cache to True.
[04/27/2025-18:23:16] [TRT-LLM] [I] Set remove_input_padding to True.
[04/27/2025-18:23:16] [TRT-LLM] [I] Set norm_quant_fusion to False.
[04/27/2025-18:23:16] [TRT-LLM] [I] Set reduce_fusion to False.
[04/27/2025-18:23:16] [TRT-LLM] [I] Set user_buffer to False.
[04/27/2025-18:23:16] [TRT-LLM] [I] Set tokens_per_block to 32.
[04/27/2025-18:23:16] [TRT-LLM] [I] Set use_paged_context_fmha to True.
[04/27/2025-18:23:16] [TRT-LLM] [I] Set use_fp8_context_fmha to False.
[04/27/2025-18:23:16] [TRT-LLM] [I] Set fuse_fp4_quant to False.
[04/27/2025-18:23:16] [TRT-LLM] [I] Set multiple_profiles to False.
[04/27/2025-18:23:16] [TRT-LLM] [I] Set paged_state to False.
[04/27/2025-18:23:16] [TRT-LLM] [I] Set streamingllm to False.
[04/27/2025-18:23:16] [TRT-LLM] [I] Set manage_weights to False.
[04/27/2025-18:23:16] [TRT-LLM] [I] Set use_fused_mlp to True.
[04/27/2025-18:23:16] [TRT-LLM] [I] Set pp_reduce_scatter to False.
[04/27/2025-18:23:16] [TRT-LLM] [W] The build_config is ignored for model format of TLLM_ENGINE.
[04/27/2025-18:23:16] [TRT-LLM] [W] Implicitly setting LLaMAConfig.fc_after_embed = False
[04/27/2025-18:23:16] [TRT-LLM] [W] Implicitly setting LLaMAConfig.use_input_layernorm_in_first_layer = True
[04/27/2025-18:23:16] [TRT-LLM] [W] Implicitly setting LLaMAConfig.use_last_layernorm = True
[04/27/2025-18:23:16] [TRT-LLM] [W] Implicitly setting LLaMAConfig.layer_idx_offset = 0
[04/27/2025-18:23:16] [TRT-LLM] [W] Implicitly setting LLaMAConfig.has_partial_lora_mask = False
[04/27/2025-18:23:16] [TRT-LLM] [W] Implicitly setting LLaMAConfig.producer = {'name': 'modelopt', 'version': '0.27.1'}
[04/27/2025-18:23:16] [TRT-LLM] [W] Implicitly setting LLaMAConfig.share_embedding_table = False
[04/27/2025-18:23:16] [TRT-LLM] [W] Implicitly setting LLaMAConfig.bias = False
[04/27/2025-18:23:16] [TRT-LLM] [W] Implicitly setting LLaMAConfig.rotary_pct = 1.0
[04/27/2025-18:23:16] [TRT-LLM] [W] Implicitly setting LLaMAConfig.rank = 0
[04/27/2025-18:23:16] [TRT-LLM] [W] Implicitly setting LLaMAConfig.decoder = llama
[04/27/2025-18:23:16] [TRT-LLM] [W] Implicitly setting LLaMAConfig.rmsnorm = True
[04/27/2025-18:23:16] [TRT-LLM] [W] Implicitly setting LLaMAConfig.lm_head_bias = False
[04/27/2025-18:23:16] [TRT-LLM] [W] Implicitly setting LLaMAConfig.tie_word_embeddings = False
[04/27/2025-18:23:16] [TRT-LLM] [I] Set dtype to bfloat16.
[04/27/2025-18:23:16] [TRT-LLM] [I] Set bert_attention_plugin to auto.
[04/27/2025-18:23:16] [TRT-LLM] [I] Set gpt_attention_plugin to auto.
[04/27/2025-18:23:16] [TRT-LLM] [I] Set gemm_plugin to nvfp4.
[04/27/2025-18:23:16] [TRT-LLM] [I] Set explicitly_disable_gemm_plugin to False.
[04/27/2025-18:23:16] [TRT-LLM] [I] Set gemm_swiglu_plugin to None.
[04/27/2025-18:23:16] [TRT-LLM] [I] Set fp8_rowwise_gemm_plugin to None.
[04/27/2025-18:23:16] [TRT-LLM] [I] Set qserve_gemm_plugin to None.
[04/27/2025-18:23:16] [TRT-LLM] [I] Set identity_plugin to None.
[04/27/2025-18:23:16] [TRT-LLM] [I] Set nccl_plugin to None.
[04/27/2025-18:23:16] [TRT-LLM] [I] Set lora_plugin to None.
[04/27/2025-18:23:16] [TRT-LLM] [I] Set dora_plugin to False.
[04/27/2025-18:23:16] [TRT-LLM] [I] Set weight_only_groupwise_quant_matmul_plugin to None.
[04/27/2025-18:23:16] [TRT-LLM] [I] Set weight_only_quant_matmul_plugin to None.
[04/27/2025-18:23:16] [TRT-LLM] [I] Set smooth_quant_plugins to True.
[04/27/2025-18:23:16] [TRT-LLM] [I] Set smooth_quant_gemm_plugin to None.
[04/27/2025-18:23:16] [TRT-LLM] [I] Set layernorm_quantization_plugin to None.
[04/27/2025-18:23:16] [TRT-LLM] [I] Set rmsnorm_quantization_plugin to None.
[04/27/2025-18:23:16] [TRT-LLM] [I] Set quantize_per_token_plugin to False.
[04/27/2025-18:23:16] [TRT-LLM] [I] Set quantize_tensor_plugin to False.
[04/27/2025-18:23:16] [TRT-LLM] [I] Set moe_plugin to auto.
[04/27/2025-18:23:16] [TRT-LLM] [I] Set mamba_conv1d_plugin to auto.
[04/27/2025-18:23:16] [TRT-LLM] [I] Set low_latency_gemm_plugin to None.
[04/27/2025-18:23:16] [TRT-LLM] [I] Set low_latency_gemm_swiglu_plugin to None.
[04/27/2025-18:23:16] [TRT-LLM] [I] Set gemm_allreduce_plugin to None.
[04/27/2025-18:23:16] [TRT-LLM] [I] Set context_fmha to True.
[04/27/2025-18:23:16] [TRT-LLM] [I] Set bert_context_fmha_fp32_acc to False.
[04/27/2025-18:23:16] [TRT-LLM] [I] Set paged_kv_cache to True.
[04/27/2025-18:23:16] [TRT-LLM] [I] Set remove_input_padding to True.
[04/27/2025-18:23:16] [TRT-LLM] [I] Set norm_quant_fusion to False.
[04/27/2025-18:23:16] [TRT-LLM] [I] Set reduce_fusion to False.
[04/27/2025-18:23:16] [TRT-LLM] [I] Set user_buffer to False.
[04/27/2025-18:23:16] [TRT-LLM] [I] Set tokens_per_block to 32.
[04/27/2025-18:23:16] [TRT-LLM] [I] Set use_paged_context_fmha to True.
[04/27/2025-18:23:16] [TRT-LLM] [I] Set use_fp8_context_fmha to False.
[04/27/2025-18:23:16] [TRT-LLM] [I] Set fuse_fp4_quant to False.
[04/27/2025-18:23:16] [TRT-LLM] [I] Set multiple_profiles to False.
[04/27/2025-18:23:16] [TRT-LLM] [I] Set paged_state to False.
[04/27/2025-18:23:16] [TRT-LLM] [I] Set streamingllm to False.
[04/27/2025-18:23:16] [TRT-LLM] [I] Set manage_weights to False.
[04/27/2025-18:23:16] [TRT-LLM] [I] Set use_fused_mlp to True.
[04/27/2025-18:23:16] [TRT-LLM] [I] Set pp_reduce_scatter to False.
[04/27/2025-18:23:16] [TRT-LLM] [I] Set nccl_plugin to None.
rank 0 using MpiPoolSession to spawn MPI processes
[04/27/2025-18:23:17] [TRT-LLM] [I] Generating a new HMAC key for server proxy_request_queue
[04/27/2025-18:23:17] [TRT-LLM] [I] Generating a new HMAC key for server proxy_request_error_queue
[04/27/2025-18:23:17] [TRT-LLM] [I] Generating a new HMAC key for server proxy_result_queue
[04/27/2025-18:23:17] [TRT-LLM] [I] Generating a new HMAC key for server proxy_stats_queue
[04/27/2025-18:23:17] [TRT-LLM] [I] Generating a new HMAC key for server proxy_kv_cache_events_queue
2025-04-27 18:23:23,215 - INFO - flashinfer.jit: Prebuilt kernels not found, using JIT backend
[TensorRT-LLM] TensorRT-LLM version: 0.20.0rc1
[TensorRT-LLM][INFO] Refreshed the MPI local session
[TensorRT-LLM][INFO] Engine version 0.20.0rc1 found in the config file, assuming engine(s) built by new builder API.
[TensorRT-LLM][INFO] Refreshed the MPI local session
[TensorRT-LLM][INFO] MPI size: 1, MPI local size: 1, rank: 0
[TensorRT-LLM][INFO] Rank 0 is using GPU 0
[TensorRT-LLM][INFO] TRTGptModel maxNumSequences: 2048
[TensorRT-LLM][INFO] TRTGptModel maxBatchSize: 2048
[TensorRT-LLM][INFO] TRTGptModel maxBeamWidth: 1
[TensorRT-LLM][INFO] TRTGptModel maxSequenceLen: 131072
[TensorRT-LLM][INFO] TRTGptModel maxDraftLen: 0
[TensorRT-LLM][INFO] TRTGptModel mMaxAttentionWindowSize: (131072) * 80
[TensorRT-LLM][INFO] TRTGptModel enableTrtOverlap: 0
[TensorRT-LLM][INFO] TRTGptModel normalizeLogProbs: 0
[TensorRT-LLM][INFO] TRTGptModel maxNumTokens: 8192
[TensorRT-LLM][INFO] TRTGptModel maxInputLen: 8192 = min(maxSequenceLen - 1, maxNumTokens) since context FMHA and usePackedInput are enabled
[TensorRT-LLM][INFO] TRTGptModel If model type is encoder, maxInputLen would be reset in trtEncoderModel to maxInputLen: min(maxSequenceLen, maxNumTokens).
[TensorRT-LLM][INFO] Capacity Scheduler Policy: GUARANTEED_NO_EVICT
[TensorRT-LLM][INFO] Context Chunking Scheduler Policy: None
[TensorRT-LLM][INFO] Loaded engine size: 40815 MiB
[TensorRT-LLM][INFO] Engine load time 9934 ms
[TensorRT-LLM][INFO] Inspecting the engine to identify potential runtime issues...
[TensorRT-LLM][INFO] The profiling verbosity of the engine does not allow this analysis to proceed. Re-build the engine with 'detailed' profiling verbosity to get more diagnostics.
[TensorRT-LLM][INFO] [MemUsageChange] Allocated 1885.04 MiB for execution context memory.
[TensorRT-LLM][INFO] gatherContextLogits: 0
[TensorRT-LLM][INFO] gatherGenerationLogits: 0
[TensorRT-LLM][INFO] [MemUsageChange] TensorRT-managed allocation in IExecutionContext creation: CPU +0, GPU +0, now: CPU 0, GPU 40794 (MiB)
[TensorRT-LLM][INFO] [MemUsageChange] Allocated 5.49 GB GPU memory for runtime buffers.
[TensorRT-LLM][INFO] [MemUsageChange] Allocated 6.54 GB GPU memory for decoder.
terminate called after throwing an instance of 'tensorrt_llm::common::TllmException'
  what():  Cannot determine size of FP4 data type (/code/tensorrt_llm/cpp/include/tensorrt_llm/common/dataType.h:40)
1       0x7f973fe01d33 /root/.local/lib/python3.12/site-packages/tensorrt_llm/libs/libtensorrt_llm.so(+0xa1dd33) [0x7f973fe01d33]
2       0x7f9740c7fdf3 tensorrt_llm::batch_manager::TrtGptModelInflightBatching::TrtGptModelInflightBatching(std::shared_ptr<nvinfer1::ILogger>, tensorrt_llm::runtime::ModelConfig const&, tensorrt_llm::runtime::WorldConfig const&, tensorrt_llm::runtime::RawEngine const&, bool, tensorrt_llm::batch_manager::TrtGptModelOptionalParams const&) + 3635
3       0x7f9740d691ce tensorrt_llm::batch_manager::TrtGptModelFactory::create(tensorrt_llm::runtime::RawEngine const&, tensorrt_llm::runtime::ModelConfig const&, tensorrt_llm::runtime::WorldConfig const&, tensorrt_llm::batch_manager::TrtGptModelType, tensorrt_llm::batch_manager::TrtGptModelOptionalParams const&) + 654
4       0x7f9740d4dc39 tensorrt_llm::executor::Executor::Impl::createModel(tensorrt_llm::runtime::RawEngine const&, tensorrt_llm::runtime::ModelConfig const&, tensorrt_llm::runtime::WorldConfig const&, tensorrt_llm::executor::ExecutorConfig const&) + 185
5       0x7f9740d4ece5 tensorrt_llm::executor::Executor::Impl::loadModel(std::optional<std::filesystem::__cxx11::path> const&, std::optional<std::basic_string_view<unsigned char, std::char_traits<unsigned char> > > const&, tensorrt_llm::runtime::GptJsonConfig const&, tensorrt_llm::executor::ExecutorConfig const&, bool, std::optional<std::map<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, tensorrt_llm::executor::Tensor, std::less<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, std::allocator<std::pair<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const, tensorrt_llm::executor::Tensor> > > > const&) + 1173
6       0x7f9740d55dfa tensorrt_llm::executor::Executor::Impl::Impl(std::filesystem::__cxx11::path const&, std::optional<std::filesystem::__cxx11::path> const&, tensorrt_llm::executor::ModelType, tensorrt_llm::executor::ExecutorConfig const&) + 2474
7       0x7f9740d3a477 tensorrt_llm::executor::Executor::Executor(std::filesystem::__cxx11::path const&, tensorrt_llm::executor::ModelType, tensorrt_llm::executor::ExecutorConfig const&) + 87
8       0x7f976d2136e0 /root/.local/lib/python3.12/site-packages/tensorrt_llm/bindings.cpython-312-x86_64-linux-gnu.so(+0x13f6e0) [0x7f976d2136e0]
9       0x7f976d190af3 /root/.local/lib/python3.12/site-packages/tensorrt_llm/bindings.cpython-312-x86_64-linux-gnu.so(+0xbcaf3) [0x7f976d190af3]
10            0x58208f /usr/bin/python() [0x58208f]
11            0x549185 _PyObject_MakeTpCall + 117
12            0x54cea7 /usr/bin/python() [0x54cea7]
13            0x59e231 /usr/bin/python() [0x59e231]
14            0x599b63 /usr/bin/python() [0x599b63]
15      0x7f976d18df4d /root/.local/lib/python3.12/site-packages/tensorrt_llm/bindings.cpython-312-x86_64-linux-gnu.so(+0xb9f4d) [0x7f976d18df4d]
16            0x549185 _PyObject_MakeTpCall + 117
17            0x5d73c9 _PyEval_EvalFrameDefault + 2697
18            0x54aa9a _PyObject_Call_Prepend + 394
19            0x59e09f /usr/bin/python() [0x59e09f]
20            0x599b63 /usr/bin/python() [0x599b63]
21            0x54924e _PyObject_MakeTpCall + 318
22            0x5d73c9 _PyEval_EvalFrameDefault + 2697
23            0x5d58eb PyEval_EvalCode + 347
24            0x5d347c /usr/bin/python() [0x5d347c]
25            0x581f0d /usr/bin/python() [0x581f0d]
26            0x549b85 PyObject_Vectorcall + 53
27            0x5d73c9 _PyEval_EvalFrameDefault + 2697
28            0x6bcce2 /usr/bin/python() [0x6bcce2]
29            0x6bc912 Py_RunMain + 562
30            0x6bc57d Py_BytesMain + 45
31      0x7f9af03bb1ca /usr/lib/x86_64-linux-gnu/libc.so.6(+0x2a1ca) [0x7f9af03bb1ca]
32      0x7f9af03bb28b __libc_start_main + 139
33            0x657ce5 _start + 37
[b200enis-devel:785754] *** Process received signal ***
[b200enis-devel:785754] Signal: Aborted (6)
[b200enis-devel:785754] Signal code:  (-6)
[b200enis-devel:785754] [ 0] /usr/lib/x86_64-linux-gnu/libc.so.6(+0x45330)[0x7f9af03d6330]
[b200enis-devel:785754] [ 1] /usr/lib/x86_64-linux-gnu/libc.so.6(pthread_kill+0x11c)[0x7f9af042fb2c]
[b200enis-devel:785754] [ 2] /usr/lib/x86_64-linux-gnu/libc.so.6(gsignal+0x1e)[0x7f9af03d627e]
[b200enis-devel:785754] [ 3] /usr/lib/x86_64-linux-gnu/libc.so.6(abort+0xdf)[0x7f9af03b98ff]
[b200enis-devel:785754] [ 4] /usr/lib/x86_64-linux-gnu/libstdc++.so.6(+0xa5ff5)[0x7f987fdc1ff5]
[b200enis-devel:785754] [ 5] /usr/lib/x86_64-linux-gnu/libstdc++.so.6(+0xbb0da)[0x7f987fdd70da]
[b200enis-devel:785754] [ 6] /usr/lib/x86_64-linux-gnu/libstdc++.so.6(__cxa_call_terminate+0x33)[0x7f987fdc18e6]
[b200enis-devel:785754] [ 7] /usr/lib/x86_64-linux-gnu/libstdc++.so.6(__gxx_personality_v0+0x31a)[0x7f987fdd68ba]
[b200enis-devel:785754] [ 8] /usr/lib/x86_64-linux-gnu/libgcc_s.so.1(+0x22b06)[0x7f98b805eb06]
[b200enis-devel:785754] [ 9] /usr/lib/x86_64-linux-gnu/libgcc_s.so.1(_Unwind_RaiseException+0x311)[0x7f98b805f1f1]
[b200enis-devel:785754] [10] /usr/lib/x86_64-linux-gnu/libstdc++.so.6(__cxa_throw+0x44)[0x7f987fdd7384]
[b200enis-devel:785754] [11] /root/.local/lib/python3.12/site-packages/tensorrt_llm/libs/libtensorrt_llm.so(+0xa1dcec)[0x7f973fe01cec]
[b200enis-devel:785754] [12] /root/.local/lib/python3.12/site-packages/tensorrt_llm/libs/libtensorrt_llm.so(_ZN12tensorrt_llm13batch_manager27TrtGptModelInflightBatchingC1ESt10shared_ptrIN8nvinfer17ILoggerEERKNS_7runtime11ModelConfigERKNS6_11WorldConfigERKNS6_9RawEngineEbRKNS0_25TrtGptModelOptionalParamsE+0xe33)[0x7f9740c7fdf3]
[b200enis-devel:785754] [13] /root/.local/lib/python3.12/site-packages/tensorrt_llm/libs/libtensorrt_llm.so(_ZN12tensorrt_llm13batch_manager18TrtGptModelFactory6createERKNS_7runtime9RawEngineERKNS2_11ModelConfigERKNS2_11WorldConfigENS0_15TrtGptModelTypeERKNS0_25TrtGptModelOptionalParamsE+0x28e)[0x7f9740d691ce]
[b200enis-devel:785754] [14] /root/.local/lib/python3.12/site-packages/tensorrt_llm/libs/libtensorrt_llm.so(_ZN12tensorrt_llm8executor8Executor4Impl11createModelERKNS_7runtime9RawEngineERKNS3_11ModelConfigERKNS3_11WorldConfigERKNS0_14ExecutorConfigE+0xb9)[0x7f9740d4dc39]
[b200enis-devel:785754] [15] /root/.local/lib/python3.12/site-packages/tensorrt_llm/libs/libtensorrt_llm.so(_ZN12tensorrt_llm8executor8Executor4Impl9loadModelERKSt8optionalINSt10filesystem7__cxx114pathEERKS3_ISt17basic_string_viewIhSt11char_traitsIhEEERKNS_7runtime13GptJsonConfigERKNS0_14ExecutorConfigEbRKS3_ISt3mapINSt7__cxx1112basic_stringIcSB_IcESaIcEEENS0_6TensorESt4lessIST_ESaISt4pairIKST_SU_EEEE+0x495)[0x7f9740d4ece5]
[b200enis-devel:785754] [16] /root/.local/lib/python3.12/site-packages/tensorrt_llm/libs/libtensorrt_llm.so(_ZN12tensorrt_llm8executor8Executor4ImplC1ERKNSt10filesystem7__cxx114pathERKSt8optionalIS5_ENS0_9ModelTypeERKNS0_14ExecutorConfigE+0x9aa)[0x7f9740d55dfa]
[b200enis-devel:785754] [17] /root/.local/lib/python3.12/site-packages/tensorrt_llm/libs/libtensorrt_llm.so(_ZN12tensorrt_llm8executor8ExecutorC1ERKNSt10filesystem7__cxx114pathENS0_9ModelTypeERKNS0_14ExecutorConfigE+0x57)[0x7f9740d3a477]
[b200enis-devel:785754] [18] /root/.local/lib/python3.12/site-packages/tensorrt_llm/bindings.cpython-312-x86_64-linux-gnu.so(+0x13f6e0)[0x7f976d2136e0]
[b200enis-devel:785754] [19] /root/.local/lib/python3.12/site-packages/tensorrt_llm/bindings.cpython-312-x86_64-linux-gnu.so(+0xbcaf3)[0x7f976d190af3]
[b200enis-devel:785754] [20] /usr/bin/python[0x58208f]
[b200enis-devel:785754] [21] /usr/bin/python(_PyObject_MakeTpCall+0x75)[0x549185]
[b200enis-devel:785754] [22] /usr/bin/python[0x54cea7]
[b200enis-devel:785754] [23] /usr/bin/python[0x59e231]
[b200enis-devel:785754] [24] /usr/bin/python[0x599b63]
[b200enis-devel:785754] [25] /root/.local/lib/python3.12/site-packages/tensorrt_llm/bindings.cpython-312-x86_64-linux-gnu.so(+0xb9f4d)[0x7f976d18df4d]
[b200enis-devel:785754] [26] /usr/bin/python(_PyObject_MakeTpCall+0x75)[0x549185]
[b200enis-devel:785754] [27] /usr/bin/python(_PyEval_EvalFrameDefault+0xa89)[0x5d73c9]
[b200enis-devel:785754] [28] /usr/bin/python(_PyObject_Call_Prepend+0x18a)[0x54aa9a]
[b200enis-devel:785754] [29] /usr/bin/python[0x59e09f]
[b200enis-devel:785754] *** End of error message ***
--------------------------------------------------------------------------
Child job 2 terminated normally, but 1 process returned
a non-zero exit code. Per user-direction, the job has been aborted.
--------------------------------------------------------------------------
Expected behavior

The engine starts up successfully and is able to process inference requests.

System information

  • Container used (if applicable): TensorRT LLM container built from source using instructions here
  • OS (e.g., Ubuntu 22.04, CentOS 7, Windows 10): Ubuntu 24.04.1 LTS
  • CPU architecture (x86_64, aarch64): x86_64
  • GPU name (e.g. H100, A100, L40S): NVIDIA B200
  • GPU memory size: 179.1 GB
  • Number of GPUs: 8
  • Library versions (if applicable):
    • Python: 3.12.3
    • ModelOpt version or commit hash: 0.27.1
    • CUDA: 12.8
    • PyTorch: 2.7.0a0+7c8ec84dab.nv25.03
    • Transformers: 4.51.3
    • TensorRT-LLM: 0.20.0rc1
    • ONNXRuntime: ?
    • TensorRT: 10.9.0.34
  • Any other details that may help: I quantized both the weights and the KV cache; this feature might be missing from TensorRT LLM, but I am not entirely sure.
@enisaras enisaras added the bug Something isn't working label Apr 27, 2025
@enisaras enisaras changed the title Cannot run modelopt quantize nvfp4 model on TensorRT LLM Cannot run modelopt quantized nvfp4 model on TensorRT LLM Apr 27, 2025
@enisaras enisaras changed the title Cannot run modelopt quantized nvfp4 model on TensorRT LLM Cannot serve modelopt quantized nvfp4 model on TensorRT LLM Apr 27, 2025
@meenchen
Collaborator

Hi @enisaras, could you try using fp8 kv cache? fp4 kv cache is not ready for TRT-LLM.
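
A minimal sketch of the adjusted quantization command from step 1, keeping nvfp4 weights but switching the KV cache to fp8 as suggested; the exact value accepted by --kv_cache_qformat (assumed here to be fp8) should be confirmed against the hf_ptq.py help:

# Assumes --kv_cache_qformat accepts "fp8"; check `python hf_ptq.py --help` to confirm.
python hf_ptq.py --pyt_ckpt_path meta-llama/Llama-3.1-70B-Instruct --qformat nvfp4 --batch_size 32 --kv_cache_qformat fp8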

@enisaras
Author

Got it. Do I need to create a feature request in the TensorRT-LLM repository, or is this already on the roadmap?
