
🐛 [Bug] Error when serving Torch-TensorRT JIT model to Nvidia-Triton #3248

@zmy1116

Bug Description

I'm trying to serve a Torch-TensorRT optimized model with the Nvidia Triton server, based on the provided tutorial:
https://pytorch.org/TensorRT/tutorials/serving_torch_tensorrt_with_triton.html

First, the provided script to generate the optimized model does not work; I tweaked it a bit and got it to work. Then, when I try to perform inference through the Triton server, I get the error
ERROR: [Torch-TensorRT] - IExecutionContext::enqueueV3: Error Code 1: Cuda Runtime (invalid resource handle)

To Reproduce

The PyTorch page provides the following script to save the optimized JIT model:

import torch
import torch_tensorrt

torch.hub._validate_not_a_forked_repo = lambda a, b, c: True

# Load the model
model = torch.hub.load('pytorch/vision:v0.10.0', 'resnet50', pretrained=True).eval().to("cuda")

# Compile with Torch-TensorRT
trt_model = torch_tensorrt.compile(model,
    inputs=[torch_tensorrt.Input((1, 3, 224, 224))],
    enabled_precisions={torch.half}  # Run with FP16
)

# Save the model
torch.jit.save(trt_model, "model.pt")

When I run this script, I get the error AttributeError: 'GraphModule' object has no attribute 'save'.
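Presumably torch_tensorrt.compile returned a torch.fx GraphModule (the default output of the dynamo frontend in recent releases) rather than a TorchScript module, which is why .save is missing. A possible workaround, assuming the installed version accepts the ir argument, is to request the TorchScript frontend explicitly so the result can be saved with torch.jit.save (a sketch, not verified on this setup):

import torch
import torch_tensorrt

model = torch.hub.load('pytorch/vision:v0.10.0', 'resnet50', pretrained=True).eval().to("cuda")

# Assumption: ir="torchscript" selects the TorchScript frontend, whose output
# is a torch.jit.ScriptModule and can therefore be saved with torch.jit.save.
trt_model = torch_tensorrt.compile(model,
    ir="torchscript",
    inputs=[torch_tensorrt.Input((1, 3, 224, 224))],
    enabled_precisions={torch.half}
)
torch.jit.save(trt_model, "model.pt")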

In practice, I tried the following two ways to get a savable model:

  1. Save the TorchScript submodule embedded in the compiled module directly (see also the torch_tensorrt.save sketch after this list):
    torch.jit.save(trt_model._run_on_acc_0, "/home/ubuntu/model.pt")

  2. Compile a traced JIT model directly:

model = torch.hub.load('pytorch/vision:v0.10.0', 'resnet50', pretrained=True).eval().to("cuda")
model_jit = torch.jit.trace(model, [torch.rand(1, 3, 224, 224).cuda()])
trt_model = torch_tensorrt.compile(model_jit,
    inputs=[torch_tensorrt.Input((1, 3, 224, 224))],
    enabled_precisions={torch.half}  # Run with FP16
)
torch.jit.save(trt_model, "model.pt")
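For reference, newer Torch-TensorRT releases also expose a top-level torch_tensorrt.save helper (an assumption here; availability depends on the installed version) that can serialize the compiled module to TorchScript without reaching into internal attributes like _run_on_acc_0. A minimal sketch:

import torch
import torch_tensorrt

# Sketch, assuming torch_tensorrt.save with output_format="torchscript" exists
# in the installed release; sample inputs are needed so it can retrace.
torch_tensorrt.save(trt_model, "model.pt",
    output_format="torchscript",
    inputs=[torch.rand(1, 3, 224, 224).cuda()])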

I confirmed that both methods create the JIT model correctly.

I then put the model in a folder with the same structure the tutorial provides and launched the Triton server. The Triton server launches successfully (startup log below).
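For context, the repository layout from the tutorial looks roughly like this (the config.pbtxt is restated here as an assumption, with the input__0/output__0 names following the pytorch_libtorch backend's naming convention; double-check against the tutorial page):

model_repository/
└── resnet50/
    ├── config.pbtxt
    └── 1/
        └── model.pt

# config.pbtxt
name: "resnet50"
platform: "pytorch_libtorch"
max_batch_size: 0
input [
  {
    name: "input__0"
    data_type: TYPE_FP32
    dims: [ 1, 3, 224, 224 ]
  }
]
output [
  {
    name: "output__0"
    data_type: TYPE_FP32
    dims: [ 1, 1000 ]
  }
]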

I1018 03:38:23.657822 1 server.cc:674] 
+----------+---------+--------+
| Model    | Version | Status |
+----------+---------+--------+
| resnet50 | 1       | READY  |
+----------+---------+--------+

I1018 03:38:23.886797 1 metrics.cc:877] "Collecting metrics for GPU 0: NVIDIA L4"
I1018 03:38:23.886839 1 metrics.cc:877] "Collecting metrics for GPU 1: NVIDIA L4"
I1018 03:38:23.886852 1 metrics.cc:877] "Collecting metrics for GPU 2: NVIDIA L4"
I1018 03:38:23.886864 1 metrics.cc:877] "Collecting metrics for GPU 3: NVIDIA L4"
I1018 03:38:23.886873 1 metrics.cc:877] "Collecting metrics for GPU 4: NVIDIA L4"
I1018 03:38:23.886882 1 metrics.cc:877] "Collecting metrics for GPU 5: NVIDIA L4"
I1018 03:38:23.886893 1 metrics.cc:877] "Collecting metrics for GPU 6: NVIDIA L4"
I1018 03:38:23.886901 1 metrics.cc:877] "Collecting metrics for GPU 7: NVIDIA L4"
I1018 03:38:23.916949 1 metrics.cc:770] "Collecting CPU metrics"
I1018 03:38:23.917116 1 tritonserver.cc:2598] 

+----------------------------------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| Option                           | Value                                                                                                                                                                                                           |
+----------------------------------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| server_id                        | triton                                                                                                                                                                                                          |
| server_version                   | 2.50.0                                                                                                                                                                                                          |
| server_extensions                | classification sequence model_repository model_repository(unload_dependents) schedule_policy model_configuration system_shared_memory cuda_shared_memory binary_tensor_data parameters statistics trace logging |
| model_repository_path[0]         | /home/ubuntu/model_repository_4                                                                                                                                                                                 |
| model_control_mode               | MODE_NONE                                                                                                                                                                                                       |
| strict_model_config              | 0                                                                                                                                                                                                               |
| model_config_name                |                                                                                                                                                                                                                 |
| rate_limit                       | OFF                                                                                                                                                                                                             |
| pinned_memory_pool_byte_size     | 268435456                                                                                                                                                                                                       |
| cuda_memory_pool_byte_size{0}    | 67108864                                                                                                                                                                                                        |
| cuda_memory_pool_byte_size{1}    | 67108864                                                                                                                                                                                                        |
| cuda_memory_pool_byte_size{2}    | 67108864                                                                                                                                                                                                        |
| cuda_memory_pool_byte_size{3}    | 67108864                                                                                                                                                                                                        |
| cuda_memory_pool_byte_size{4}    | 67108864                                                                                                                                                                                                        |
| cuda_memory_pool_byte_size{5}    | 67108864                                                                                                                                                                                                        |
| cuda_memory_pool_byte_size{6}    | 67108864                                                                                                                                                                                                        |
| cuda_memory_pool_byte_size{7}    | 67108864                                                                                                                                                                                                        |
| min_supported_compute_capability | 6.0                                                                                                                                                                                                             |
| strict_readiness                 | 1                                                                                                                                                                                                               |
| exit_timeout                     | 30                                                                                                                                                                                                              |
| cache_enabled                    | 0                                                                                                                                                                                                               |
+----------------------------------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+

However, when I perform inference, I get the error
ERROR: [Torch-TensorRT] - IExecutionContext::enqueueV3: Error Code 1: Cuda Runtime (invalid resource handle)
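For completeness, inference was performed with a standard Triton HTTP client along these lines (a sketch; the tensor names input__0/output__0 are assumptions following the pytorch_libtorch convention and the tutorial):

import numpy as np
import tritonclient.http as httpclient

client = httpclient.InferenceServerClient(url="localhost:8000")

# Build a dummy FP32 input matching the compiled shape (1, 3, 224, 224).
data = np.random.rand(1, 3, 224, 224).astype(np.float32)
inputs = httpclient.InferInput("input__0", list(data.shape), "FP32")
inputs.set_data_from_numpy(data)

outputs = httpclient.InferRequestedOutput("output__0")
results = client.infer(model_name="resnet50", inputs=[inputs], outputs=[outputs])
print(results.as_numpy("output__0").shape)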

Expected behavior

I expect the inference to succeed. I want to serve Torch-TensorRT optimized models on Nvidia Triton. Our team observed that, on models like SAM2, Torch-TensorRT is significantly faster than the equivalent (Torch -> ONNX -> TensorRT) converted model. Our entire inference stack runs on Nvidia Triton, and we would like to take advantage of this new tool.

Environment

We use the Nvidia NGC Docker images directly.
PyTorch for model optimization: nvcr.io/nvidia/pytorch:24.09-py3
Triton for hosting: nvcr.io/nvidia/tritonserver:24.09-py3
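A quick way to record the exact library versions inside each container for the report (assuming the standard __version__ attributes):

import torch
import torch_tensorrt
import tensorrt

# Print the versions shipped in the NGC containers.
print("torch:", torch.__version__)
print("torch_tensorrt:", torch_tensorrt.__version__)
print("tensorrt:", tensorrt.__version__)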

Additional context

Actually, our current stack is on tritonserver:24.03, and we verified that it also does not work with nvcr.io/nvidia/tritonserver:24.03-py3 and nvcr.io/nvidia/pytorch:24.03-py3.

Please let us know if you need additional information.
