
TensorRT optimization shows unexpected results #4405

@geraldstanje

Description

Hi,

I'm trying to create a TensorRT engine from an ONNX model.

I tried a few configurations; here are the inference latencies. Why does 3 perform worse than 2, and why does 4 show no improvement over 2?

  1. FP32 (default): 5.2ms
  2. FP16: 2.7ms
  3. INT8: 5.8ms
  4. FP16 + builderOptimizationLevel=5: 2.7ms

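A quick sanity check on the numbers above (a minimal sketch; the latencies are the ones reported in this issue). One plausible explanation, to be confirmed against the verbose build logs: for a transformer ONNX graph with no Q/DQ nodes, `trtexec --int8` uses implicit quantization with placeholder dynamic ranges (no calibration cache is supplied here), so layers without INT8 kernels fall back to FP32/FP16 and extra reformat layers are inserted, which can make the INT8 engine slower than a pure FP16 build.

```python
# Speedups relative to the FP32 baseline, using the latencies reported above (ms).
latencies = {
    "fp32": 5.2,
    "fp16": 2.7,
    "int8": 5.8,
    "fp16_opt5": 2.7,
}

def speedup_vs_fp32(name: str) -> float:
    """Return how many times faster a configuration is than the FP32 baseline."""
    return latencies["fp32"] / latencies[name]

for name in ("fp16", "int8", "fp16_opt5"):
    print(f"{name}: {speedup_vs_fp32(name):.2f}x")
# fp16: 1.93x, int8: 0.90x (i.e. slower than FP32), fp16_opt5: 1.93x
```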
trtexec runs:

  1. FP32
#!/bin/bash

ONNX_MODEL_NAME=$1 # model.onnx
TRT_MODEL_NAME=$2 # model.plan
WORKSPACE=14000

alias trtexec="/usr/src/tensorrt/bin/trtexec"

# convert onnx model to trt model
/usr/src/tensorrt/bin/trtexec \
    --onnx=${ONNX_MODEL_NAME} \
    --saveEngine=${TRT_MODEL_NAME} \
    --minShapes=input_ids:1x1,attention_mask:1x1 \
    --optShapes=input_ids:1x100,attention_mask:1x100 \
    --maxShapes=input_ids:1x4000,attention_mask:1x4000 \
    --memPoolSize=workspace:${WORKSPACE} \
    --verbose \
| tee conversion.txt

# run generated trt model
/usr/src/tensorrt/bin/trtexec --loadEngine=${TRT_MODEL_NAME} --verbose #--dumpProfile
  2. FP16
#!/bin/bash

ONNX_MODEL_NAME=$1 # model.onnx
TRT_MODEL_NAME=$2 # model.plan
WORKSPACE=14000

alias trtexec="/usr/src/tensorrt/bin/trtexec"

# convert onnx model to trt model
/usr/src/tensorrt/bin/trtexec \
    --onnx=${ONNX_MODEL_NAME} \
    --saveEngine=${TRT_MODEL_NAME} \
    --fp16 \
    --minShapes=input_ids:1x1,attention_mask:1x1 \
    --optShapes=input_ids:1x100,attention_mask:1x100 \
    --maxShapes=input_ids:1x4000,attention_mask:1x4000 \
    --memPoolSize=workspace:${WORKSPACE} \
    --verbose \
| tee conversion.txt

# run generated trt model
/usr/src/tensorrt/bin/trtexec --loadEngine=${TRT_MODEL_NAME} --verbose #--dumpProfile
  3. INT8
#!/bin/bash

ONNX_MODEL_NAME=$1 # model.onnx
TRT_MODEL_NAME=$2 # model.plan
WORKSPACE=14000

alias trtexec="/usr/src/tensorrt/bin/trtexec"

# convert onnx model to trt model
/usr/src/tensorrt/bin/trtexec \
    --onnx=${ONNX_MODEL_NAME} \
    --saveEngine=${TRT_MODEL_NAME} \
    --int8 \
    --minShapes=input_ids:1x1,attention_mask:1x1 \
    --optShapes=input_ids:1x100,attention_mask:1x100 \
    --maxShapes=input_ids:1x4000,attention_mask:1x4000 \
    --memPoolSize=workspace:${WORKSPACE} \
    --verbose \
| tee conversion.txt

# run generated trt model
/usr/src/tensorrt/bin/trtexec --loadEngine=${TRT_MODEL_NAME} --verbose #--dumpProfile
  4. FP16 + builderOptimizationLevel=5
#!/bin/bash

ONNX_MODEL_NAME=$1 # model.onnx
TRT_MODEL_NAME=$2 # model.plan
WORKSPACE=14000

alias trtexec="/usr/src/tensorrt/bin/trtexec"

# convert onnx model to trt model
/usr/src/tensorrt/bin/trtexec \
    --onnx=${ONNX_MODEL_NAME} \
    --saveEngine=${TRT_MODEL_NAME} \
    --fp16 \
    --minShapes=input_ids:1x1,attention_mask:1x1 \
    --optShapes=input_ids:1x100,attention_mask:1x100 \
    --maxShapes=input_ids:1x4000,attention_mask:1x4000 \
    --memPoolSize=workspace:${WORKSPACE} \
    --builderOptimizationLevel=5 \
    --verbose \
| tee conversion.txt

# run generated trt model
/usr/src/tensorrt/bin/trtexec --loadEngine=${TRT_MODEL_NAME} --verbose #--dumpProfile
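The four scripts above differ only in the precision and builder flags. A small illustrative helper (the function name `build_trtexec_cmd` is mine, not part of any tool) that assembles the same command line makes the deltas explicit:

```python
# Illustrative helper: builds the trtexec argv used in the scripts above,
# so the four build variants differ only in the extra flags passed in.
def build_trtexec_cmd(onnx_model: str, engine: str, extra_flags=(),
                      workspace_mib: int = 14000) -> list:
    return [
        "/usr/src/tensorrt/bin/trtexec",
        f"--onnx={onnx_model}",
        f"--saveEngine={engine}",
        *extra_flags,
        "--minShapes=input_ids:1x1,attention_mask:1x1",
        "--optShapes=input_ids:1x100,attention_mask:1x100",
        "--maxShapes=input_ids:1x4000,attention_mask:1x4000",
        f"--memPoolSize=workspace:{workspace_mib}",
        "--verbose",
    ]

# The four builds from this issue:
fp32 = build_trtexec_cmd("model.onnx", "model.plan")
fp16 = build_trtexec_cmd("model.onnx", "model.plan", ["--fp16"])
int8 = build_trtexec_cmd("model.onnx", "model.plan", ["--int8"])
fp16_o5 = build_trtexec_cmd("model.onnx", "model.plan",
                            ["--fp16", "--builderOptimizationLevel=5"])
```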

Logs:
trt_fp16.txt
trt_fp32.txt
trt_fp16_optimization_5.txt.zip
trt_int8.txt.zip

Environment

Triton Inference Server Version: 25.02

TensorRT Version: 10.8.0.43 (I think that's the version that ships with Triton Inference Server 25.02; see: https://docs.nvidia.com/deeplearning/triton-inference-server/release-notes/rel-25-02.html)

trtexec: v100800

NVIDIA GPU: NVIDIA A10G

nvidia-smi

Mon Mar 31 03:23:10 2025      

+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.230.02             Driver Version: 535.230.02   CUDA Version: 12.2     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  NVIDIA A10G                    Off | 00000000:00:1E.0 Off |                    0 |
|  0%   18C    P8              15W / 300W |      0MiB / 23028MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
+---------------------------------------------------------------------------------------+
| Processes:                                                                            |
|  GPU   GI   CI        PID   Type   Process name                            GPU Memory |
|        ID   ID                                                             Usage      |
|=======================================================================================|
|  No running processes found                                                           |
+---------------------------------------------------------------------------------------+

NVIDIA Driver Version:

CUDA Version:

CUDNN Version:

Operating System:

Python Version (if applicable):

Tensorflow Version (if applicable):

PyTorch Version (if applicable):

Baremetal or Container (if so, version):

Relevant Files

Model link:

Steps To Reproduce

Commands or scripts:

Have you tried the latest release?:

Can this model run on other frameworks? For example run ONNX model with ONNXRuntime (polygraphy run <model.onnx> --onnxrt):

polygraphy inspect model model.onnx      
[I] Loading model: /workspace/model.onnx
[I] ==== ONNX Model ====
    Name: main_graph | ONNX Opset: 14
    ---- 2 Graph Input(s) ----
    {input_ids [dtype=int64, shape=('batch_size', 'sequence_length')],
     attention_mask [dtype=int64, shape=('batch_size', 'sequence_length')]}

    ---- 1 Graph Output(s) ----
    {logits [dtype=float32, shape=('batch_size', 2)]}

    ---- 174 Initializer(s) ----

    ---- 4152 Node(s) ----
polygraphy run model.onnx --onnxrt
[I] RUNNING | Command: /usr/local/bin/polygraphy run model.onnx --onnxrt
[I] onnxrt-runner-N0-03/31/25-03:53:39  | Activating and starting inference
[I] Creating ONNX-Runtime Inference Session with providers: ['CPUExecutionProvider']
[W] Input tensor: input_ids [shape=BoundedShape(['batch_size', 'sequence_length'], min=None, max=None)] | Will generate data of shape: [1, 1].
    If this is incorrect, please provide a custom data loader.
[W] Input tensor: attention_mask [shape=BoundedShape(['batch_size', 'sequence_length'], min=None, max=None)] | Will generate data of shape: [1, 1].
    If this is incorrect, please provide a custom data loader.
[I] onnxrt-runner-N0-03/31/25-03:53:39
    ---- Inference Input(s) ----
    {input_ids [dtype=int64, shape=(1, 1)],
     attention_mask [dtype=int64, shape=(1, 1)]}
[I] onnxrt-runner-N0-03/31/25-03:53:39
    ---- Inference Output(s) ----
    {logits [dtype=float32, shape=(1, 2)]}
[I] onnxrt-runner-N0-03/31/25-03:53:39  | Completed 1 iteration(s) in 34.52 ms | Average inference time: 34.52 ms.
[I] PASSED | Runtime: 4.336s | Command: /usr/local/bin/polygraphy run model.onnx --onnxrt
polygraphy run model.onnx --trt --onnxrt
[I] RUNNING | Command: /usr/local/bin/polygraphy run model.onnx --trt --onnxrt
[I] TF32 is disabled by default. Turn on TF32 for better performance with minor accuracy differences.
[I] trt-runner-N0-03/31/25-03:52:36     | Activating and starting inference
[W] ModelImporter.cpp:459: Make sure input input_ids has Int64 binding.
[W] ModelImporter.cpp:459: Make sure input attention_mask has Int64 binding.
[W] Input tensor: input_ids (dtype=DataType.INT64, shape=(-1, -1)) | No shapes provided; Will use shape: [1, 1] for min/opt/max in profile.
[W] This will cause the tensor to have a static shape. If this is incorrect, please set the range of shapes for this input tensor.
[W] Input tensor: attention_mask (dtype=DataType.INT64, shape=(-1, -1)) | No shapes provided; Will use shape: [1, 1] for min/opt/max in profile.
[I] Configuring with profiles:[
        Profile 0:
            {input_ids [min=[1, 1], opt=[1, 1], max=[1, 1]],
             attention_mask [min=[1, 1], opt=[1, 1], max=[1, 1]]}
    ]
[W] profileSharing0806 is on by default in TensorRT 10.0. This flag is deprecated and has no effect.
[I] Building engine with configuration:
    Flags                  | []
    Engine Capability      | EngineCapability.STANDARD
    Memory Pools           | [WORKSPACE: 22723.50 MiB, TACTIC_DRAM: 22723.50 MiB, TACTIC_SHARED_MEMORY: 1024.00 MiB]
    Tactic Sources         | [EDGE_MASK_CONVOLUTIONS, JIT_CONVOLUTIONS]
    Profiling Verbosity    | ProfilingVerbosity.DETAILED
    Preview Features       | [PROFILE_SHARING_0806]
[I] Finished engine building in 11.023 seconds
[I] trt-runner-N0-03/31/25-03:52:36   
    ---- Inference Input(s) ----
    {input_ids [dtype=int64, shape=(1, 1)],
     attention_mask [dtype=int64, shape=(1, 1)]}
[I] trt-runner-N0-03/31/25-03:52:36   
    ---- Inference Output(s) ----
    {logits [dtype=float32, shape=(1, 2)]}
[I] trt-runner-N0-03/31/25-03:52:36     | Completed 1 iteration(s) in 16.38 ms | Average inference time: 16.38 ms.
[I] onnxrt-runner-N0-03/31/25-03:52:36  | Activating and starting inference
[I] Creating ONNX-Runtime Inference Session with providers: ['CPUExecutionProvider']
[I] onnxrt-runner-N0-03/31/25-03:52:36
    ---- Inference Input(s) ----
    {input_ids [dtype=int64, shape=(1, 1)],
     attention_mask [dtype=int64, shape=(1, 1)]}
[I] onnxrt-runner-N0-03/31/25-03:52:36
    ---- Inference Output(s) ----
    {logits [dtype=float32, shape=(1, 2)]}
[I] onnxrt-runner-N0-03/31/25-03:52:36  | Completed 1 iteration(s) in 31.44 ms | Average inference time: 31.44 ms.
[I] Accuracy Comparison | trt-runner-N0-03/31/25-03:52:36 vs. onnxrt-runner-N0-03/31/25-03:52:36
[I]     Comparing Output: 'logits' (dtype=float32, shape=(1, 2)) with 'logits' (dtype=float32, shape=(1, 2))
[I]         Tolerance: [abs=1e-05, rel=1e-05] | Checking elemwise error
[I]         trt-runner-N0-03/31/25-03:52:36: logits | Stats: mean=-0.066662, std-dev=1.567, var=2.4556, median=-0.066662, min=-1.6337 at (0, 0), max=1.5004 at (0, 1), avg-magnitude=1.567, p90=1.187, p95=1.3437, p99=1.469
[I]         onnxrt-runner-N0-03/31/25-03:52:36: logits | Stats: mean=-0.066662, std-dev=1.567, var=2.4556, median=-0.066662, min=-1.6337 at (0, 0), max=1.5004 at (0, 1), avg-magnitude=1.567, p90=1.187, p95=1.3437, p99=1.469
[I]         Error Metrics: logits
[I]             Minimum Required Tolerance: elemwise error | [abs=3.9339e-06] OR [rel=2.622e-06] (requirements may be lower if both abs/rel tolerances are set)
[I]             Absolute Difference | Stats: mean=3.6955e-06, std-dev=2.3842e-07, var=5.6843e-14, median=3.6955e-06, min=3.4571e-06 at (0, 0), max=3.9339e-06 at (0, 1), avg-magnitude=3.6955e-06, p90=3.8862e-06, p95=3.9101e-06, p99=3.9291e-06
[I]             Relative Difference | Stats: mean=2.369e-06, std-dev=2.5293e-07, var=6.3972e-14, median=2.369e-06, min=2.1161e-06 at (0, 0), max=2.622e-06 at (0, 1), avg-magnitude=2.369e-06, p90=2.5714e-06, p95=2.5967e-06, p99=2.6169e-06
[I]         PASSED | Output: 'logits' | Difference is within tolerance (rel=1e-05, abs=1e-05)
[I]     PASSED | All outputs matched | Outputs: ['logits']
[I] Accuracy Summary | trt-runner-N0-03/31/25-03:52:36 vs. onnxrt-runner-N0-03/31/25-03:52:36 | Passed: 1/1 iterations | Pass Rate: 100.0%
[I] PASSED | Runtime: 21.399s | Command: /usr/local/bin/polygraphy run model.onnx --trt --onnxrt
polygraphy run model.onnx --onnxrt --execution-providers=cuda
[I] RUNNING | Command: /usr/local/bin/polygraphy run model.onnx --onnxrt --execution-providers=cuda
[I] onnxrt-runner-N0-03/31/25-04:15:38  | Activating and starting inference
[I] Creating ONNX-Runtime Inference Session with providers: ['CUDAExecutionProvider']
2025-03-31 04:15:40.686638615 [W:onnxruntime:, transformer_memcpy.cc:83 ApplyImpl] 28 Memcpy nodes are added to the graph main_graph for CUDAExecutionProvider. It might have negative impact on performance (including unable to run CUDA graph). Set session_options.log_severity_level=1 to see the detail logs before this message.
2025-03-31 04:15:40.697494382 [W:onnxruntime:, session_state.cc:1263 VerifyEachNodeIsAssignedToAnEp] Some nodes were not assigned to the preferred execution providers which may or may not have an negative impact on performance. e.g. ORT explicitly assigns shape related ops to CPU to improve perf.
2025-03-31 04:15:40.697516462 [W:onnxruntime:, session_state.cc:1265 VerifyEachNodeIsAssignedToAnEp] Rerunning with verbose output on a non-minimal build will show node assignments.
[W] Input tensor: input_ids [shape=BoundedShape(['batch_size', 'sequence_length'], min=None, max=None)] | Will generate data of shape: [1, 1].
    If this is incorrect, please provide a custom data loader.
[W] Input tensor: attention_mask [shape=BoundedShape(['batch_size', 'sequence_length'], min=None, max=None)] | Will generate data of shape: [1, 1].
    If this is incorrect, please provide a custom data loader.
[I] onnxrt-runner-N0-03/31/25-04:15:38 
    ---- Inference Input(s) ----
    {input_ids [dtype=int64, shape=(1, 1)],
     attention_mask [dtype=int64, shape=(1, 1)]}
[I] onnxrt-runner-N0-03/31/25-04:15:38 
    ---- Inference Output(s) ----
    {logits [dtype=float32, shape=(1, 2)]}
[I] onnxrt-runner-N0-03/31/25-04:15:38  | Completed 1 iteration(s) in 145.6 ms | Average inference time: 145.6 ms.
[I] PASSED | Runtime: 2.972s | Command: /usr/local/bin/polygraphy run model.onnx --onnxrt --execution-providers=cuda
polygraphy run model.onnx --trt --onnxrt --execution-providers=cuda
[I] RUNNING | Command: /usr/local/bin/polygraphy run model.onnx --trt --onnxrt --execution-providers=cuda
[I] TF32 is disabled by default. Turn on TF32 for better performance with minor accuracy differences.
[I] trt-runner-N0-03/31/25-04:14:34     | Activating and starting inference
[W] ModelImporter.cpp:459: Make sure input input_ids has Int64 binding.
[W] ModelImporter.cpp:459: Make sure input attention_mask has Int64 binding.
[W] Input tensor: input_ids (dtype=DataType.INT64, shape=(-1, -1)) | No shapes provided; Will use shape: [1, 1] for min/opt/max in profile.
[W] This will cause the tensor to have a static shape. If this is incorrect, please set the range of shapes for this input tensor.
[W] Input tensor: attention_mask (dtype=DataType.INT64, shape=(-1, -1)) | No shapes provided; Will use shape: [1, 1] for min/opt/max in profile.
[I] Configuring with profiles:[
        Profile 0:
            {input_ids [min=[1, 1], opt=[1, 1], max=[1, 1]],
             attention_mask [min=[1, 1], opt=[1, 1], max=[1, 1]]}
    ]
[W] profileSharing0806 is on by default in TensorRT 10.0. This flag is deprecated and has no effect.
[I] Building engine with configuration:
    Flags                  | []
    Engine Capability      | EngineCapability.STANDARD
    Memory Pools           | [WORKSPACE: 22723.50 MiB, TACTIC_DRAM: 22723.50 MiB, TACTIC_SHARED_MEMORY: 1024.00 MiB]
    Tactic Sources         | [EDGE_MASK_CONVOLUTIONS, JIT_CONVOLUTIONS]
    Profiling Verbosity    | ProfilingVerbosity.DETAILED
    Preview Features       | [PROFILE_SHARING_0806]
[I] Finished engine building in 10.985 seconds
[I] trt-runner-N0-03/31/25-04:14:34    
    ---- Inference Input(s) ----
    {input_ids [dtype=int64, shape=(1, 1)],
     attention_mask [dtype=int64, shape=(1, 1)]}
[I] trt-runner-N0-03/31/25-04:14:34    
    ---- Inference Output(s) ----
    {logits [dtype=float32, shape=(1, 2)]}
[I] trt-runner-N0-03/31/25-04:14:34     | Completed 1 iteration(s) in 16.65 ms | Average inference time: 16.65 ms.
[I] onnxrt-runner-N0-03/31/25-04:14:34  | Activating and starting inference
[I] Creating ONNX-Runtime Inference Session with providers: ['CUDAExecutionProvider']
2025-03-31 04:14:53.591558834 [W:onnxruntime:, transformer_memcpy.cc:83 ApplyImpl] 28 Memcpy nodes are added to the graph main_graph for CUDAExecutionProvider. It might have negative impact on performance (including unable to run CUDA graph). Set session_options.log_severity_level=1 to see the detail logs before this message.
2025-03-31 04:14:53.603006616 [W:onnxruntime:, session_state.cc:1263 VerifyEachNodeIsAssignedToAnEp] Some nodes were not assigned to the preferred execution providers which may or may not have an negative impact on performance. e.g. ORT explicitly assigns shape related ops to CPU to improve perf.
2025-03-31 04:14:53.603031695 [W:onnxruntime:, session_state.cc:1265 VerifyEachNodeIsAssignedToAnEp] Rerunning with verbose output on a non-minimal build will show node assignments.
[I] onnxrt-runner-N0-03/31/25-04:14:34 
    ---- Inference Input(s) ----
    {input_ids [dtype=int64, shape=(1, 1)],
     attention_mask [dtype=int64, shape=(1, 1)]}
[I] onnxrt-runner-N0-03/31/25-04:14:34 
    ---- Inference Output(s) ----
    {logits [dtype=float32, shape=(1, 2)]}
[I] onnxrt-runner-N0-03/31/25-04:14:34  | Completed 1 iteration(s) in 68.37 ms | Average inference time: 68.37 ms.
[I] Accuracy Comparison | trt-runner-N0-03/31/25-04:14:34 vs. onnxrt-runner-N0-03/31/25-04:14:34
[I]     Comparing Output: 'logits' (dtype=float32, shape=(1, 2)) with 'logits' (dtype=float32, shape=(1, 2))
[I]         Tolerance: [abs=1e-05, rel=1e-05] | Checking elemwise error
[I]         trt-runner-N0-03/31/25-04:14:34: logits | Stats: mean=-0.066662, std-dev=1.567, var=2.4556, median=-0.066662, min=-1.6337 at (0, 0), max=1.5004 at (0, 1), avg-magnitude=1.567, p90=1.187, p95=1.3437, p99=1.469
[I]         onnxrt-runner-N0-03/31/25-04:14:34: logits | Stats: mean=-0.066663, std-dev=1.567, var=2.4556, median=-0.066663, min=-1.6337 at (0, 0), max=1.5004 at (0, 1), avg-magnitude=1.567, p90=1.187, p95=1.3437, p99=1.469
[I]         Error Metrics: logits
[I]             Minimum Required Tolerance: elemwise error | [abs=2.0266e-06] OR [rel=1.3507e-06] (requirements may be lower if both abs/rel tolerances are set)
[I]             Absolute Difference | Stats: mean=1.5497e-06, std-dev=4.7684e-07, var=2.2737e-13, median=1.5497e-06, min=1.0729e-06 at (0, 0), max=2.0266e-06 at (0, 1), avg-magnitude=1.5497e-06, p90=1.9312e-06, p95=1.9789e-06, p99=2.017e-06
[I]             Relative Difference | Stats: mean=1.0037e-06, std-dev=3.4699e-07, var=1.204e-13, median=1.0037e-06, min=6.5672e-07 at (0, 0), max=1.3507e-06 at (0, 1), avg-magnitude=1.0037e-06, p90=1.2813e-06, p95=1.316e-06, p99=1.3438e-06
[I]         PASSED | Output: 'logits' | Difference is within tolerance (rel=1e-05, abs=1e-05)
[I]     PASSED | All outputs matched | Outputs: ['logits']
[I] Accuracy Summary | trt-runner-N0-03/31/25-04:14:34 vs. onnxrt-runner-N0-03/31/25-04:14:34 | Passed: 1/1 iterations | Pass Rate: 100.0%
[I] PASSED | Runtime: 19.832s | Command: /usr/local/bin/polygraphy run model.onnx --trt --onnxrt --execution-providers=cuda
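To compare the runners across the four polygraphy runs above, the "Average inference time" values can be pulled out of the logs. A small hypothetical parser (the helper `average_times` is mine, written against the log format shown above):

```python
import re

# Hypothetical helper: extract "Average inference time: X ms" values from
# polygraphy log output like the runs shown above.
AVG_RE = re.compile(r"Average inference time: ([\d.]+) ms")

def average_times(log: str) -> list:
    """Return the average inference times (ms) in the order they appear."""
    return [float(m) for m in AVG_RE.findall(log)]

sample = """\
[I] trt-runner-N0 | Completed 1 iteration(s) in 16.38 ms | Average inference time: 16.38 ms.
[I] onnxrt-runner-N0 | Completed 1 iteration(s) in 34.52 ms | Average inference time: 34.52 ms.
"""
print(average_times(sample))  # -> [16.38, 34.52]
```

Note that these polygraphy numbers come from a single iteration at shape [1, 1], so they measure little beyond launch overhead; the trtexec latencies above are the more meaningful comparison.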
pip list
Package                  Version
------------------------ -------------
annotated-types          0.7.0
anyio                    4.8.0
astunparse               1.6.3
blinker                  1.7.0
certifi                  2025.1.31
colored                  2.3.0
coloredlogs              15.0.1
cryptography             41.0.7
cupy-cuda12x             13.3.0
dbus-python              1.3.2
distlib                  0.3.9
distro                   1.9.0
dm-tree                  0.1.8
fastapi                  0.115.6
fastrlock                0.8.3
filelock                 3.17.0
flatbuffers              25.2.10
gast                     0.6.0
h11                      0.14.0
httpcore                 1.0.7
httplib2                 0.20.4
httpx                    0.27.2
humanfriendly            10.0
idna                     3.10
iniconfig                2.0.0
jiter                    0.8.2
launchpadlib             1.11.0
lazr.restfulclient       0.14.6
lazr.uri                 1.0.6
mpmath                   1.3.0
numpy                    1.26.4
nvidia-cuda-runtime-cu12 12.8.90
nvidia-dali-cuda120      1.44.0
nvidia-nvimgcodec-cu12   0.3.0.5
oauthlib                 3.2.2
onnx                     1.17.0
onnxruntime-gpu          1.21.0
openai                   1.60.0
packaging                24.2
pip                      24.0
platformdirs             4.3.6
pluggy                   1.5.0
polygraphy               0.49.20
protobuf                 6.30.2
pydantic                 2.10.6
pydantic_core            2.27.2
PyGObject                3.48.2
PyJWT                    2.7.0
pyparsing                3.1.1
pytest                   8.3.4
python-apt               2.7.7+ubuntu4
setuptools               68.1.2
six                      1.16.0
sniffio                  1.3.1
starlette                0.41.3
sympy                    1.13.3
tensorrt                 10.9.0.34
tensorrt-cu12            10.9.0.34
tensorrt_cu12_bindings   10.9.0.34
tensorrt_cu12_libs       10.9.0.34
tqdm                     4.67.1
tritonfrontend           2.55.0
tritonserver             0.0.0
typing_extensions        4.12.2
virtualenv               20.29.2
wadllib                  1.3.6
wheel                    0.45.1

cc @lix19937

Metadata

    Labels

    Module:Performance (General performance issues), triaged (Issue has been triaged by maintainers)
