
TensorRT optimization shows unexpected results #4405

@geraldstanje

Description

Hi,

I'm trying to create a TensorRT engine from an ONNX model.

I tried a few configurations; here are the inference latencies. Why does 3 perform worse than 2, and why does 4 show no improvement over 2?

  1. FP32 (default): 5.2ms
  2. FP16: 2.7ms
  3. INT8: 5.8ms
  4. FP16 + builderOptimizationLevel=5: 2.7ms

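A quick sanity check on the numbers above (a minimal sketch; the latencies are the ones reported in this issue). One plausible explanation, to be confirmed against the verbose build logs: for a transformer ONNX graph with no Q/DQ nodes, `trtexec --int8` uses implicit quantization with placeholder dynamic ranges (no calibration cache is supplied here), so layers without INT8 kernels fall back to FP32/FP16 and extra reformat layers are inserted, which can make the INT8 engine slower than a pure FP16 build.

```python
# Speedups relative to the FP32 baseline, using the latencies reported above (ms).
latencies = {
    "fp32": 5.2,
    "fp16": 2.7,
    "int8": 5.8,
    "fp16_opt5": 2.7,
}

def speedup_vs_fp32(name: str) -> float:
    """Return how many times faster a configuration is than the FP32 baseline."""
    return latencies["fp32"] / latencies[name]

for name in ("fp16", "int8", "fp16_opt5"):
    print(f"{name}: {speedup_vs_fp32(name):.2f}x")
# fp16: 1.93x, int8: 0.90x (i.e. slower than FP32), fp16_opt5: 1.93x
```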
trtexec runs:

  1. FP32
#!/bin/bash

ONNX_MODEL_NAME=$1 # model.onnx
TRT_MODEL_NAME=$2 # model.plan
WORKSPACE=14000

alias trtexec="/usr/src/tensorrt/bin/trtexec"

# convert onnx model to trt model
/usr/src/tensorrt/bin/trtexec \
    --onnx=${ONNX_MODEL_NAME} \
    --saveEngine=${TRT_MODEL_NAME} \
    --minShapes=input_ids:1x1,attention_mask:1x1 \
    --optShapes=input_ids:1x100,attention_mask:1x100 \
    --maxShapes=input_ids:1x4000,attention_mask:1x4000 \
    --memPoolSize=workspace:${WORKSPACE} \
    --verbose \
| tee conversion.txt

# run generated trt model
/usr/src/tensorrt/bin/trtexec --loadEngine=${TRT_MODEL_NAME} --verbose #--dumpProfile
  2. FP16
#!/bin/bash

ONNX_MODEL_NAME=$1 # model.onnx
TRT_MODEL_NAME=$2 # model.plan
WORKSPACE=14000

alias trtexec="/usr/src/tensorrt/bin/trtexec"

# convert onnx model to trt model
/usr/src/tensorrt/bin/trtexec \
    --onnx=${ONNX_MODEL_NAME} \
    --saveEngine=${TRT_MODEL_NAME} \
    --fp16 \
    --minShapes=input_ids:1x1,attention_mask:1x1 \
    --optShapes=input_ids:1x100,attention_mask:1x100 \
    --maxShapes=input_ids:1x4000,attention_mask:1x4000 \
    --memPoolSize=workspace:${WORKSPACE} \
    --verbose \
| tee conversion.txt

# run generated trt model
/usr/src/tensorrt/bin/trtexec --loadEngine=${TRT_MODEL_NAME} --verbose #--dumpProfile
  3. INT8
#!/bin/bash

ONNX_MODEL_NAME=$1 # model.onnx
TRT_MODEL_NAME=$2 # model.plan
WORKSPACE=14000

alias trtexec="/usr/src/tensorrt/bin/trtexec"

# convert onnx model to trt model
/usr/src/tensorrt/bin/trtexec \
    --onnx=${ONNX_MODEL_NAME} \
    --saveEngine=${TRT_MODEL_NAME} \
    --int8 \
    --minShapes=input_ids:1x1,attention_mask:1x1 \
    --optShapes=input_ids:1x100,attention_mask:1x100 \
    --maxShapes=input_ids:1x4000,attention_mask:1x4000 \
    --memPoolSize=workspace:${WORKSPACE} \
    --verbose \
| tee conversion.txt

# run generated trt model
/usr/src/tensorrt/bin/trtexec --loadEngine=${TRT_MODEL_NAME} --verbose #--dumpProfile
  4. FP16 + builderOptimizationLevel=5
#!/bin/bash

ONNX_MODEL_NAME=$1 # model.onnx
TRT_MODEL_NAME=$2 # model.plan
WORKSPACE=14000

alias trtexec="/usr/src/tensorrt/bin/trtexec"

# convert onnx model to trt model
/usr/src/tensorrt/bin/trtexec \
    --onnx=${ONNX_MODEL_NAME} \
    --saveEngine=${TRT_MODEL_NAME} \
    --fp16 \
    --minShapes=input_ids:1x1,attention_mask:1x1 \
    --optShapes=input_ids:1x100,attention_mask:1x100 \
    --maxShapes=input_ids:1x4000,attention_mask:1x4000 \
    --memPoolSize=workspace:${WORKSPACE} \
    --builderOptimizationLevel=5 \
    --verbose \
| tee conversion.txt

# run generated trt model
/usr/src/tensorrt/bin/trtexec --loadEngine=${TRT_MODEL_NAME} --verbose #--dumpProfile
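The four scripts above differ only in the precision and builder flags. A small illustrative helper (the function name `build_trtexec_cmd` is mine, not part of any tool) that assembles the same command line makes the deltas explicit:

```python
# Illustrative helper: builds the trtexec argv used in the scripts above,
# so the four build variants differ only in the extra flags passed in.
def build_trtexec_cmd(onnx_model: str, engine: str, extra_flags=(),
                      workspace_mib: int = 14000) -> list:
    return [
        "/usr/src/tensorrt/bin/trtexec",
        f"--onnx={onnx_model}",
        f"--saveEngine={engine}",
        *extra_flags,
        "--minShapes=input_ids:1x1,attention_mask:1x1",
        "--optShapes=input_ids:1x100,attention_mask:1x100",
        "--maxShapes=input_ids:1x4000,attention_mask:1x4000",
        f"--memPoolSize=workspace:{workspace_mib}",
        "--verbose",
    ]

# The four builds from this issue:
fp32 = build_trtexec_cmd("model.onnx", "model.plan")
fp16 = build_trtexec_cmd("model.onnx", "model.plan", ["--fp16"])
int8 = build_trtexec_cmd("model.onnx", "model.plan", ["--int8"])
fp16_o5 = build_trtexec_cmd("model.onnx", "model.plan",
                            ["--fp16", "--builderOptimizationLevel=5"])
```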

Logs:
trt_fp16.txt
trt_fp32.txt
trt_fp16_optimization_5.txt.zip
trt_int8.txt.zip

Environment

Triton Inference Server Version: 25.02

TensorRT Version: 10.8.0.43 (I think that's the version that ships with Triton Inference Server 25.02; see: https://docs.nvidia.com/deeplearning/triton-inference-server/release-notes/rel-25-02.html)

trtexec: v100800

NVIDIA GPU: NVIDIA A10G

nvidia-smi

Mon Mar 31 03:23:10 2025      

+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.230.02             Driver Version: 535.230.02   CUDA Version: 12.2     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  NVIDIA A10G                    Off | 00000000:00:1E.0 Off |                    0 |
|  0%   18C    P8              15W / 300W |      0MiB / 23028MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
+---------------------------------------------------------------------------------------+
| Processes:                                                                            |
|  GPU   GI   CI        PID   Type   Process name                            GPU Memory |
|        ID   ID                                                             Usage      |
|=======================================================================================|
|  No running processes found                                                           |
+---------------------------------------------------------------------------------------+

NVIDIA Driver Version:

CUDA Version:

CUDNN Version:

Operating System:

Python Version (if applicable):

Tensorflow Version (if applicable):

PyTorch Version (if applicable):

Baremetal or Container (if so, version):

Relevant Files

Model link:

Steps To Reproduce

Commands or scripts:

Have you tried the latest release?:

Can this model run on other frameworks? For example run ONNX model with ONNXRuntime (polygraphy run <model.onnx> --onnxrt):

polygraphy inspect model model.onnx      
[I] Loading model: /workspace/model.onnx
[I] ==== ONNX Model ====
    Name: main_graph | ONNX Opset: 14
    ---- 2 Graph Input(s) ----
    {input_ids [dtype=int64, shape=('batch_size', 'sequence_length')],
     attention_mask [dtype=int64, shape=('batch_size', 'sequence_length')]}

    ---- 1 Graph Output(s) ----
    {logits [dtype=float32, shape=('batch_size', 2)]}

    ---- 174 Initializer(s) ----

    ---- 4152 Node(s) ----
polygraphy run model.onnx --onnxrt
[I] RUNNING | Command: /usr/local/bin/polygraphy run model.onnx --onnxrt
[I] onnxrt-runner-N0-03/31/25-03:53:39  | Activating and starting inference
[I] Creating ONNX-Runtime Inference Session with providers: ['CPUExecutionProvider']
[W] Input tensor: input_ids [shape=BoundedShape(['batch_size', 'sequence_length'], min=None, max=None)] | Will generate data of shape: [1, 1].
    If this is incorrect, please provide a custom data loader.
[W] Input tensor: attention_mask [shape=BoundedShape(['batch_size', 'sequence_length'], min=None, max=None)] | Will generate data of shape: [1, 1].
    If this is incorrect, please provide a custom data loader.
[I] onnxrt-runner-N0-03/31/25-03:53:39
    ---- Inference Input(s) ----
    {input_ids [dtype=int64, shape=(1, 1)],
     attention_mask [dtype=int64, shape=(1, 1)]}
[I] onnxrt-runner-N0-03/31/25-03:53:39
    ---- Inference Output(s) ----
    {logits [dtype=float32, shape=(1, 2)]}
[I] onnxrt-runner-N0-03/31/25-03:53:39  | Completed 1 iteration(s) in 34.52 ms | Average inference time: 34.52 ms.
[I] PASSED | Runtime: 4.336s | Command: /usr/local/bin/polygraphy run model.onnx --onnxrt
polygraphy run model.onnx --trt --onnxrt
[I] RUNNING | Command: /usr/local/bin/polygraphy run model.onnx --trt --onnxrt
[I] TF32 is disabled by default. Turn on TF32 for better performance with minor accuracy differences.
[I] trt-runner-N0-03/31/25-03:52:36     | Activating and starting inference
[W] ModelImporter.cpp:459: Make sure input input_ids has Int64 binding.
[W] ModelImporter.cpp:459: Make sure input attention_mask has Int64 binding.
[W] Input tensor: input_ids (dtype=DataType.INT64, shape=(-1, -1)) | No shapes provided; Will use shape: [1, 1] for min/opt/max in profile.
[W] This will cause the tensor to have a static shape. If this is incorrect, please set the range of shapes for this input tensor.
[W] Input tensor: attention_mask (dtype=DataType.INT64, shape=(-1, -1)) | No shapes provided; Will use shape: [1, 1] for min/opt/max in profile.
[I] Configuring with profiles:[
        Profile 0:
            {input_ids [min=[1, 1], opt=[1, 1], max=[1, 1]],
             attention_mask [min=[1, 1], opt=[1, 1], max=[1, 1]]}
    ]
[W] profileSharing0806 is on by default in TensorRT 10.0. This flag is deprecated and has no effect.
[I] Building engine with configuration:
    Flags                  | []
    Engine Capability      | EngineCapability.STANDARD
    Memory Pools           | [WORKSPACE: 22723.50 MiB, TACTIC_DRAM: 22723.50 MiB, TACTIC_SHARED_MEMORY: 1024.00 MiB]
    Tactic Sources         | [EDGE_MASK_CONVOLUTIONS, JIT_CONVOLUTIONS]
    Profiling Verbosity    | ProfilingVerbosity.DETAILED
    Preview Features       | [PROFILE_SHARING_0806]
[I] Finished engine building in 11.023 seconds
[I] trt-runner-N0-03/31/25-03:52:36   
    ---- Inference Input(s) ----
    {input_ids [dtype=int64, shape=(1, 1)],
     attention_mask [dtype=int64, shape=(1, 1)]}
[I] trt-runner-N0-03/31/25-03:52:36   
    ---- Inference Output(s) ----
    {logits [dtype=float32, shape=(1, 2)]}
[I] trt-runner-N0-03/31/25-03:52:36     | Completed 1 iteration(s) in 16.38 ms | Average inference time: 16.38 ms.
[I] onnxrt-runner-N0-03/31/25-03:52:36  | Activating and starting inference
[I] Creating ONNX-Runtime Inference Session with providers: ['CPUExecutionProvider']
[I] onnxrt-runner-N0-03/31/25-03:52:36
    ---- Inference Input(s) ----
    {input_ids [dtype=int64, shape=(1, 1)],
     attention_mask [dtype=int64, shape=(1, 1)]}
[I] onnxrt-runner-N0-03/31/25-03:52:36
    ---- Inference Output(s) ----
    {logits [dtype=float32, shape=(1, 2)]}
[I] onnxrt-runner-N0-03/31/25-03:52:36  | Completed 1 iteration(s) in 31.44 ms | Average inference time: 31.44 ms.
[I] Accuracy Comparison | trt-runner-N0-03/31/25-03:52:36 vs. onnxrt-runner-N0-03/31/25-03:52:36
[I]     Comparing Output: 'logits' (dtype=float32, shape=(1, 2)) with 'logits' (dtype=float32, shape=(1, 2))
[I]         Tolerance: [abs=1e-05, rel=1e-05] | Checking elemwise error
[I]         trt-runner-N0-03/31/25-03:52:36: logits | Stats: mean=-0.066662, std-dev=1.567, var=2.4556, median=-0.066662, min=-1.6337 at (0, 0), max=1.5004 at (0, 1), avg-magnitude=1.567, p90=1.187, p95=1.3437, p99=1.469
[I]         onnxrt-runner-N0-03/31/25-03:52:36: logits | Stats: mean=-0.066662, std-dev=1.567, var=2.4556, median=-0.066662, min=-1.6337 at (0, 0), max=1.5004 at (0, 1), avg-magnitude=1.567, p90=1.187, p95=1.3437, p99=1.469
[I]         Error Metrics: logits
[I]             Minimum Required Tolerance: elemwise error | [abs=3.9339e-06] OR [rel=2.622e-06] (requirements may be lower if both abs/rel tolerances are set)
[I]             Absolute Difference | Stats: mean=3.6955e-06, std-dev=2.3842e-07, var=5.6843e-14, median=3.6955e-06, min=3.4571e-06 at (0, 0), max=3.9339e-06 at (0, 1), avg-magnitude=3.6955e-06, p90=3.8862e-06, p95=3.9101e-06, p99=3.9291e-06
[I]             Relative Difference | Stats: mean=2.369e-06, std-dev=2.5293e-07, var=6.3972e-14, median=2.369e-06, min=2.1161e-06 at (0, 0), max=2.622e-06 at (0, 1), avg-magnitude=2.369e-06, p90=2.5714e-06, p95=2.5967e-06, p99=2.6169e-06
[I]         PASSED | Output: 'logits' | Difference is within tolerance (rel=1e-05, abs=1e-05)
[I]     PASSED | All outputs matched | Outputs: ['logits']
[I] Accuracy Summary | trt-runner-N0-03/31/25-03:52:36 vs. onnxrt-runner-N0-03/31/25-03:52:36 | Passed: 1/1 iterations | Pass Rate: 100.0%
[I] PASSED | Runtime: 21.399s | Command: /usr/local/bin/polygraphy run model.onnx --trt --onnxrt
polygraphy run model.onnx --onnxrt --execution-providers=cuda
[I] RUNNING | Command: /usr/local/bin/polygraphy run model.onnx --onnxrt --execution-providers=cuda
[I] onnxrt-runner-N0-03/31/25-04:15:38  | Activating and starting inference
[I] Creating ONNX-Runtime Inference Session with providers: ['CUDAExecutionProvider']
2025-03-31 04:15:40.686638615 [W:onnxruntime:, transformer_memcpy.cc:83 ApplyImpl] 28 Memcpy nodes are added to the graph main_graph for CUDAExecutionProvider. It might have negative impact on performance (including unable to run CUDA graph). Set session_options.log_severity_level=1 to see the detail logs before this message.
2025-03-31 04:15:40.697494382 [W:onnxruntime:, session_state.cc:1263 VerifyEachNodeIsAssignedToAnEp] Some nodes were not assigned to the preferred execution providers which may or may not have an negative impact on performance. e.g. ORT explicitly assigns shape related ops to CPU to improve perf.
2025-03-31 04:15:40.697516462 [W:onnxruntime:, session_state.cc:1265 VerifyEachNodeIsAssignedToAnEp] Rerunning with verbose output on a non-minimal build will show node assignments.
[W] Input tensor: input_ids [shape=BoundedShape(['batch_size', 'sequence_length'], min=None, max=None)] | Will generate data of shape: [1, 1].
    If this is incorrect, please provide a custom data loader.
[W] Input tensor: attention_mask [shape=BoundedShape(['batch_size', 'sequence_length'], min=None, max=None)] | Will generate data of shape: [1, 1].
    If this is incorrect, please provide a custom data loader.
[I] onnxrt-runner-N0-03/31/25-04:15:38 
    ---- Inference Input(s) ----
    {input_ids [dtype=int64, shape=(1, 1)],
     attention_mask [dtype=int64, shape=(1, 1)]}
[I] onnxrt-runner-N0-03/31/25-04:15:38 
    ---- Inference Output(s) ----
    {logits [dtype=float32, shape=(1, 2)]}
[I] onnxrt-runner-N0-03/31/25-04:15:38  | Completed 1 iteration(s) in 145.6 ms | Average inference time: 145.6 ms.
[I] PASSED | Runtime: 2.972s | Command: /usr/local/bin/polygraphy run model.onnx --onnxrt --execution-providers=cuda
polygraphy run model.onnx --trt --onnxrt --execution-providers=cuda
[I] RUNNING | Command: /usr/local/bin/polygraphy run model.onnx --trt --onnxrt --execution-providers=cuda
[I] TF32 is disabled by default. Turn on TF32 for better performance with minor accuracy differences.
[I] trt-runner-N0-03/31/25-04:14:34     | Activating and starting inference
[W] ModelImporter.cpp:459: Make sure input input_ids has Int64 binding.
[W] ModelImporter.cpp:459: Make sure input attention_mask has Int64 binding.
[W] Input tensor: input_ids (dtype=DataType.INT64, shape=(-1, -1)) | No shapes provided; Will use shape: [1, 1] for min/opt/max in profile.
[W] This will cause the tensor to have a static shape. If this is incorrect, please set the range of shapes for this input tensor.
[W] Input tensor: attention_mask (dtype=DataType.INT64, shape=(-1, -1)) | No shapes provided; Will use shape: [1, 1] for min/opt/max in profile.
[I] Configuring with profiles:[
        Profile 0:
            {input_ids [min=[1, 1], opt=[1, 1], max=[1, 1]],
             attention_mask [min=[1, 1], opt=[1, 1], max=[1, 1]]}
    ]
[W] profileSharing0806 is on by default in TensorRT 10.0. This flag is deprecated and has no effect.
[I] Building engine with configuration:
    Flags                  | []
    Engine Capability      | EngineCapability.STANDARD
    Memory Pools           | [WORKSPACE: 22723.50 MiB, TACTIC_DRAM: 22723.50 MiB, TACTIC_SHARED_MEMORY: 1024.00 MiB]
    Tactic Sources         | [EDGE_MASK_CONVOLUTIONS, JIT_CONVOLUTIONS]
    Profiling Verbosity    | ProfilingVerbosity.DETAILED
    Preview Features       | [PROFILE_SHARING_0806]
[I] Finished engine building in 10.985 seconds
[I] trt-runner-N0-03/31/25-04:14:34    
    ---- Inference Input(s) ----
    {input_ids [dtype=int64, shape=(1, 1)],
     attention_mask [dtype=int64, shape=(1, 1)]}
[I] trt-runner-N0-03/31/25-04:14:34    
    ---- Inference Output(s) ----
    {logits [dtype=float32, shape=(1, 2)]}
[I] trt-runner-N0-03/31/25-04:14:34     | Completed 1 iteration(s) in 16.65 ms | Average inference time: 16.65 ms.
[I] onnxrt-runner-N0-03/31/25-04:14:34  | Activating and starting inference
[I] Creating ONNX-Runtime Inference Session with providers: ['CUDAExecutionProvider']
2025-03-31 04:14:53.591558834 [W:onnxruntime:, transformer_memcpy.cc:83 ApplyImpl] 28 Memcpy nodes are added to the graph main_graph for CUDAExecutionProvider. It might have negative impact on performance (including unable to run CUDA graph). Set session_options.log_severity_level=1 to see the detail logs before this message.
2025-03-31 04:14:53.603006616 [W:onnxruntime:, session_state.cc:1263 VerifyEachNodeIsAssignedToAnEp] Some nodes were not assigned to the preferred execution providers which may or may not have an negative impact on performance. e.g. ORT explicitly assigns shape related ops to CPU to improve perf.
2025-03-31 04:14:53.603031695 [W:onnxruntime:, session_state.cc:1265 VerifyEachNodeIsAssignedToAnEp] Rerunning with verbose output on a non-minimal build will show node assignments.
[I] onnxrt-runner-N0-03/31/25-04:14:34 
    ---- Inference Input(s) ----
    {input_ids [dtype=int64, shape=(1, 1)],
     attention_mask [dtype=int64, shape=(1, 1)]}
[I] onnxrt-runner-N0-03/31/25-04:14:34 
    ---- Inference Output(s) ----
    {logits [dtype=float32, shape=(1, 2)]}
[I] onnxrt-runner-N0-03/31/25-04:14:34  | Completed 1 iteration(s) in 68.37 ms | Average inference time: 68.37 ms.
[I] Accuracy Comparison | trt-runner-N0-03/31/25-04:14:34 vs. onnxrt-runner-N0-03/31/25-04:14:34
[I]     Comparing Output: 'logits' (dtype=float32, shape=(1, 2)) with 'logits' (dtype=float32, shape=(1, 2))
[I]         Tolerance: [abs=1e-05, rel=1e-05] | Checking elemwise error
[I]         trt-runner-N0-03/31/25-04:14:34: logits | Stats: mean=-0.066662, std-dev=1.567, var=2.4556, median=-0.066662, min=-1.6337 at (0, 0), max=1.5004 at (0, 1), avg-magnitude=1.567, p90=1.187, p95=1.3437, p99=1.469
[I]         onnxrt-runner-N0-03/31/25-04:14:34: logits | Stats: mean=-0.066663, std-dev=1.567, var=2.4556, median=-0.066663, min=-1.6337 at (0, 0), max=1.5004 at (0, 1), avg-magnitude=1.567, p90=1.187, p95=1.3437, p99=1.469
[I]         Error Metrics: logits
[I]             Minimum Required Tolerance: elemwise error | [abs=2.0266e-06] OR [rel=1.3507e-06] (requirements may be lower if both abs/rel tolerances are set)
[I]             Absolute Difference | Stats: mean=1.5497e-06, std-dev=4.7684e-07, var=2.2737e-13, median=1.5497e-06, min=1.0729e-06 at (0, 0), max=2.0266e-06 at (0, 1), avg-magnitude=1.5497e-06, p90=1.9312e-06, p95=1.9789e-06, p99=2.017e-06
[I]             Relative Difference | Stats: mean=1.0037e-06, std-dev=3.4699e-07, var=1.204e-13, median=1.0037e-06, min=6.5672e-07 at (0, 0), max=1.3507e-06 at (0, 1), avg-magnitude=1.0037e-06, p90=1.2813e-06, p95=1.316e-06, p99=1.3438e-06
[I]         PASSED | Output: 'logits' | Difference is within tolerance (rel=1e-05, abs=1e-05)
[I]     PASSED | All outputs matched | Outputs: ['logits']
[I] Accuracy Summary | trt-runner-N0-03/31/25-04:14:34 vs. onnxrt-runner-N0-03/31/25-04:14:34 | Passed: 1/1 iterations | Pass Rate: 100.0%
[I] PASSED | Runtime: 19.832s | Command: /usr/local/bin/polygraphy run model.onnx --trt --onnxrt --execution-providers=cuda
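To compare the runners across the four polygraphy runs above, the "Average inference time" values can be pulled out of the logs. A small hypothetical parser (the helper `average_times` is mine, written against the log format shown above):

```python
import re

# Hypothetical helper: extract "Average inference time: X ms" values from
# polygraphy log output like the runs shown above.
AVG_RE = re.compile(r"Average inference time: ([\d.]+) ms")

def average_times(log: str) -> list:
    """Return the average inference times (ms) in the order they appear."""
    return [float(m) for m in AVG_RE.findall(log)]

sample = """\
[I] trt-runner-N0 | Completed 1 iteration(s) in 16.38 ms | Average inference time: 16.38 ms.
[I] onnxrt-runner-N0 | Completed 1 iteration(s) in 34.52 ms | Average inference time: 34.52 ms.
"""
print(average_times(sample))  # -> [16.38, 34.52]
```

Note that these polygraphy numbers come from a single iteration at shape [1, 1], so they measure little beyond launch overhead; the trtexec latencies above are the more meaningful comparison.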
pip list
Package                  Version
------------------------ -------------
annotated-types          0.7.0
anyio                    4.8.0
astunparse               1.6.3
blinker                  1.7.0
certifi                  2025.1.31
colored                  2.3.0
coloredlogs              15.0.1
cryptography             41.0.7
cupy-cuda12x             13.3.0
dbus-python              1.3.2
distlib                  0.3.9
distro                   1.9.0
dm-tree                  0.1.8
fastapi                  0.115.6
fastrlock                0.8.3
filelock                 3.17.0
flatbuffers              25.2.10
gast                     0.6.0
h11                      0.14.0
httpcore                 1.0.7
httplib2                 0.20.4
httpx                    0.27.2
humanfriendly            10.0
idna                     3.10
iniconfig                2.0.0
jiter                    0.8.2
launchpadlib             1.11.0
lazr.restfulclient       0.14.6
lazr.uri                 1.0.6
mpmath                   1.3.0
numpy                    1.26.4
nvidia-cuda-runtime-cu12 12.8.90
nvidia-dali-cuda120      1.44.0
nvidia-nvimgcodec-cu12   0.3.0.5
oauthlib                 3.2.2
onnx                     1.17.0
onnxruntime-gpu          1.21.0
openai                   1.60.0
packaging                24.2
pip                      24.0
platformdirs             4.3.6
pluggy                   1.5.0
polygraphy               0.49.20
protobuf                 6.30.2
pydantic                 2.10.6
pydantic_core            2.27.2
PyGObject                3.48.2
PyJWT                    2.7.0
pyparsing                3.1.1
pytest                   8.3.4
python-apt               2.7.7+ubuntu4
setuptools               68.1.2
six                      1.16.0
sniffio                  1.3.1
starlette                0.41.3
sympy                    1.13.3
tensorrt                 10.9.0.34
tensorrt-cu12            10.9.0.34
tensorrt_cu12_bindings   10.9.0.34
tensorrt_cu12_libs       10.9.0.34
tqdm                     4.67.1
tritonfrontend           2.55.0
tritonserver             0.0.0
typing_extensions        4.12.2
virtualenv               20.29.2
wadllib                  1.3.6
wheel                    0.45.1

cc @lix19937

Metadata

    Labels

    Module:Performance (General performance issues), triaged (Issue has been triaged by maintainers)
