
Tensorrt optimization shows unexpected results #4405


Open
geraldstanje opened this issue Mar 31, 2025 · 5 comments

Comments

@geraldstanje

geraldstanje commented Mar 31, 2025

Hi,

I am trying to create a TensorRT engine from an ONNX model.

I tried a few things; here are the inference latencies. Why do 3 and 4 not perform better than 2?

  1. FP32 (default): 5.2ms
  2. FP16: 2.7ms
  3. INT8: 5.8ms
  4. FP16 + builderOptimizationLevel=5: 2.7ms

trtexec runs:

  1. FP32
#!/bin/bash

ONNX_MODEL_NAME=$1 # model.onnx
TRT_MODEL_NAME=$2 # model.plan
WORKSPACE=14000

alias trtexec="/usr/src/tensorrt/bin/trtexec"

# convert onnx model to trt model
/usr/src/tensorrt/bin/trtexec \
    --onnx=${ONNX_MODEL_NAME} \
    --saveEngine=${TRT_MODEL_NAME} \
    --minShapes=input_ids:1x1,attention_mask:1x1 \
    --optShapes=input_ids:1x100,attention_mask:1x100 \
    --maxShapes=input_ids:1x4000,attention_mask:1x4000 \
    --memPoolSize=workspace:${WORKSPACE} \
    --verbose \
| tee conversion.txt

# run generated trt model
/usr/src/tensorrt/bin/trtexec --loadEngine=${TRT_MODEL_NAME} --verbose #--dumpProfile
  2. FP16
#!/bin/bash

ONNX_MODEL_NAME=$1 # model.onnx
TRT_MODEL_NAME=$2 # model.plan
WORKSPACE=14000

alias trtexec="/usr/src/tensorrt/bin/trtexec"

# convert onnx model to trt model
/usr/src/tensorrt/bin/trtexec \
    --onnx=${ONNX_MODEL_NAME} \
    --saveEngine=${TRT_MODEL_NAME} \
    --fp16 \
    --minShapes=input_ids:1x1,attention_mask:1x1 \
    --optShapes=input_ids:1x100,attention_mask:1x100 \
    --maxShapes=input_ids:1x4000,attention_mask:1x4000 \
    --memPoolSize=workspace:${WORKSPACE} \
    --verbose \
| tee conversion.txt

# run generated trt model
/usr/src/tensorrt/bin/trtexec --loadEngine=${TRT_MODEL_NAME} --verbose #--dumpProfile
  3. INT8
#!/bin/bash

ONNX_MODEL_NAME=$1 # model.onnx
TRT_MODEL_NAME=$2 # model.plan
WORKSPACE=14000

alias trtexec="/usr/src/tensorrt/bin/trtexec"

# convert onnx model to trt model
/usr/src/tensorrt/bin/trtexec \
    --onnx=${ONNX_MODEL_NAME} \
    --saveEngine=${TRT_MODEL_NAME} \
    --int8 \
    --minShapes=input_ids:1x1,attention_mask:1x1 \
    --optShapes=input_ids:1x100,attention_mask:1x100 \
    --maxShapes=input_ids:1x4000,attention_mask:1x4000 \
    --memPoolSize=workspace:${WORKSPACE} \
    --verbose \
| tee conversion.txt

# run generated trt model
/usr/src/tensorrt/bin/trtexec --loadEngine=${TRT_MODEL_NAME} --verbose #--dumpProfile
  4. FP16 + builderOptimizationLevel=5
#!/bin/bash

ONNX_MODEL_NAME=$1 # model.onnx
TRT_MODEL_NAME=$2 # model.plan
WORKSPACE=14000

alias trtexec="/usr/src/tensorrt/bin/trtexec"

# convert onnx model to trt model
/usr/src/tensorrt/bin/trtexec \
    --onnx=${ONNX_MODEL_NAME} \
    --saveEngine=${TRT_MODEL_NAME} \
    --fp16 \
    --minShapes=input_ids:1x1,attention_mask:1x1 \
    --optShapes=input_ids:1x100,attention_mask:1x100 \
    --maxShapes=input_ids:1x4000,attention_mask:1x4000 \
    --memPoolSize=workspace:${WORKSPACE} \
    --builderOptimizationLevel=5 \
    --verbose \
| tee conversion.txt

# run generated trt model
/usr/src/tensorrt/bin/trtexec --loadEngine=${TRT_MODEL_NAME} --verbose #--dumpProfile

Logs:
trt_fp16.txt
trt_fp32.txt
trt_fp16_optimization_5.txt.zip
trt_int8.txt.zip

Environment

Triton Inference Server Version: 25.02

TensorRT Version: 10.8.0.43 (I think that's the version that ships with Triton Inference Server 25.02 - see: https://docs.nvidia.com/deeplearning/triton-inference-server/release-notes/rel-25-02.html)

trtexec: v100800

NVIDIA GPU: NVIDIA A10G

nvidia-smi

Mon Mar 31 03:23:10 2025      

+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.230.02             Driver Version: 535.230.02   CUDA Version: 12.2     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  NVIDIA A10G                    Off | 00000000:00:1E.0 Off |                    0 |
|  0%   18C    P8              15W / 300W |      0MiB / 23028MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
+---------------------------------------------------------------------------------------+
| Processes:                                                                            |
|  GPU   GI   CI        PID   Type   Process name                            GPU Memory |
|        ID   ID                                                             Usage      |
|=======================================================================================|
|  No running processes found                                                           |
+---------------------------------------------------------------------------------------+

NVIDIA Driver Version: 535.230.02

CUDA Version:

CUDNN Version:

Operating System:

Python Version (if applicable):

Tensorflow Version (if applicable):

PyTorch Version (if applicable):

Baremetal or Container (if so, version):

Relevant Files

Model link:

Steps To Reproduce

Commands or scripts:

Have you tried the latest release?:

Can this model run on other frameworks? For example run ONNX model with ONNXRuntime (polygraphy run <model.onnx> --onnxrt):

polygraphy inspect model model.onnx      
[I] Loading model: /workspace/model.onnx
[I] ==== ONNX Model ====
    Name: main_graph | ONNX Opset: 14
    ---- 2 Graph Input(s) ----
    {input_ids [dtype=int64, shape=('batch_size', 'sequence_length')],
     attention_mask [dtype=int64, shape=('batch_size', 'sequence_length')]}

    ---- 1 Graph Output(s) ----
    {logits [dtype=float32, shape=('batch_size', 2)]}

    ---- 174 Initializer(s) ----

    ---- 4152 Node(s) ----
polygraphy run model.onnx --onnxrt
[I] RUNNING | Command: /usr/local/bin/polygraphy run model.onnx --onnxrt
[I] onnxrt-runner-N0-03/31/25-03:53:39  | Activating and starting inference
[I] Creating ONNX-Runtime Inference Session with providers: ['CPUExecutionProvider']
[W] Input tensor: input_ids [shape=BoundedShape(['batch_size', 'sequence_length'], min=None, max=None)] | Will generate data of shape: [1, 1].
    If this is incorrect, please provide a custom data loader.
[W] Input tensor: attention_mask [shape=BoundedShape(['batch_size', 'sequence_length'], min=None, max=None)] | Will generate data of shape: [1, 1].
    If this is incorrect, please provide a custom data loader.
[I] onnxrt-runner-N0-03/31/25-03:53:39
    ---- Inference Input(s) ----
    {input_ids [dtype=int64, shape=(1, 1)],
     attention_mask [dtype=int64, shape=(1, 1)]}
[I] onnxrt-runner-N0-03/31/25-03:53:39
    ---- Inference Output(s) ----
    {logits [dtype=float32, shape=(1, 2)]}
[I] onnxrt-runner-N0-03/31/25-03:53:39  | Completed 1 iteration(s) in 34.52 ms | Average inference time: 34.52 ms.
[I] PASSED | Runtime: 4.336s | Command: /usr/local/bin/polygraphy run model.onnx --onnxrt
polygraphy run model.onnx --trt --onnxrt
[I] RUNNING | Command: /usr/local/bin/polygraphy run model.onnx --trt --onnxrt
[I] TF32 is disabled by default. Turn on TF32 for better performance with minor accuracy differences.
[I] trt-runner-N0-03/31/25-03:52:36     | Activating and starting inference
[W] ModelImporter.cpp:459: Make sure input input_ids has Int64 binding.
[W] ModelImporter.cpp:459: Make sure input attention_mask has Int64 binding.
[W] Input tensor: input_ids (dtype=DataType.INT64, shape=(-1, -1)) | No shapes provided; Will use shape: [1, 1] for min/opt/max in profile.
[W] This will cause the tensor to have a static shape. If this is incorrect, please set the range of shapes for this input tensor.
[W] Input tensor: attention_mask (dtype=DataType.INT64, shape=(-1, -1)) | No shapes provided; Will use shape: [1, 1] for min/opt/max in profile.
[I] Configuring with profiles:[
        Profile 0:
            {input_ids [min=[1, 1], opt=[1, 1], max=[1, 1]],
             attention_mask [min=[1, 1], opt=[1, 1], max=[1, 1]]}
    ]
[W] profileSharing0806 is on by default in TensorRT 10.0. This flag is deprecated and has no effect.
[I] Building engine with configuration:
    Flags                  | []
    Engine Capability      | EngineCapability.STANDARD
    Memory Pools           | [WORKSPACE: 22723.50 MiB, TACTIC_DRAM: 22723.50 MiB, TACTIC_SHARED_MEMORY: 1024.00 MiB]
    Tactic Sources         | [EDGE_MASK_CONVOLUTIONS, JIT_CONVOLUTIONS]
    Profiling Verbosity    | ProfilingVerbosity.DETAILED
    Preview Features       | [PROFILE_SHARING_0806]
[I] Finished engine building in 11.023 seconds
[I] trt-runner-N0-03/31/25-03:52:36   
    ---- Inference Input(s) ----
    {input_ids [dtype=int64, shape=(1, 1)],
     attention_mask [dtype=int64, shape=(1, 1)]}
[I] trt-runner-N0-03/31/25-03:52:36   
    ---- Inference Output(s) ----
    {logits [dtype=float32, shape=(1, 2)]}
[I] trt-runner-N0-03/31/25-03:52:36     | Completed 1 iteration(s) in 16.38 ms | Average inference time: 16.38 ms.
[I] onnxrt-runner-N0-03/31/25-03:52:36  | Activating and starting inference
[I] Creating ONNX-Runtime Inference Session with providers: ['CPUExecutionProvider']
[I] onnxrt-runner-N0-03/31/25-03:52:36
    ---- Inference Input(s) ----
    {input_ids [dtype=int64, shape=(1, 1)],
     attention_mask [dtype=int64, shape=(1, 1)]}
[I] onnxrt-runner-N0-03/31/25-03:52:36
    ---- Inference Output(s) ----
    {logits [dtype=float32, shape=(1, 2)]}
[I] onnxrt-runner-N0-03/31/25-03:52:36  | Completed 1 iteration(s) in 31.44 ms | Average inference time: 31.44 ms.
[I] Accuracy Comparison | trt-runner-N0-03/31/25-03:52:36 vs. onnxrt-runner-N0-03/31/25-03:52:36
[I]     Comparing Output: 'logits' (dtype=float32, shape=(1, 2)) with 'logits' (dtype=float32, shape=(1, 2))
[I]         Tolerance: [abs=1e-05, rel=1e-05] | Checking elemwise error
[I]         trt-runner-N0-03/31/25-03:52:36: logits | Stats: mean=-0.066662, std-dev=1.567, var=2.4556, median=-0.066662, min=-1.6337 at (0, 0), max=1.5004 at (0, 1), avg-magnitude=1.567, p90=1.187, p95=1.3437, p99=1.469
[I]         onnxrt-runner-N0-03/31/25-03:52:36: logits | Stats: mean=-0.066662, std-dev=1.567, var=2.4556, median=-0.066662, min=-1.6337 at (0, 0), max=1.5004 at (0, 1), avg-magnitude=1.567, p90=1.187, p95=1.3437, p99=1.469
[I]         Error Metrics: logits
[I]             Minimum Required Tolerance: elemwise error | [abs=3.9339e-06] OR [rel=2.622e-06] (requirements may be lower if both abs/rel tolerances are set)
[I]             Absolute Difference | Stats: mean=3.6955e-06, std-dev=2.3842e-07, var=5.6843e-14, median=3.6955e-06, min=3.4571e-06 at (0, 0), max=3.9339e-06 at (0, 1), avg-magnitude=3.6955e-06, p90=3.8862e-06, p95=3.9101e-06, p99=3.9291e-06
[I]             Relative Difference | Stats: mean=2.369e-06, std-dev=2.5293e-07, var=6.3972e-14, median=2.369e-06, min=2.1161e-06 at (0, 0), max=2.622e-06 at (0, 1), avg-magnitude=2.369e-06, p90=2.5714e-06, p95=2.5967e-06, p99=2.6169e-06
[I]         PASSED | Output: 'logits' | Difference is within tolerance (rel=1e-05, abs=1e-05)
[I]     PASSED | All outputs matched | Outputs: ['logits']
[I] Accuracy Summary | trt-runner-N0-03/31/25-03:52:36 vs. onnxrt-runner-N0-03/31/25-03:52:36 | Passed: 1/1 iterations | Pass Rate: 100.0%
[I] PASSED | Runtime: 21.399s | Command: /usr/local/bin/polygraphy run model.onnx --trt --onnxrt
polygraphy run model.onnx --onnxrt --execution-providers=cuda
[I] RUNNING | Command: /usr/local/bin/polygraphy run model.onnx --onnxrt --execution-providers=cuda
[I] onnxrt-runner-N0-03/31/25-04:15:38  | Activating and starting inference
[I] Creating ONNX-Runtime Inference Session with providers: ['CUDAExecutionProvider']
2025-03-31 04:15:40.686638615 [W:onnxruntime:, transformer_memcpy.cc:83 ApplyImpl] 28 Memcpy nodes are added to the graph main_graph for CUDAExecutionProvider. It might have negative impact on performance (including unable to run CUDA graph). Set session_options.log_severity_level=1 to see the detail logs before this message.
2025-03-31 04:15:40.697494382 [W:onnxruntime:, session_state.cc:1263 VerifyEachNodeIsAssignedToAnEp] Some nodes were not assigned to the preferred execution providers which may or may not have an negative impact on performance. e.g. ORT explicitly assigns shape related ops to CPU to improve perf.
2025-03-31 04:15:40.697516462 [W:onnxruntime:, session_state.cc:1265 VerifyEachNodeIsAssignedToAnEp] Rerunning with verbose output on a non-minimal build will show node assignments.
[W] Input tensor: input_ids [shape=BoundedShape(['batch_size', 'sequence_length'], min=None, max=None)] | Will generate data of shape: [1, 1].
    If this is incorrect, please provide a custom data loader.
[W] Input tensor: attention_mask [shape=BoundedShape(['batch_size', 'sequence_length'], min=None, max=None)] | Will generate data of shape: [1, 1].
    If this is incorrect, please provide a custom data loader.
[I] onnxrt-runner-N0-03/31/25-04:15:38 
    ---- Inference Input(s) ----
    {input_ids [dtype=int64, shape=(1, 1)],
     attention_mask [dtype=int64, shape=(1, 1)]}
[I] onnxrt-runner-N0-03/31/25-04:15:38 
    ---- Inference Output(s) ----
    {logits [dtype=float32, shape=(1, 2)]}
[I] onnxrt-runner-N0-03/31/25-04:15:38  | Completed 1 iteration(s) in 145.6 ms | Average inference time: 145.6 ms.
[I] PASSED | Runtime: 2.972s | Command: /usr/local/bin/polygraphy run model.onnx --onnxrt --execution-providers=cuda
polygraphy run model.onnx --trt --onnxrt --execution-providers=cuda
[I] RUNNING | Command: /usr/local/bin/polygraphy run model.onnx --trt --onnxrt --execution-providers=cuda
[I] TF32 is disabled by default. Turn on TF32 for better performance with minor accuracy differences.
[I] trt-runner-N0-03/31/25-04:14:34     | Activating and starting inference
[W] ModelImporter.cpp:459: Make sure input input_ids has Int64 binding.
[W] ModelImporter.cpp:459: Make sure input attention_mask has Int64 binding.
[W] Input tensor: input_ids (dtype=DataType.INT64, shape=(-1, -1)) | No shapes provided; Will use shape: [1, 1] for min/opt/max in profile.
[W] This will cause the tensor to have a static shape. If this is incorrect, please set the range of shapes for this input tensor.
[W] Input tensor: attention_mask (dtype=DataType.INT64, shape=(-1, -1)) | No shapes provided; Will use shape: [1, 1] for min/opt/max in profile.
[I] Configuring with profiles:[
        Profile 0:
            {input_ids [min=[1, 1], opt=[1, 1], max=[1, 1]],
             attention_mask [min=[1, 1], opt=[1, 1], max=[1, 1]]}
    ]
[W] profileSharing0806 is on by default in TensorRT 10.0. This flag is deprecated and has no effect.
[I] Building engine with configuration:
    Flags                  | []
    Engine Capability      | EngineCapability.STANDARD
    Memory Pools           | [WORKSPACE: 22723.50 MiB, TACTIC_DRAM: 22723.50 MiB, TACTIC_SHARED_MEMORY: 1024.00 MiB]
    Tactic Sources         | [EDGE_MASK_CONVOLUTIONS, JIT_CONVOLUTIONS]
    Profiling Verbosity    | ProfilingVerbosity.DETAILED
    Preview Features       | [PROFILE_SHARING_0806]
[I] Finished engine building in 10.985 seconds
[I] trt-runner-N0-03/31/25-04:14:34    
    ---- Inference Input(s) ----
    {input_ids [dtype=int64, shape=(1, 1)],
     attention_mask [dtype=int64, shape=(1, 1)]}
[I] trt-runner-N0-03/31/25-04:14:34    
    ---- Inference Output(s) ----
    {logits [dtype=float32, shape=(1, 2)]}
[I] trt-runner-N0-03/31/25-04:14:34     | Completed 1 iteration(s) in 16.65 ms | Average inference time: 16.65 ms.
[I] onnxrt-runner-N0-03/31/25-04:14:34  | Activating and starting inference
[I] Creating ONNX-Runtime Inference Session with providers: ['CUDAExecutionProvider']
2025-03-31 04:14:53.591558834 [W:onnxruntime:, transformer_memcpy.cc:83 ApplyImpl] 28 Memcpy nodes are added to the graph main_graph for CUDAExecutionProvider. It might have negative impact on performance (including unable to run CUDA graph). Set session_options.log_severity_level=1 to see the detail logs before this message.
2025-03-31 04:14:53.603006616 [W:onnxruntime:, session_state.cc:1263 VerifyEachNodeIsAssignedToAnEp] Some nodes were not assigned to the preferred execution providers which may or may not have an negative impact on performance. e.g. ORT explicitly assigns shape related ops to CPU to improve perf.
2025-03-31 04:14:53.603031695 [W:onnxruntime:, session_state.cc:1265 VerifyEachNodeIsAssignedToAnEp] Rerunning with verbose output on a non-minimal build will show node assignments.
[I] onnxrt-runner-N0-03/31/25-04:14:34 
    ---- Inference Input(s) ----
    {input_ids [dtype=int64, shape=(1, 1)],
     attention_mask [dtype=int64, shape=(1, 1)]}
[I] onnxrt-runner-N0-03/31/25-04:14:34 
    ---- Inference Output(s) ----
    {logits [dtype=float32, shape=(1, 2)]}
[I] onnxrt-runner-N0-03/31/25-04:14:34  | Completed 1 iteration(s) in 68.37 ms | Average inference time: 68.37 ms.
[I] Accuracy Comparison | trt-runner-N0-03/31/25-04:14:34 vs. onnxrt-runner-N0-03/31/25-04:14:34
[I]     Comparing Output: 'logits' (dtype=float32, shape=(1, 2)) with 'logits' (dtype=float32, shape=(1, 2))
[I]         Tolerance: [abs=1e-05, rel=1e-05] | Checking elemwise error
[I]         trt-runner-N0-03/31/25-04:14:34: logits | Stats: mean=-0.066662, std-dev=1.567, var=2.4556, median=-0.066662, min=-1.6337 at (0, 0), max=1.5004 at (0, 1), avg-magnitude=1.567, p90=1.187, p95=1.3437, p99=1.469
[I]         onnxrt-runner-N0-03/31/25-04:14:34: logits | Stats: mean=-0.066663, std-dev=1.567, var=2.4556, median=-0.066663, min=-1.6337 at (0, 0), max=1.5004 at (0, 1), avg-magnitude=1.567, p90=1.187, p95=1.3437, p99=1.469
[I]         Error Metrics: logits
[I]             Minimum Required Tolerance: elemwise error | [abs=2.0266e-06] OR [rel=1.3507e-06] (requirements may be lower if both abs/rel tolerances are set)
[I]             Absolute Difference | Stats: mean=1.5497e-06, std-dev=4.7684e-07, var=2.2737e-13, median=1.5497e-06, min=1.0729e-06 at (0, 0), max=2.0266e-06 at (0, 1), avg-magnitude=1.5497e-06, p90=1.9312e-06, p95=1.9789e-06, p99=2.017e-06
[I]             Relative Difference | Stats: mean=1.0037e-06, std-dev=3.4699e-07, var=1.204e-13, median=1.0037e-06, min=6.5672e-07 at (0, 0), max=1.3507e-06 at (0, 1), avg-magnitude=1.0037e-06, p90=1.2813e-06, p95=1.316e-06, p99=1.3438e-06
[I]         PASSED | Output: 'logits' | Difference is within tolerance (rel=1e-05, abs=1e-05)
[I]     PASSED | All outputs matched | Outputs: ['logits']
[I] Accuracy Summary | trt-runner-N0-03/31/25-04:14:34 vs. onnxrt-runner-N0-03/31/25-04:14:34 | Passed: 1/1 iterations | Pass Rate: 100.0%
[I] PASSED | Runtime: 19.832s | Command: /usr/local/bin/polygraphy run model.onnx --trt --onnxrt --execution-providers=cuda
pip list
Package                  Version
------------------------ -------------
annotated-types          0.7.0
anyio                    4.8.0
astunparse               1.6.3
blinker                  1.7.0
certifi                  2025.1.31
colored                  2.3.0
coloredlogs              15.0.1
cryptography             41.0.7
cupy-cuda12x             13.3.0
dbus-python              1.3.2
distlib                  0.3.9
distro                   1.9.0
dm-tree                  0.1.8
fastapi                  0.115.6
fastrlock                0.8.3
filelock                 3.17.0
flatbuffers              25.2.10
gast                     0.6.0
h11                      0.14.0
httpcore                 1.0.7
httplib2                 0.20.4
httpx                    0.27.2
humanfriendly            10.0
idna                     3.10
iniconfig                2.0.0
jiter                    0.8.2
launchpadlib             1.11.0
lazr.restfulclient       0.14.6
lazr.uri                 1.0.6
mpmath                   1.3.0
numpy                    1.26.4
nvidia-cuda-runtime-cu12 12.8.90
nvidia-dali-cuda120      1.44.0
nvidia-nvimgcodec-cu12   0.3.0.5
oauthlib                 3.2.2
onnx                     1.17.0
onnxruntime-gpu          1.21.0
openai                   1.60.0
packaging                24.2
pip                      24.0
platformdirs             4.3.6
pluggy                   1.5.0
polygraphy               0.49.20
protobuf                 6.30.2
pydantic                 2.10.6
pydantic_core            2.27.2
PyGObject                3.48.2
PyJWT                    2.7.0
pyparsing                3.1.1
pytest                   8.3.4
python-apt               2.7.7+ubuntu4
setuptools               68.1.2
six                      1.16.0
sniffio                  1.3.1
starlette                0.41.3
sympy                    1.13.3
tensorrt                 10.9.0.34
tensorrt-cu12            10.9.0.34
tensorrt_cu12_bindings   10.9.0.34
tensorrt_cu12_libs       10.9.0.34
tqdm                     4.67.1
tritonfrontend           2.55.0
tritonserver             0.0.0
typing_extensions        4.12.2
virtualenv               20.29.2
wadllib                  1.3.6
wheel                    0.45.1

cc @lix19937

geraldstanje changed the title from "Tensorrt optimization doesnt work" to "Tensorrt optimization shows unexpected results" on Mar 31, 2025
@lix19937

lix19937 commented Apr 2, 2025

Obviously: in Case 3 the precision is FP32+INT8, and from the log most layers are still running in FP32, so its latency is higher than in Case 2.

Case 4 is slightly better than Case 2.

Your workspace size can be left at the default or made larger.
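
To confirm which layers fell back to FP32, one option is to rebuild the INT8 engine with detailed layer information exported. This is only a sketch reusing the flags from the INT8 script above; it has not been verified on this model:

/usr/src/tensorrt/bin/trtexec \
    --onnx=${ONNX_MODEL_NAME} \
    --saveEngine=${TRT_MODEL_NAME} \
    --int8 \
    --minShapes=input_ids:1x1,attention_mask:1x1 \
    --optShapes=input_ids:1x100,attention_mask:1x100 \
    --maxShapes=input_ids:1x4000,attention_mask:1x4000 \
    --memPoolSize=workspace:${WORKSPACE} \
    --profilingVerbosity=detailed \
    --dumpLayerInfo \
    --exportLayerInfo=layer_info.json \
    --verbose \
| tee conversion_int8.txt

The exported layer info (and the verbose build log) should show per-layer details, making it easier to see how many layers actually run in INT8 versus FP32/FP16.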

@geraldstanje
Author

geraldstanje commented Apr 2, 2025

@lix19937 Regarding case 3: how can I make it INT8-only? Or are there any optimizations that would make it faster than case 2?

"Your workspace size can be left at the default or made larger." - how is larger better? I see the generated engine is only around 700 MB, and my workspace size is 14 GB....

Here is my model info:

polygraphy inspect model model.onnx
[I] Loading model: /workspace/model.onnx
[I] ==== ONNX Model ====
    Name: main_graph | ONNX Opset: 14
    ---- 2 Graph Input(s) ----
    {input_ids [dtype=int64, shape=('batch_size', 'sequence_length')],
     attention_mask [dtype=int64, shape=('batch_size', 'sequence_length')]}
    ---- 1 Graph Output(s) ----
    {logits [dtype=float32, shape=('batch_size', 2)]}
    ---- 174 Initializer(s) ----

@lix19937

lix19937 commented Apr 3, 2025

In case 3, you can use trtexec --best --onnx=<model.onnx> to build the plan.
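
For reference, dropped into the conversion script above, that would look roughly like this (a sketch using the same shapes and workspace as the earlier runs; --best enables all precisions and lets the builder pick the fastest tactic per layer):

/usr/src/tensorrt/bin/trtexec \
    --onnx=${ONNX_MODEL_NAME} \
    --saveEngine=${TRT_MODEL_NAME} \
    --best \
    --minShapes=input_ids:1x1,attention_mask:1x1 \
    --optShapes=input_ids:1x100,attention_mask:1x100 \
    --maxShapes=input_ids:1x4000,attention_mask:1x4000 \
    --memPoolSize=workspace:${WORKSPACE} \
    --verbose \
| tee conversion.txt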

@geraldstanje
Author

geraldstanje commented Apr 3, 2025

@lix19937 I ran what you suggested - it looks like it gets the same latency as case 2. So it cannot be improved further with quantization?

Here are the logs:
trt_best_logs.txt.zip

Also, the log shows "Loaded engine size: 760 MiB"...

@geraldstanje
Author

geraldstanje commented Apr 8, 2025

@lix19937 any ideas about the above?

Also, what could be the reason that inference with the ONNX model is faster than TensorRT with all default settings (except minShapes and maxShapes)?
Could it be because of dynamic shapes?

    --minShapes=input_ids:1x1,attention_mask:1x1 \
    --maxShapes=input_ids:1x4000,attention_mask:1x4000 \

If minShapes and maxShapes are not set, will they default to 1x1?
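
For what it's worth, when benchmarking the loaded engine the inference shape can be pinned explicitly, so the measured latency corresponds to a known sequence length rather than whatever shape trtexec picks by default. A sketch based on the run commands above:

/usr/src/tensorrt/bin/trtexec \
    --loadEngine=${TRT_MODEL_NAME} \
    --shapes=input_ids:1x100,attention_mask:1x100 \
    --verbose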

But the ONNX model doesn't define a fixed shape:

polygraphy inspect model model.onnx      
[I] Loading model: /workspace/model.onnx
[I] ==== ONNX Model ====
    Name: main_graph | ONNX Opset: 14
    ---- 2 Graph Input(s) ----
    {input_ids [dtype=int64, shape=('batch_size', 'sequence_length')],
     attention_mask [dtype=int64, shape=('batch_size', 'sequence_length')]}

    ---- 1 Graph Output(s) ----
    {logits [dtype=float32, shape=('batch_size', 2)]}

    ---- 174 Initializer(s) ----

    ---- 4152 Node(s) ----
