Description
Hi,
I'm trying to build a TensorRT engine from an ONNX model. I tried a few precision configurations; the inference latencies are below (see the shape-pinning note after the list). Why is 3. slower than 2., and why does 4. not improve on 2.? INT8 is even slower than FP32.
- FP32 (default): 5.2 ms
- FP16: 2.7 ms
- INT8: 5.8 ms
- FP16 + builderOptimizationLevel=5: 2.7 ms
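One note on methodology before the scripts: when trtexec loads an engine built with a dynamic-shape profile and no --shapes are passed, it picks a default shape on its own (the opt shape, as far as I know), so to compare precisions fairly every run should be pinned to the same explicit shape. A minimal sketch (model.plan is a placeholder name; the --warmUp/--iterations values are arbitrary):
# benchmark the loaded engine at a fixed sequence length of 100
/usr/src/tensorrt/bin/trtexec \
  --loadEngine=model.plan \
  --shapes=input_ids:1x100,attention_mask:1x100 \
  --warmUp=500 \
  --iterations=100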
trtexec runs:
- FP32
#!/bin/bash
ONNX_MODEL_NAME=$1 # model.onnx
TRT_MODEL_NAME=$2 # model.plan
WORKSPACE=14000
alias trtexec="/usr/src/tensorrt/bin/trtexec" # note: unused below; the full path is invoked directly
# convert onnx model to trt model
/usr/src/tensorrt/bin/trtexec \
--onnx=${ONNX_MODEL_NAME} \
--saveEngine=${TRT_MODEL_NAME} \
--minShapes=input_ids:1x1,attention_mask:1x1 \
--optShapes=input_ids:1x100,attention_mask:1x100 \
--maxShapes=input_ids:1x4000,attention_mask:1x4000 \
--memPoolSize=workspace:${WORKSPACE} \
--verbose \
| tee conversion.txt
# run generated trt model
/usr/src/tensorrt/bin/trtexec --loadEngine=${TRT_MODEL_NAME} --verbose #--dumpProfile
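Side note: to see where the time goes per layer, the commented-out --dumpProfile can be enabled; combining it with --separateProfileRun (a sketch, same placeholder engine variable) runs profiling in a separate pass so it does not distort the reported end-to-end latency:
/usr/src/tensorrt/bin/trtexec --loadEngine=${TRT_MODEL_NAME} --dumpProfile --separateProfileRun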
- FP16
#!/bin/bash
ONNX_MODEL_NAME=$1 # model.onnx
TRT_MODEL_NAME=$2 # model.plan
WORKSPACE=14000
alias trtexec="/usr/src/tensorrt/bin/trtexec"
# convert onnx model to trt model
/usr/src/tensorrt/bin/trtexec \
--onnx=${ONNX_MODEL_NAME} \
--saveEngine=${TRT_MODEL_NAME} \
--fp16 \
--minShapes=input_ids:1x1,attention_mask:1x1 \
--optShapes=input_ids:1x100,attention_mask:1x100 \
--maxShapes=input_ids:1x4000,attention_mask:1x4000 \
--memPoolSize=workspace:${WORKSPACE} \
--verbose \
| tee conversion.txt
# run generated trt model
/usr/src/tensorrt/bin/trtexec --loadEngine=${TRT_MODEL_NAME} --verbose #--dumpProfile
- INT8
#!/bin/bash
ONNX_MODEL_NAME=$1 # model.onnx
TRT_MODEL_NAME=$2 # model.plan
WORKSPACE=14000
alias trtexec="/usr/src/tensorrt/bin/trtexec"
# convert onnx model to trt model
/usr/src/tensorrt/bin/trtexec \
--onnx=${ONNX_MODEL_NAME} \
--saveEngine=${TRT_MODEL_NAME} \
--int8 \
--minShapes=input_ids:1x1,attention_mask:1x1 \
--optShapes=input_ids:1x100,attention_mask:1x100 \
--maxShapes=input_ids:1x4000,attention_mask:1x4000 \
--memPoolSize=workspace:${WORKSPACE} \
--verbose \
| tee conversion.txt
# run generated trt model
/usr/src/tensorrt/bin/trtexec --loadEngine=${TRT_MODEL_NAME} --verbose #--dumpProfile
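A note on this INT8 build: as far as I understand, trtexec with --int8 and no calibration data does not derive real activation ranges, so the engine is only useful for performance measurement, not accuracy. Also, a common reason INT8 ends up slower than FP16 on transformer models is that layers without an INT8 implementation fall back to higher precision, inserting extra reformat layers. If a calibration cache exists it can be supplied like this (calibration.cache is a hypothetical file name):
/usr/src/tensorrt/bin/trtexec \
  --onnx=${ONNX_MODEL_NAME} \
  --saveEngine=${TRT_MODEL_NAME} \
  --int8 \
  --calib=calibration.cache \
  --minShapes=input_ids:1x1,attention_mask:1x1 \
  --optShapes=input_ids:1x100,attention_mask:1x100 \
  --maxShapes=input_ids:1x4000,attention_mask:1x4000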
- FP16 + builderOptimizationLevel=5
#!/bin/bash
ONNX_MODEL_NAME=$1 # model.onnx
TRT_MODEL_NAME=$2 # model.plan
WORKSPACE=14000
alias trtexec="/usr/src/tensorrt/bin/trtexec"
# convert onnx model to trt model
/usr/src/tensorrt/bin/trtexec \
--onnx=${ONNX_MODEL_NAME} \
--saveEngine=${TRT_MODEL_NAME} \
--fp16 \
--minShapes=input_ids:1x1,attention_mask:1x1 \
--optShapes=input_ids:1x100,attention_mask:1x100 \
--maxShapes=input_ids:1x4000,attention_mask:1x4000 \
--memPoolSize=workspace:${WORKSPACE} \
--builderOptimizationLevel=5 \
--verbose \
| tee conversion.txt
# run generated trt model
/usr/src/tensorrt/bin/trtexec --loadEngine=${TRT_MODEL_NAME} --verbose #--dumpProfile
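To check which layers of the INT8 engine actually run in INT8 and which fall back, the per-layer info can be exported at build time (layers.json is a placeholder output name; these flags exist in recent trtexec versions, to the best of my knowledge):
/usr/src/tensorrt/bin/trtexec \
  --onnx=${ONNX_MODEL_NAME} \
  --saveEngine=${TRT_MODEL_NAME} \
  --int8 \
  --profilingVerbosity=detailed \
  --dumpLayerInfo \
  --exportLayerInfo=layers.json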
Logs:
trt_fp16.txt
trt_fp32.txt
trt_fp16_optimization_5.txt.zip
trt_int8.txt.zip
Environment
Triton Inference Server Version: 25.02
TensorRT Version: 10.8.0.43 (I believe this is the version that ships with Triton Inference Server 25.02; see: https://docs.nvidia.com/deeplearning/triton-inference-server/release-notes/rel-25-02.html)
trtexec: v100800
NVIDIA GPU: NVIDIA A10G (see the nvidia-smi output below)
nvidia-smi
Mon Mar 31 03:23:10 2025
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.230.02 Driver Version: 535.230.02 CUDA Version: 12.2 |
|-----------------------------------------+----------------------+----------------------+
| GPU Name Persistence-M | Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap | Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|=========================================+======================+======================|
| 0 NVIDIA A10G Off | 00000000:00:1E.0 Off | 0 |
| 0% 18C P8 15W / 300W | 0MiB / 23028MiB | 0% Default |
| | | N/A |
+-----------------------------------------+----------------------+----------------------+
+---------------------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=======================================================================================|
| No running processes found |
+---------------------------------------------------------------------------------------+
NVIDIA Driver Version: 535.230.02 (per nvidia-smi above)
CUDA Version: 12.2 (per nvidia-smi above)
CUDNN Version:
Operating System:
Python Version (if applicable):
Tensorflow Version (if applicable):
PyTorch Version (if applicable):
Baremetal or Container (if so, version): Container (Triton Inference Server 25.02)
Relevant Files
Model link:
Steps To Reproduce
Commands or scripts:
Have you tried the latest release?:
Can this model run on other frameworks? For example run ONNX model with ONNXRuntime (polygraphy run <model.onnx> --onnxrt):
polygraphy inspect model model.onnx
[I] Loading model: /workspace/model.onnx
[I] ==== ONNX Model ====
Name: main_graph | ONNX Opset: 14
---- 2 Graph Input(s) ----
{input_ids [dtype=int64, shape=('batch_size', 'sequence_length')],
attention_mask [dtype=int64, shape=('batch_size', 'sequence_length')]}
---- 1 Graph Output(s) ----
{logits [dtype=float32, shape=('batch_size', 2)]}
---- 174 Initializer(s) ----
---- 4152 Node(s) ----
polygraphy run model.onnx --onnxrt
[I] RUNNING | Command: /usr/local/bin/polygraphy run model.onnx --onnxrt
[I] onnxrt-runner-N0-03/31/25-03:53:39 | Activating and starting inference
[I] Creating ONNX-Runtime Inference Session with providers: ['CPUExecutionProvider']
[W] Input tensor: input_ids [shape=BoundedShape(['batch_size', 'sequence_length'], min=None, max=None)] | Will generate data of shape: [1, 1].
If this is incorrect, please provide a custom data loader.
[W] Input tensor: attention_mask [shape=BoundedShape(['batch_size', 'sequence_length'], min=None, max=None)] | Will generate data of shape: [1, 1].
If this is incorrect, please provide a custom data loader.
[I] onnxrt-runner-N0-03/31/25-03:53:39
---- Inference Input(s) ----
{input_ids [dtype=int64, shape=(1, 1)],
attention_mask [dtype=int64, shape=(1, 1)]}
[I] onnxrt-runner-N0-03/31/25-03:53:39
---- Inference Output(s) ----
{logits [dtype=float32, shape=(1, 2)]}
[I] onnxrt-runner-N0-03/31/25-03:53:39 | Completed 1 iteration(s) in 34.52 ms | Average inference time: 34.52 ms.
[I] PASSED | Runtime: 4.336s | Command: /usr/local/bin/polygraphy run model.onnx --onnxrt
polygraphy run model.onnx --trt --onnxrt
[I] RUNNING | Command: /usr/local/bin/polygraphy run model.onnx --trt --onnxrt
[I] TF32 is disabled by default. Turn on TF32 for better performance with minor accuracy differences.
[I] trt-runner-N0-03/31/25-03:52:36 | Activating and starting inference
[W] ModelImporter.cpp:459: Make sure input input_ids has Int64 binding.
[W] ModelImporter.cpp:459: Make sure input attention_mask has Int64 binding.
[W] Input tensor: input_ids (dtype=DataType.INT64, shape=(-1, -1)) | No shapes provided; Will use shape: [1, 1] for min/opt/max in profile.
[W] This will cause the tensor to have a static shape. If this is incorrect, please set the range of shapes for this input tensor.
[W] Input tensor: attention_mask (dtype=DataType.INT64, shape=(-1, -1)) | No shapes provided; Will use shape: [1, 1] for min/opt/max in profile.
[I] Configuring with profiles:[
Profile 0:
{input_ids [min=[1, 1], opt=[1, 1], max=[1, 1]],
attention_mask [min=[1, 1], opt=[1, 1], max=[1, 1]]}
]
[W] profileSharing0806 is on by default in TensorRT 10.0. This flag is deprecated and has no effect.
[I] Building engine with configuration:
Flags | []
Engine Capability | EngineCapability.STANDARD
Memory Pools | [WORKSPACE: 22723.50 MiB, TACTIC_DRAM: 22723.50 MiB, TACTIC_SHARED_MEMORY: 1024.00 MiB]
Tactic Sources | [EDGE_MASK_CONVOLUTIONS, JIT_CONVOLUTIONS]
Profiling Verbosity | ProfilingVerbosity.DETAILED
Preview Features | [PROFILE_SHARING_0806]
[I] Finished engine building in 11.023 seconds
[I] trt-runner-N0-03/31/25-03:52:36
---- Inference Input(s) ----
{input_ids [dtype=int64, shape=(1, 1)],
attention_mask [dtype=int64, shape=(1, 1)]}
[I] trt-runner-N0-03/31/25-03:52:36
---- Inference Output(s) ----
{logits [dtype=float32, shape=(1, 2)]}
[I] trt-runner-N0-03/31/25-03:52:36 | Completed 1 iteration(s) in 16.38 ms | Average inference time: 16.38 ms.
[I] onnxrt-runner-N0-03/31/25-03:52:36 | Activating and starting inference
[I] Creating ONNX-Runtime Inference Session with providers: ['CPUExecutionProvider']
[I] onnxrt-runner-N0-03/31/25-03:52:36
---- Inference Input(s) ----
{input_ids [dtype=int64, shape=(1, 1)],
attention_mask [dtype=int64, shape=(1, 1)]}
[I] onnxrt-runner-N0-03/31/25-03:52:36
---- Inference Output(s) ----
{logits [dtype=float32, shape=(1, 2)]}
[I] onnxrt-runner-N0-03/31/25-03:52:36 | Completed 1 iteration(s) in 31.44 ms | Average inference time: 31.44 ms.
[I] Accuracy Comparison | trt-runner-N0-03/31/25-03:52:36 vs. onnxrt-runner-N0-03/31/25-03:52:36
[I] Comparing Output: 'logits' (dtype=float32, shape=(1, 2)) with 'logits' (dtype=float32, shape=(1, 2))
[I] Tolerance: [abs=1e-05, rel=1e-05] | Checking elemwise error
[I] trt-runner-N0-03/31/25-03:52:36: logits | Stats: mean=-0.066662, std-dev=1.567, var=2.4556, median=-0.066662, min=-1.6337 at (0, 0), max=1.5004 at (0, 1), avg-magnitude=1.567, p90=1.187, p95=1.3437, p99=1.469
[I] onnxrt-runner-N0-03/31/25-03:52:36: logits | Stats: mean=-0.066662, std-dev=1.567, var=2.4556, median=-0.066662, min=-1.6337 at (0, 0), max=1.5004 at (0, 1), avg-magnitude=1.567, p90=1.187, p95=1.3437, p99=1.469
[I] Error Metrics: logits
[I] Minimum Required Tolerance: elemwise error | [abs=3.9339e-06] OR [rel=2.622e-06] (requirements may be lower if both abs/rel tolerances are set)
[I] Absolute Difference | Stats: mean=3.6955e-06, std-dev=2.3842e-07, var=5.6843e-14, median=3.6955e-06, min=3.4571e-06 at (0, 0), max=3.9339e-06 at (0, 1), avg-magnitude=3.6955e-06, p90=3.8862e-06, p95=3.9101e-06, p99=3.9291e-06
[I] Relative Difference | Stats: mean=2.369e-06, std-dev=2.5293e-07, var=6.3972e-14, median=2.369e-06, min=2.1161e-06 at (0, 0), max=2.622e-06 at (0, 1), avg-magnitude=2.369e-06, p90=2.5714e-06, p95=2.5967e-06, p99=2.6169e-06
[I] PASSED | Output: 'logits' | Difference is within tolerance (rel=1e-05, abs=1e-05)
[I] PASSED | All outputs matched | Outputs: ['logits']
[I] Accuracy Summary | trt-runner-N0-03/31/25-03:52:36 vs. onnxrt-runner-N0-03/31/25-03:52:36 | Passed: 1/1 iterations | Pass Rate: 100.0%
[I] PASSED | Runtime: 21.399s | Command: /usr/local/bin/polygraphy run model.onnx --trt --onnxrt
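Note that the warnings above ("No shapes provided; Will use shape: [1, 1]") mean this polygraphy TRT engine is built and timed at a static 1x1 shape, unlike the trtexec engines. To reproduce the trtexec profile, the shape ranges can be passed explicitly; a sketch using polygraphy's shape options (flag names as I know them from polygraphy --help):
polygraphy run model.onnx --trt --onnxrt \
  --trt-min-shapes input_ids:[1,1] attention_mask:[1,1] \
  --trt-opt-shapes input_ids:[1,100] attention_mask:[1,100] \
  --trt-max-shapes input_ids:[1,4000] attention_mask:[1,4000] \
  --input-shapes input_ids:[1,100] attention_mask:[1,100]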
polygraphy run model.onnx --onnxrt --execution-providers=cuda
[I] RUNNING | Command: /usr/local/bin/polygraphy run model.onnx --onnxrt --execution-providers=cuda
[I] onnxrt-runner-N0-03/31/25-04:15:38 | Activating and starting inference
[I] Creating ONNX-Runtime Inference Session with providers: ['CUDAExecutionProvider']
2025-03-31 04:15:40.686638615 [W:onnxruntime:, transformer_memcpy.cc:83 ApplyImpl] 28 Memcpy nodes are added to the graph main_graph for CUDAExecutionProvider. It might have negative impact on performance (including unable to run CUDA graph). Set session_options.log_severity_level=1 to see the detail logs before this message.
2025-03-31 04:15:40.697494382 [W:onnxruntime:, session_state.cc:1263 VerifyEachNodeIsAssignedToAnEp] Some nodes were not assigned to the preferred execution providers which may or may not have an negative impact on performance. e.g. ORT explicitly assigns shape related ops to CPU to improve perf.
2025-03-31 04:15:40.697516462 [W:onnxruntime:, session_state.cc:1265 VerifyEachNodeIsAssignedToAnEp] Rerunning with verbose output on a non-minimal build will show node assignments.
[W] Input tensor: input_ids [shape=BoundedShape(['batch_size', 'sequence_length'], min=None, max=None)] | Will generate data of shape: [1, 1].
If this is incorrect, please provide a custom data loader.
[W] Input tensor: attention_mask [shape=BoundedShape(['batch_size', 'sequence_length'], min=None, max=None)] | Will generate data of shape: [1, 1].
If this is incorrect, please provide a custom data loader.
[I] onnxrt-runner-N0-03/31/25-04:15:38
---- Inference Input(s) ----
{input_ids [dtype=int64, shape=(1, 1)],
attention_mask [dtype=int64, shape=(1, 1)]}
[I] onnxrt-runner-N0-03/31/25-04:15:38
---- Inference Output(s) ----
{logits [dtype=float32, shape=(1, 2)]}
[I] onnxrt-runner-N0-03/31/25-04:15:38 | Completed 1 iteration(s) in 145.6 ms | Average inference time: 145.6 ms.
[I] PASSED | Runtime: 2.972s | Command: /usr/local/bin/polygraphy run model.onnx --onnxrt --execution-providers=cuda
polygraphy run model.onnx --trt --onnxrt --execution-providers=cuda
[I] RUNNING | Command: /usr/local/bin/polygraphy run model.onnx --trt --onnxrt --execution-providers=cuda
[I] TF32 is disabled by default. Turn on TF32 for better performance with minor accuracy differences.
[I] trt-runner-N0-03/31/25-04:14:34 | Activating and starting inference
[W] ModelImporter.cpp:459: Make sure input input_ids has Int64 binding.
[W] ModelImporter.cpp:459: Make sure input attention_mask has Int64 binding.
[W] Input tensor: input_ids (dtype=DataType.INT64, shape=(-1, -1)) | No shapes provided; Will use shape: [1, 1] for min/opt/max in profile.
[W] This will cause the tensor to have a static shape. If this is incorrect, please set the range of shapes for this input tensor.
[W] Input tensor: attention_mask (dtype=DataType.INT64, shape=(-1, -1)) | No shapes provided; Will use shape: [1, 1] for min/opt/max in profile.
[I] Configuring with profiles:[
Profile 0:
{input_ids [min=[1, 1], opt=[1, 1], max=[1, 1]],
attention_mask [min=[1, 1], opt=[1, 1], max=[1, 1]]}
]
[W] profileSharing0806 is on by default in TensorRT 10.0. This flag is deprecated and has no effect.
[I] Building engine with configuration:
Flags | []
Engine Capability | EngineCapability.STANDARD
Memory Pools | [WORKSPACE: 22723.50 MiB, TACTIC_DRAM: 22723.50 MiB, TACTIC_SHARED_MEMORY: 1024.00 MiB]
Tactic Sources | [EDGE_MASK_CONVOLUTIONS, JIT_CONVOLUTIONS]
Profiling Verbosity | ProfilingVerbosity.DETAILED
Preview Features | [PROFILE_SHARING_0806]
[I] Finished engine building in 10.985 seconds
[I] trt-runner-N0-03/31/25-04:14:34
---- Inference Input(s) ----
{input_ids [dtype=int64, shape=(1, 1)],
attention_mask [dtype=int64, shape=(1, 1)]}
[I] trt-runner-N0-03/31/25-04:14:34
---- Inference Output(s) ----
{logits [dtype=float32, shape=(1, 2)]}
[I] trt-runner-N0-03/31/25-04:14:34 | Completed 1 iteration(s) in 16.65 ms | Average inference time: 16.65 ms.
[I] onnxrt-runner-N0-03/31/25-04:14:34 | Activating and starting inference
[I] Creating ONNX-Runtime Inference Session with providers: ['CUDAExecutionProvider']
2025-03-31 04:14:53.591558834 [W:onnxruntime:, transformer_memcpy.cc:83 ApplyImpl] 28 Memcpy nodes are added to the graph main_graph for CUDAExecutionProvider. It might have negative impact on performance (including unable to run CUDA graph). Set session_options.log_severity_level=1 to see the detail logs before this message.
2025-03-31 04:14:53.603006616 [W:onnxruntime:, session_state.cc:1263 VerifyEachNodeIsAssignedToAnEp] Some nodes were not assigned to the preferred execution providers which may or may not have an negative impact on performance. e.g. ORT explicitly assigns shape related ops to CPU to improve perf.
2025-03-31 04:14:53.603031695 [W:onnxruntime:, session_state.cc:1265 VerifyEachNodeIsAssignedToAnEp] Rerunning with verbose output on a non-minimal build will show node assignments.
[I] onnxrt-runner-N0-03/31/25-04:14:34
---- Inference Input(s) ----
{input_ids [dtype=int64, shape=(1, 1)],
attention_mask [dtype=int64, shape=(1, 1)]}
[I] onnxrt-runner-N0-03/31/25-04:14:34
---- Inference Output(s) ----
{logits [dtype=float32, shape=(1, 2)]}
[I] onnxrt-runner-N0-03/31/25-04:14:34 | Completed 1 iteration(s) in 68.37 ms | Average inference time: 68.37 ms.
[I] Accuracy Comparison | trt-runner-N0-03/31/25-04:14:34 vs. onnxrt-runner-N0-03/31/25-04:14:34
[I] Comparing Output: 'logits' (dtype=float32, shape=(1, 2)) with 'logits' (dtype=float32, shape=(1, 2))
[I] Tolerance: [abs=1e-05, rel=1e-05] | Checking elemwise error
[I] trt-runner-N0-03/31/25-04:14:34: logits | Stats: mean=-0.066662, std-dev=1.567, var=2.4556, median=-0.066662, min=-1.6337 at (0, 0), max=1.5004 at (0, 1), avg-magnitude=1.567, p90=1.187, p95=1.3437, p99=1.469
[I] onnxrt-runner-N0-03/31/25-04:14:34: logits | Stats: mean=-0.066663, std-dev=1.567, var=2.4556, median=-0.066663, min=-1.6337 at (0, 0), max=1.5004 at (0, 1), avg-magnitude=1.567, p90=1.187, p95=1.3437, p99=1.469
[I] Error Metrics: logits
[I] Minimum Required Tolerance: elemwise error | [abs=2.0266e-06] OR [rel=1.3507e-06] (requirements may be lower if both abs/rel tolerances are set)
[I] Absolute Difference | Stats: mean=1.5497e-06, std-dev=4.7684e-07, var=2.2737e-13, median=1.5497e-06, min=1.0729e-06 at (0, 0), max=2.0266e-06 at (0, 1), avg-magnitude=1.5497e-06, p90=1.9312e-06, p95=1.9789e-06, p99=2.017e-06
[I] Relative Difference | Stats: mean=1.0037e-06, std-dev=3.4699e-07, var=1.204e-13, median=1.0037e-06, min=6.5672e-07 at (0, 0), max=1.3507e-06 at (0, 1), avg-magnitude=1.0037e-06, p90=1.2813e-06, p95=1.316e-06, p99=1.3438e-06
[I] PASSED | Output: 'logits' | Difference is within tolerance (rel=1e-05, abs=1e-05)
[I] PASSED | All outputs matched | Outputs: ['logits']
[I] Accuracy Summary | trt-runner-N0-03/31/25-04:14:34 vs. onnxrt-runner-N0-03/31/25-04:14:34 | Passed: 1/1 iterations | Pass Rate: 100.0%
[I] PASSED | Runtime: 19.832s | Command: /usr/local/bin/polygraphy run model.onnx --trt --onnxrt --execution-providers=cuda
pip list
Package Version
------------------------ -------------
annotated-types 0.7.0
anyio 4.8.0
astunparse 1.6.3
blinker 1.7.0
certifi 2025.1.31
colored 2.3.0
coloredlogs 15.0.1
cryptography 41.0.7
cupy-cuda12x 13.3.0
dbus-python 1.3.2
distlib 0.3.9
distro 1.9.0
dm-tree 0.1.8
fastapi 0.115.6
fastrlock 0.8.3
filelock 3.17.0
flatbuffers 25.2.10
gast 0.6.0
h11 0.14.0
httpcore 1.0.7
httplib2 0.20.4
httpx 0.27.2
humanfriendly 10.0
idna 3.10
iniconfig 2.0.0
jiter 0.8.2
launchpadlib 1.11.0
lazr.restfulclient 0.14.6
lazr.uri 1.0.6
mpmath 1.3.0
numpy 1.26.4
nvidia-cuda-runtime-cu12 12.8.90
nvidia-dali-cuda120 1.44.0
nvidia-nvimgcodec-cu12 0.3.0.5
oauthlib 3.2.2
onnx 1.17.0
onnxruntime-gpu 1.21.0
openai 1.60.0
packaging 24.2
pip 24.0
platformdirs 4.3.6
pluggy 1.5.0
polygraphy 0.49.20
protobuf 6.30.2
pydantic 2.10.6
pydantic_core 2.27.2
PyGObject 3.48.2
PyJWT 2.7.0
pyparsing 3.1.1
pytest 8.3.4
python-apt 2.7.7+ubuntu4
setuptools 68.1.2
six 1.16.0
sniffio 1.3.1
starlette 0.41.3
sympy 1.13.3
tensorrt 10.9.0.34
tensorrt-cu12 10.9.0.34
tensorrt_cu12_bindings 10.9.0.34
tensorrt_cu12_libs 10.9.0.34
tqdm 4.67.1
tritonfrontend 2.55.0
tritonserver 0.0.0
typing_extensions 4.12.2
virtualenv 20.29.2
wadllib 1.3.6
wheel 0.45.1
cc @lix19937