
Support for W4A16 and W4A8 Quantization in TensorRT Model Optimizer #189


Open
david-PHR opened this issue Apr 30, 2025 · 0 comments
Labels: feature request (New feature or request)

Comments

@david-PHR

Hello NVIDIA team,

I noticed that the TensorRT Model Optimizer currently supports W4A16 and W4A8 quantization configurations, as detailed in your quantization configuration documentation (nvidia.github.io).
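
For reference, this is roughly how such a configuration is applied with Model Optimizer today. It is a minimal sketch assuming the `modelopt.torch.quantization` API with its `INT4_AWQ_CFG` (W4A16) and `W4A8_AWQ_BETA_CFG` presets; the toy model and calibration loop below are placeholders, not from the documentation.

```python
# Minimal sketch: quantize a toy model with a W4A16 (INT4 AWQ) preset.
# Assumptions: nvidia-modelopt installed; the toy model and random calibration
# data are placeholders chosen only to make the example self-contained.
import torch
import modelopt.torch.quantization as mtq

model = torch.nn.Sequential(torch.nn.Linear(128, 128), torch.nn.ReLU())

def forward_loop(m):
    # Run a few calibration batches so activation ranges / AWQ scales can be computed.
    for _ in range(8):
        m(torch.randn(4, 128))

# W4A16: 4-bit AWQ weights, high-precision activations.
model = mtq.quantize(model, mtq.INT4_AWQ_CFG, forward_loop=forward_loop)

# For W4A8, the W4A8_AWQ_BETA_CFG preset would be used instead.
```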

However, according to the Best Practices for Choosing Quantization Methods, these configurations are currently deployable only via TensorRT-LLM (nvidia.github.io).

I would like to ask whether there are plans to extend support for W4A16 and W4A8 quantization to the standard TensorRT backend, beyond TensorRT-LLM.

Such support would be highly beneficial for deploying models in environments where only TensorRT is available, not TensorRT-LLM.

Thank you for your continued efforts in optimizing model deployment workflows!

david-PHR added the feature request label on Apr 30, 2025