
Support for W4A16 and W4A8 Quantization in TensorRT Model Optimizer #189


Open
david-PHR opened this issue Apr 30, 2025 · 0 comments
Labels: feature request (New feature or request)

Comments

@david-PHR

Hello NVIDIA team,

I noticed that the TensorRT Model Optimizer currently supports W4A16 and W4A8 quantization configurations, as detailed in your quantization configuration documentation (nvidia.github.io).
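
For reference, this is roughly how such a configuration is applied with Model Optimizer today. It is a minimal sketch assuming the `modelopt.torch.quantization` API with its `INT4_AWQ_CFG` (W4A16) and `W4A8_AWQ_BETA_CFG` presets; the toy model and calibration loop below are placeholders, not from the documentation.

```python
# Minimal sketch: quantize a toy model with a W4A16 (INT4 AWQ) preset.
# Assumptions: nvidia-modelopt installed; the toy model and random calibration
# data are placeholders chosen only to make the example self-contained.
import torch
import modelopt.torch.quantization as mtq

model = torch.nn.Sequential(torch.nn.Linear(128, 128), torch.nn.ReLU())

def forward_loop(m):
    # Run a few calibration batches so activation ranges / AWQ scales can be computed.
    for _ in range(8):
        m(torch.randn(4, 128))

# W4A16: 4-bit AWQ weights, high-precision activations.
model = mtq.quantize(model, mtq.INT4_AWQ_CFG, forward_loop=forward_loop)

# For W4A8, the W4A8_AWQ_BETA_CFG preset would be used instead.
```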

However, according to the Best Practices for Choosing Quantization Methods, these configurations are currently deployable only via TensorRT-LLM (nvidia.github.io).

I would like to ask whether there are plans to extend support for W4A16 and W4A8 quantization to the standard TensorRT backend, beyond TensorRT-LLM.

Such support would be highly beneficial for deploying models in environments where only TensorRT is available, not TensorRT-LLM.

Thank you for your continued efforts in optimizing model deployment workflows!

david-PHR added the feature request label on Apr 30, 2025