Deploy quantized models from NVIDIA's Hugging Face model hub with TensorRT-LLM, vLLM, and SGLang

The provided Python scripts show how to deploy and run the quantized Llama 3.1 8B Instruct FP8 model from NVIDIA's Hugging Face model hub on TensorRT-LLM, vLLM, and SGLang, respectively.

Before running the scripts, make sure you have set up the environment properly:

  • Make sure you are authenticated with a Hugging Face account to interact with the Hub, e.g., run huggingface-cli login to save an access token on your machine (see the Python sketch after this list).
  • Install TensorRT-LLM by following the instructions here.
  • Install vLLM by following the instructions here.
  • Install SGLang by following the instructions here.
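
If you prefer to authenticate from Python instead of the CLI, a minimal sketch using huggingface_hub is shown below; the token value is a placeholder you must replace with your own.

from huggingface_hub import login, whoami

# Save an access token locally (equivalent to `huggingface-cli login`).
# "hf_..." is a placeholder; create a real token at https://huggingface.co/settings/tokens.
login(token="hf_...")

# Sanity check: print the account name the token resolves to.
print(whoami()["name"])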

Then, to deploy and run on TensorRT-LLM:

python run_llama_fp8_trtllm.py
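
For orientation, here is a minimal sketch of what such a script can look like with TensorRT-LLM's Python LLM API. The checkpoint name nvidia/Llama-3.1-8B-Instruct-FP8 and the sampling settings are assumptions for illustration, not the actual contents of run_llama_fp8_trtllm.py.

from tensorrt_llm import LLM, SamplingParams

# Load the quantized Hugging Face checkpoint and build an engine in one step.
# The checkpoint name is an assumption; substitute the model you deploy.
llm = LLM(model="nvidia/Llama-3.1-8B-Instruct-FP8")

sampling = SamplingParams(temperature=0.8, max_tokens=64)

# Generate a completion for a single prompt and print it.
outputs = llm.generate(["What is FP8 quantization?"], sampling)
print(outputs[0].outputs[0].text)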

To deploy and run on vLLM:

python run_llama_fp8_vllm.py
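
vLLM's offline API is very similar; a minimal sketch follows, again with the checkpoint name assumed. Passing quantization="modelopt" selects the ModelOpt FP8 backend explicitly, though recent vLLM versions may also infer it from the checkpoint's quantization config.

from vllm import LLM, SamplingParams

# Load the FP8 checkpoint; the model name and the explicit quantization
# backend are assumptions for illustration.
llm = LLM(model="nvidia/Llama-3.1-8B-Instruct-FP8", quantization="modelopt")

sampling = SamplingParams(temperature=0.8, max_tokens=64)

outputs = llm.generate(["What is FP8 quantization?"], sampling)
print(outputs[0].outputs[0].text)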

To deploy and run on SGLang:

python run_llama_fp8_sglang.py
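
SGLang exposes a comparable offline Engine API; the sketch below assumes the same checkpoint and that the modelopt quantization backend is selected explicitly.

import sglang as sgl

# Launch an offline engine on the FP8 checkpoint; the model name and the
# explicit quantization backend are assumptions for illustration.
llm = sgl.Engine(model_path="nvidia/Llama-3.1-8B-Instruct-FP8",
                 quantization="modelopt")

# SGLang takes sampling parameters as a plain dict.
outputs = llm.generate(["What is FP8 quantization?"],
                       {"temperature": 0.8, "max_new_tokens": 64})
print(outputs[0]["text"])

llm.shutdown()  # release GPU resources when done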

If you want to run post-training quantization on models of your choice with TensorRT Model Optimizer, check here; a rough sketch of that workflow follows.
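
The sketch below uses TensorRT Model Optimizer's modelopt.torch.quantization API with its default FP8 recipe; the base model and the tiny calibration set are placeholders, and real recipes calibrate on a representative dataset.

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
import modelopt.torch.quantization as mtq

# Placeholder base model; substitute the model you want to quantize.
name = "meta-llama/Llama-3.1-8B-Instruct"
model = AutoModelForCausalLM.from_pretrained(
    name, torch_dtype=torch.bfloat16, device_map="cuda")
tokenizer = AutoTokenizer.from_pretrained(name)

def forward_loop(model):
    # Calibration pass; these two prompts stand in for a real dataset.
    for prompt in ["Hello, world!", "Explain FP8 quantization briefly."]:
        inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
        with torch.no_grad():
            model(**inputs)

# Apply the default FP8 post-training quantization recipe.
model = mtq.quantize(model, mtq.FP8_DEFAULT_CFG, forward_loop)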