The provided Python scripts show how to deploy and run the quantized Llama 3.1 8B Instruct FP8 model from NVIDIA's Hugging Face model hub on TensorRT-LLM, vLLM, and SGLang respectively.
Before running the scripts, please make sure you have set up the environment properly:
- Make sure you are authenticated with a Hugging Face account to interact with the Hub, e.g., use `huggingface-cli login` to save the access token on your machine (a programmatic alternative is sketched after this list).
- Install TensorRT-LLM properly by following the instructions here.
- Install vLLM properly by following the instructions here.
- Install SGLang properly by following the instructions here.
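If you prefer to authenticate from Python rather than the CLI, the `huggingface_hub` library offers an equivalent `login` helper. A minimal sketch; reading the token from an `HF_TOKEN` environment variable is just one option assumed here:

```python
import os
from huggingface_hub import login

# Authenticate with the Hugging Face Hub programmatically.
# Reading the token from HF_TOKEN is an assumption for this sketch;
# calling login() with no arguments prompts for the token instead.
login(token=os.environ.get("HF_TOKEN"))
```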
Then, to deploy and run on TensorRT-LLM:

```sh
python run_llama_fp8_trtllm.py
```
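For orientation, a script like `run_llama_fp8_trtllm.py` can be approximated with TensorRT-LLM's high-level `LLM` API. This is a minimal sketch, not the shipped script; the checkpoint name `nvidia/Llama-3.1-8B-Instruct-FP8` and the sampling settings are assumptions:

```python
# Minimal sketch of offline FP8 inference with TensorRT-LLM's LLM API.
# The checkpoint name is an assumption; the provided script may differ.
from tensorrt_llm import LLM, SamplingParams

def main():
    llm = LLM(model="nvidia/Llama-3.1-8B-Instruct-FP8")
    sampling = SamplingParams(temperature=0.8, top_p=0.95, max_tokens=64)
    # generate() returns one result per prompt; print the first completion.
    for output in llm.generate(["What is FP8 quantization?"], sampling):
        print(output.outputs[0].text)

if __name__ == "__main__":
    main()
```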
To deploy and run on vLLM:

```sh
python run_llama_fp8_vllm.py
```
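Similarly, a vLLM equivalent of `run_llama_fp8_vllm.py` would look roughly like the sketch below; the checkpoint name and the `quantization="modelopt"` argument are assumptions based on vLLM's support for ModelOpt-produced FP8 checkpoints:

```python
# Minimal sketch of offline FP8 inference with vLLM.
# quantization="modelopt" tells vLLM to load ModelOpt-produced FP8 scales;
# recent vLLM versions can also detect this from the checkpoint config.
from vllm import LLM, SamplingParams

llm = LLM(model="nvidia/Llama-3.1-8B-Instruct-FP8", quantization="modelopt")
sampling = SamplingParams(temperature=0.8, top_p=0.95, max_tokens=64)
for output in llm.generate(["What is FP8 quantization?"], sampling):
    print(output.outputs[0].text)
```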
To deploy and run on SGLang:

```sh
python run_llama_fp8_sglang.py
```
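An SGLang analogue of `run_llama_fp8_sglang.py`, using SGLang's offline `Engine`, is sketched below under the same checkpoint-name assumption:

```python
# Minimal sketch of offline FP8 inference with SGLang's Engine API.
# The checkpoint name is an assumption; the provided script may differ.
import sglang as sgl

llm = sgl.Engine(model_path="nvidia/Llama-3.1-8B-Instruct-FP8")
outputs = llm.generate(
    ["What is FP8 quantization?"],
    {"temperature": 0.8, "max_new_tokens": 64},  # sampling parameters as a dict
)
for out in outputs:
    print(out["text"])
llm.shutdown()  # release the engine's GPU resources
```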
If you want to run post-training quantization with TensorRT Model Optimizer on models of your choice, check here.
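For orientation, the core of a ModelOpt FP8 post-training quantization run looks roughly like the sketch below. The base checkpoint and the two-sample calibration set are placeholders for illustration; real calibration uses a few hundred representative samples:

```python
# Minimal sketch of FP8 post-training quantization with TensorRT Model
# Optimizer (nvidia-modelopt). Checkpoint and calibration data are
# placeholders; follow the Model Optimizer docs for the full recipe.
import torch
import modelopt.torch.quantization as mtq
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_ID = "meta-llama/Llama-3.1-8B-Instruct"  # assumed base checkpoint
tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
model = AutoModelForCausalLM.from_pretrained(
    MODEL_ID, torch_dtype=torch.bfloat16
).cuda()

# A toy calibration set; real PTQ uses a few hundred representative samples.
calib_texts = ["The capital of France is", "FP8 quantization reduces memory"]

def forward_loop(model):
    # ModelOpt calls this to collect activation statistics for FP8 scales.
    for text in calib_texts:
        ids = tokenizer(text, return_tensors="pt").input_ids.cuda()
        model(ids)

# Apply FP8 post-training quantization in place.
model = mtq.quantize(model, mtq.FP8_DEFAULT_CFG, forward_loop)
```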