Whisper

This document shows how to build and run a Whisper model in TensorRT-LLM on a single GPU.

Overview

The TensorRT-LLM Whisper example code is located in examples/whisper. There are three main files in that folder:

build.py builds the TensorRT engine(s) needed to run the Whisper model.
run.py runs inference on a single wav file or on a HuggingFace dataset (LibriSpeech test-clean).
run_faster_whisper.py runs a benchmark comparison against Faster Whisper.

Support Matrix

FP16
INT8 (weight-only quantization)

Usage

The TensorRT-LLM Whisper example code is located in examples/whisper. It takes the Whisper PyTorch weights as input and builds the corresponding TensorRT engines.

Build TensorRT engine(s)

First, prepare the Whisper checkpoint and its supporting assets by downloading them:

wget --directory-prefix=assets https://raw.githubusercontent.com/openai/whisper/main/whisper/assets/multilingual.tiktoken
wget --directory-prefix=assets https://raw.githubusercontent.com/openai/whisper/main/whisper/assets/mel_filters.npz
wget --directory-prefix=assets https://raw.githubusercontent.com/yuekaizhang/Triton-ASR-Client/main/datasets/mini_en/wav/1221-135766-0002.wav
# take large-v3 model as an example
wget --directory-prefix=assets https://openaipublic.azureedge.net/main/whisper/models/e5b1a55b89c1367dacf97e3e19bfd829a01529dbfdeefa8caeb59b3f1b81dadb/large-v3.pt
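
The large-v3.pt URL above contains a 64-character hexadecimal path segment. The openai/whisper package treats that segment as the expected SHA-256 of the file when it downloads models itself. Assuming the same convention applies here, the following is a minimal sketch for checking the downloaded checkpoint before building. The helper names (expected_sha256_from_url, sha256_of_file, checkpoint_is_intact) are hypothetical and are not part of this example's scripts.

```python
import hashlib
from pathlib import Path

CHECKPOINT_URL = (
    "https://openaipublic.azureedge.net/main/whisper/models/"
    "e5b1a55b89c1367dacf97e3e19bfd829a01529dbfdeefa8caeb59b3f1b81dadb/large-v3.pt"
)


def expected_sha256_from_url(url: str) -> str:
    """Return the second-to-last URL segment, assumed to be the file's SHA-256."""
    return url.split("/")[-2]


def sha256_of_file(path, chunk_size: int = 1 << 20) -> str:
    """Hash the file in chunks so a multi-gigabyte checkpoint is never fully in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def checkpoint_is_intact(path, url: str = CHECKPOINT_URL) -> bool:
    """Compare the on-disk hash with the hash embedded in the download URL."""
    return sha256_of_file(path) == expected_sha256_from_url(url)


if __name__ == "__main__":
    ckpt = Path("assets/large-v3.pt")
    if not ckpt.exists():
        print(f"{ckpt} not found")
    elif checkpoint_is_intact(ckpt):
        print("checksum OK")
    else:
        print("checksum mismatch, consider re-downloading")
```

A mismatch usually points to an interrupted download, in which case fetching the file again is the simplest fix.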

TensorRT-LLM Whisper builds the TensorRT engine(s) from the PyTorch checkpoint.

# install requirements first
pip install -r requirements.txt

# Build the large-v3 model using a single GPU with plugins.
python3 build.py --output_dir whisper_large_v3 --use_gpt_attention_plugin --use_gemm_plugin --use_bert_attention_plugin --enable_context_fmha

# Build the large-v3 model using a single GPU with plugins and int8 weight-only quantization.
python3 build.py --output_dir whisper_large_v3_weight_only --use_gpt_attention_plugin --use_gemm_plugin --use_bert_attention_plugin --enable_context_fmha --use_weight_only
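
For readers unfamiliar with INT8 weight-only quantization: the weights are stored as 8-bit integers together with floating-point scales, while activations keep their higher precision. The sketch below illustrates the general idea in pure Python, assuming symmetric scaling with one scale per output channel (row). It is an illustration only and does not reflect the kernels or weight layout that build.py actually produces; the function names are hypothetical.

```python
def quantize_per_channel(weight):
    """Symmetric INT8 quantization with one scale per row.

    weight: list of rows of floats. Returns (rows of ints in [-127, 127], scales).
    """
    q_rows, scales = [], []
    for row in weight:
        max_abs = max(abs(w) for w in row)
        # Map the largest magnitude in the row to 127; avoid dividing by zero.
        scale = max_abs / 127.0 if max_abs > 0 else 1.0
        q_rows.append([max(-127, min(127, round(w / scale))) for w in row])
        scales.append(scale)
    return q_rows, scales


def dequantize(q_rows, scales):
    """Recover approximate float weights from the integers and per-row scales."""
    return [[q * s for q in row] for row, s in zip(q_rows, scales)]


if __name__ == "__main__":
    w = [[0.50, -1.00, 0.25], [0.02, 0.01, -0.04]]
    q, s = quantize_per_channel(w)
    w_hat = dequantize(q, s)
    worst = max(abs(a - b) for ra, rb in zip(w, w_hat) for a, b in zip(ra, rb))
    print("max reconstruction error:", worst)
```

Per-row scaling keeps the rounding error of each element within half of that row's scale, which is why a row of small weights is not swamped by a large value elsewhere in the matrix.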

Run

# choose the engine you built [./whisper_large_v3, ./whisper_large_v3_weight_only]
output_dir=./whisper_large_v3
# decode a single audio file
# If the input file does not have a .wav extension, ffmpeg needs to be installed with the following command:
# apt-get update && apt-get install -y ffmpeg
python3 run.py --name single_wav_test --engine_dir $output_dir --input_file assets/1221-135766-0002.wav
# decode a whole dataset
python3 run.py --engine_dir $output_dir --dataset hf-internal-testing/librispeech_asr_dummy --enable_warmup --name librispeech_dummy_large_v3_plugin
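
As noted in the comments above, inputs without a .wav extension require ffmpeg. A small pre-flight check along the following lines can catch a missing ffmpeg before run.py is launched. The function names are hypothetical, and treating the extension case-insensitively is an assumption; run.py may decide differently.

```python
import shutil
from pathlib import Path


def needs_ffmpeg(input_file: str) -> bool:
    """Assume any file whose extension is not .wav has to be decoded by ffmpeg."""
    return Path(input_file).suffix.lower() != ".wav"


def preflight(input_file: str) -> str:
    """Return "ok", or a hint when ffmpeg is required but not on PATH."""
    if needs_ffmpeg(input_file) and shutil.which("ffmpeg") is None:
        return "ffmpeg not found; install it with: apt-get update && apt-get install -y ffmpeg"
    return "ok"


if __name__ == "__main__":
    print(preflight("assets/1221-135766-0002.wav"))
```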

Distil-Whisper

TensorRT-LLM also supports the Distil-Whisper models. Their parameters and weights must first be converted from the Hugging Face naming format to the OpenAI Whisper naming format. To do so, run the script distil_whisper/convert_from_distil_whisper.py as follows:

# take distil-medium.en as an example
# download the gpt2.tiktoken
wget --directory-prefix=assets https://raw.githubusercontent.com/openai/whisper/main/whisper/assets/gpt2.tiktoken

# download the model weights from Hugging Face and convert them to OpenAI Whisper's PyTorch format
# the converted model is saved to ./assets/ by default
python3 distil_whisper/convert_from_distil_whisper.py --model_name distil-whisper/distil-medium.en --output_name distil-medium.en

# now we can build and run the model like before:
output_dir=distil_whisper_medium_en
python3 build.py --model_name distil-medium.en --output_dir $output_dir --use_gpt_attention_plugin --use_gemm_plugin --use_bert_attention_plugin --enable_context_fmha

python3 run.py --engine_dir $output_dir --dataset hf-internal-testing/librispeech_asr_dummy --name librispeech_dummy_${output_dir} --tokenizer_name gpt2
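
Conceptually, the conversion step renames state-dict keys from the Hugging Face layout to the OpenAI Whisper layout. The sketch below shows the idea with a handful of substitution rules. The rule list is an illustrative subset written for this example and is not the table used by convert_from_distil_whisper.py; it also leaves out the conversion of model hyperparameters. The names RENAME_RULES, rename_key and rename_state_dict are hypothetical.

```python
import re

# Illustrative (pattern, replacement) rules, applied in order.
# Assumed subset: it covers attention and MLP layers only.
RENAME_RULES = [
    (r"^model\.", ""),
    (r"\.layers\.", ".blocks."),
    (r"\.self_attn_layer_norm\.", ".attn_ln."),
    (r"\.encoder_attn_layer_norm\.", ".cross_attn_ln."),
    (r"\.final_layer_norm\.", ".mlp_ln."),
    (r"\.self_attn\.", ".attn."),
    (r"\.encoder_attn\.", ".cross_attn."),
    (r"\.q_proj\.", ".query."),
    (r"\.k_proj\.", ".key."),
    (r"\.v_proj\.", ".value."),
    (r"\.out_proj\.", ".out."),
    (r"\.fc1\.", ".mlp.0."),
    (r"\.fc2\.", ".mlp.2."),
]


def rename_key(hf_key: str) -> str:
    """Apply every rule in order to one Hugging Face style parameter name."""
    key = hf_key
    for pattern, replacement in RENAME_RULES:
        key = re.sub(pattern, replacement, key)
    return key


def rename_state_dict(state_dict: dict) -> dict:
    """Return a new dict with renamed keys; the values (tensors) are untouched."""
    return {rename_key(k): v for k, v in state_dict.items()}


if __name__ == "__main__":
    print(rename_key("model.encoder.layers.0.self_attn.q_proj.weight"))
```

Only the names change in this sketch; the tensor values are passed through as-is.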

Acknowledgment

This implementation of TensorRT-LLM for Whisper has been adapted from the NVIDIA TensorRT-LLM Hackathon 2023 submission of Jinheng Wang, which can be found in the repository Eddie-Wang-Hackathon2023 on GitHub. We extend our gratitude to Jinheng for providing a foundation for the implementation.