
# BERT and BERT Variants

This document explains how to build the BERT family of models, specifically BERT and RoBERTa, using TensorRT-LLM. It also describes how to run them on a single GPU and on two GPUs.

## Overview

The TensorRT-LLM BERT family implementation can be found in `tensorrt_llm/models/bert/model.py`. The TensorRT-LLM BERT family example code is located in `examples/bert`. There are two main files in that folder:

* `build.py` builds the TensorRT engine(s) needed to run the model.
* `run.py` runs inference on an input text.

## Build and run on a single GPU

TensorRT-LLM converts HuggingFace BERT family models into TensorRT engine(s). To build a TensorRT engine, use:

```bash
python3 build.py [--model <model_name> --dtype <data_type> ...]
```

Supported `model_name` options include `BertModel`, `BertForQuestionAnswering`, `BertForSequenceClassification`, `RobertaModel`, `RobertaForQuestionAnswering`, and `RobertaForSequenceClassification`, with `BertModel` as the default.

Some examples are as follows:

```bash
# Build BertModel
python3 build.py --model BertModel --dtype=float16 --log_level=verbose

# Build RobertaModel
python3 build.py --model RobertaModel --dtype=float16 --log_level=verbose

# Build BertModel with the TensorRT-LLM BERT Attention plugin for enhanced runtime performance
python3 build.py --dtype=float16 --log_level=verbose --use_bert_attention_plugin float16

# Build RobertaForSequenceClassification with half-precision accumulation for attention BMM1 (applied to unfused MHA plugins)
python3 build.py --model RobertaForSequenceClassification --dtype=float16 --log_level=verbose --use_bert_attention_plugin float16 --enable_qk_half_accum
```
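To make the command-line interface above concrete, here is a small Python sketch that mirrors the flags used in these examples with `argparse`. It is a hypothetical stand-in for illustration only, not the real parser in `build.py` (the actual script likely defines more options and different defaults):

```python
import argparse

# Hypothetical mirror of the build.py flags used in the examples above.
# This is NOT the real TensorRT-LLM parser; it only illustrates the CLI shape.
parser = argparse.ArgumentParser(description="Sketch of build.py options")
parser.add_argument("--model", default="BertModel",
                    choices=["BertModel", "BertForQuestionAnswering",
                             "BertForSequenceClassification", "RobertaModel",
                             "RobertaForQuestionAnswering",
                             "RobertaForSequenceClassification"])
parser.add_argument("--dtype", default="float16")
parser.add_argument("--log_level", default="info")
parser.add_argument("--use_bert_attention_plugin", default=None,
                    help="plugin dtype, e.g. float16")
parser.add_argument("--enable_qk_half_accum", action="store_true")

# Parse the flags from the last example command above.
args = parser.parse_args([
    "--model", "RobertaForSequenceClassification",
    "--dtype=float16", "--log_level=verbose",
    "--use_bert_attention_plugin", "float16",
    "--enable_qk_half_accum",
])
print(args.model, args.use_bert_attention_plugin, args.enable_qk_half_accum)
```

Note that `--model` defaults to `BertModel`, which is why the third example above omits it.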

The following command can be used to run the model on a single GPU:

```bash
python3 run.py
```

## Fused MultiHead Attention (FMHA)

You can enable the FMHA kernels for BERT by adding `--enable_context_fmha` to the invocation of `build.py`. Note that it is disabled by default because of possible accuracy issues due to the use of Flash Attention.

If the default FP16 accumulation (`--enable_context_fmha`) does not meet your accuracy requirements, you can enable FP32 accumulation by adding `--enable_context_fmha_fp32_acc` instead. However, expect a performance drop.

Note that `--enable_context_fmha` / `--enable_context_fmha_fp32_acc` must be used together with `--use_bert_attention_plugin float16`.
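The accuracy trade-off between FP16 and FP32 accumulation can be illustrated with a plain dot product, the core operation of the attention Q·Kᵀ BMM. This NumPy sketch is an illustration of the numerical effect only, not TensorRT-LLM code: it compares a sum whose running accumulator is rounded to half precision at every step against one kept in single precision.

```python
import numpy as np

rng = np.random.default_rng(0)
q = rng.standard_normal(4096)
k = rng.standard_normal(4096)

# FP32 accumulation: the running sum is kept in single precision.
acc32 = np.float32(0.0)
for a, b in zip(q, k):
    acc32 = np.float32(acc32 + np.float32(a) * np.float32(b))

# FP16 accumulation: the running sum is rounded to half precision
# after every addition, so rounding error compounds across the reduction.
acc16 = np.float16(0.0)
for a, b in zip(q, k):
    acc16 = np.float16(np.float32(acc16) + np.float32(a) * np.float32(b))

print("fp32 accum:", float(acc32))
print("fp16 accum:", float(acc16))
print("difference:", abs(float(acc32) - float(acc16)))
```

The longer the reduction dimension (head size times sequence length in attention), the more the half-precision accumulator can drift, which is why FP32 accumulation is offered as the higher-accuracy, lower-throughput option.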

## Build and run on two GPUs

The following two commands build TensorRT engines to run BERT on two GPUs: the first builds the engine for the first GPU (rank 0), and the second builds the engine for the second GPU (rank 1). For example, to build `BertForQuestionAnswering` for two GPUs, run:

```bash
python3 build.py --model BertForQuestionAnswering --world_size=2 --rank=0
python3 build.py --model BertForQuestionAnswering --world_size=2 --rank=1
```
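To give intuition for why one engine is built per rank, here is a sketch of how tensor parallelism commonly partitions attention heads across ranks. This is an assumption-laden illustration of the general technique, not TensorRT-LLM internals; the head count is the usual BERT-base value and is an assumption here.

```python
# Illustration only: how tensor parallelism typically splits attention
# heads across ranks. Not TensorRT-LLM internals.
num_heads = 12        # BERT-base head count (assumption for illustration)
world_size = 2        # matches --world_size=2 above
heads_per_rank = num_heads // world_size

for rank in range(world_size):
    start = rank * heads_per_rank
    heads = list(range(start, start + heads_per_rank))
    print(f"rank {rank} -> heads {heads}")
```

Each rank's engine then only holds the weights for its own slice, which is why the `--rank` flag changes what gets built.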

The following command can be used to run inference on two GPUs. It uses MPI with `mpirun`:

```bash
mpirun -n 2 python3 run.py
```
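Under `mpirun`, each of the two `run.py` processes must discover which rank it is so it can load the matching engine. As a hedged sketch of that mechanism: Open MPI exposes the rank through an environment variable (the variable name below is Open MPI-specific; `run.py` itself presumably obtains the rank through MPI bindings rather than this way).

```python
import os

# Sketch: under Open MPI's mpirun, each launched process can read its
# rank from the environment. Outside mpirun this falls back to rank 0.
rank = int(os.environ.get("OMPI_COMM_WORLD_RANK", "0"))
print(f"this process is rank {rank}")
```

With `-n 2`, one process would see rank 0 and the other rank 1, matching the two engines built above.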