Falcon

This document shows how to build and run a Falcon model in TensorRT-LLM on single GPU, single node multi-GPU, and multi-node multi-GPU.

Overview

The TensorRT-LLM Falcon implementation can be found in tensorrt_llm/models/falcon/model.py. The TensorRT-LLM Falcon example code is located in examples/falcon. There is one main file, convert_checkpoint.py, which converts HF weights to TensorRT-LLM checkpoints.

In addition, the parent folder examples holds two shared files for inference and evaluation, including ../summarize.py, which is used in step 4 below.

Support Matrix

  • FP16
  • BF16
  • FP8
  • FP8 KV CACHE
  • Groupwise quantization (AWQ)
  • Tensor Parallel
  • STRONGLY TYPED

Usage

The next two sections describe how to convert the weights from the HuggingFace (HF) Transformers format to the TensorRT-LLM format.

1. Download weights from HuggingFace Transformers

Install the dependency packages and set up git-lfs.

# Install dependencies
pip install -r requirements.txt

# Set up git-lfs
git lfs install

Four HF checkpoints are covered here. Use one of the following commands to fetch the checkpoint you are interested in. For details on the models, see the HF Falcon guide at https://huggingface.co/docs/transformers/main/en/model_doc/falcon.

# falcon-rw-1b
git clone https://huggingface.co/tiiuae/falcon-rw-1b falcon/rw-1b

# falcon-7b-instruct
git clone https://huggingface.co/tiiuae/falcon-7b-instruct falcon/7b-instruct

# falcon-40b-instruct
git clone https://huggingface.co/tiiuae/falcon-40b-instruct falcon/40b-instruct

# falcon-180b
git clone https://huggingface.co/tiiuae/falcon-180B falcon/180b
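If git-lfs was not set up before cloning, the checkout can contain small LFS pointer stubs instead of the real weight files. The sketch below is one way to detect that before running the conversion. It relies on the fact that a Git LFS pointer file begins with the line `version https://git-lfs.github.com/spec/v1`; the helper names and the list of weight suffixes are illustrative assumptions, not part of TensorRT-LLM.

```python
from pathlib import Path
from typing import List

# First line of a Git LFS pointer file, per the Git LFS pointer format.
LFS_POINTER_PREFIX = b"version https://git-lfs.github.com/spec/v1"

# Suffixes treated as weight files (an assumption; adjust for your checkout).
WEIGHT_SUFFIXES = {".bin", ".safetensors"}


def is_lfs_pointer(path: Path) -> bool:
    """Return True if the file still looks like an unresolved LFS pointer stub."""
    with path.open("rb") as f:
        return f.read(len(LFS_POINTER_PREFIX)) == LFS_POINTER_PREFIX


def find_unfetched_weights(model_dir: str) -> List[Path]:
    """List weight files under model_dir that are still LFS pointer stubs."""
    root = Path(model_dir)
    return sorted(
        p
        for p in root.rglob("*")
        if p.is_file() and p.suffix in WEIGHT_SUFFIXES and is_lfs_pointer(p)
    )
```

Calling `find_unfetched_weights("./falcon/rw-1b")` should return an empty list once the weights are fully downloaded. If it does not, running `git lfs pull` inside the checkout is the usual remedy.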

2. Convert weights from HF Transformers to TensorRT-LLM format

The convert_checkpoint.py script converts HF weights to TensorRT-LLM checkpoints. The number of checkpoint files (in .safetensors format) is the same as the number of GPUs used to run inference.

# falcon-rw-1b: single gpu, dtype float16
python3 convert_checkpoint.py --model_dir ./falcon/rw-1b \
                --dtype float16 \
                --output_dir ./falcon/rw-1b/trt_ckpt/fp16/1-gpu/

# falcon-7b-instruct: single gpu, dtype bfloat16
python3 convert_checkpoint.py --model_dir ./falcon/7b-instruct \
                --dtype bfloat16 \
                --output_dir ./falcon/7b-instruct/trt_ckpt/bf16/1-gpu/

# falcon-40b-instruct: 2-way tensor parallelism
python3 convert_checkpoint.py --model_dir ./falcon/40b-instruct \
                --dtype bfloat16 \
                --output_dir ./falcon/40b-instruct/trt_ckpt/bf16/tp2-pp1/ \
                --tp_size 2

# falcon-40b-instruct: 2-way tensor parallelism and 2-way pipeline parallelism
python3 convert_checkpoint.py --model_dir ./falcon/40b-instruct \
                --dtype bfloat16 \
                --output_dir ./falcon/40b-instruct/trt_ckpt/bf16/tp2-pp2/ \
                --tp_size 2 \
                --pp_size 2

# falcon-180b: 8-way tensor parallelism, loading weights shard-by-shard
python3 convert_checkpoint.py --model_dir ./falcon/180b \
                --dtype bfloat16 \
                --output_dir ./falcon/180b/trt_ckpt/bf16/tp8-pp1/ \
                --tp_size 8 \
                --load_by_shard \
                --workers 8

# falcon-180b: 4-way tensor parallelism and 2-way pipeline parallelism, loading weights shard-by-shard
python3 convert_checkpoint.py --model_dir ./falcon/180b \
                --dtype bfloat16 \
                --output_dir ./falcon/180b/trt_ckpt/bf16/tp4-pp2/ \
                --tp_size 4 \
                --pp_size 2 \
                --load_by_shard \
                --workers 8

Note that in order to use N-way tensor parallelism, the number of attention heads must be a multiple of N. For example, you can't configure 2-way tensor parallelism for falcon-7b or falcon-7b-instruct, because the number of attention heads is 71 (not divisible by 2).
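The divisibility rule above can be checked before starting a conversion. The following is a minimal sketch with hypothetical helper names; it covers only the attention-head constraint stated in the note, and the conversion script may enforce further constraints.

```python
from typing import List


def supports_tensor_parallel(num_attention_heads: int, tp_size: int) -> bool:
    """N-way tensor parallelism needs the head count to be a multiple of N."""
    if tp_size < 1:
        raise ValueError("tp_size must be a positive integer")
    return num_attention_heads % tp_size == 0


def valid_tp_sizes(num_attention_heads: int, max_gpus: int) -> List[int]:
    """All tensor-parallel sizes up to max_gpus that divide the head count."""
    return [
        n
        for n in range(1, max_gpus + 1)
        if supports_tensor_parallel(num_attention_heads, n)
    ]


# falcon-7b and falcon-7b-instruct have 71 attention heads (see the note above).
print(supports_tensor_parallel(71, 2))  # False
print(valid_tp_sizes(71, 8))            # [1]

# A hypothetical model with 64 heads, for comparison.
print(valid_tp_sizes(64, 8))            # [1, 2, 4, 8]
```

Because 71 is prime, the only tensor-parallel size that satisfies the rule on up to 8 GPUs is 1.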

3. Build TensorRT engine(s)

The trtllm-build command builds TensorRT-LLM engines from TensorRT-LLM checkpoints. The number of engine files is also the same as the number of GPUs used to run inference.

Normally, the trtllm-build command only requires a single GPU, but you can enable parallel building by passing the number of GPUs to the --workers argument.

# falcon-rw-1b
trtllm-build --checkpoint_dir ./falcon/rw-1b/trt_ckpt/fp16/1-gpu/ \
                --gemm_plugin float16 \
                --output_dir ./falcon/rw-1b/trt_engines/fp16/1-gpu/

# falcon-7b-instruct
# Enabling --gpt_attention_plugin is necessary for rotary positional embedding (RoPE)
trtllm-build --checkpoint_dir ./falcon/7b-instruct/trt_ckpt/bf16/1-gpu/ \
                --gemm_plugin bfloat16 \
                --remove_input_padding enable \
                --gpt_attention_plugin bfloat16 \
                --output_dir ./falcon/7b-instruct/trt_engines/bf16/1-gpu/

# falcon-40b-instruct: 2-way tensor parallelism
trtllm-build --checkpoint_dir ./falcon/40b-instruct/trt_ckpt/bf16/tp2-pp1/ \
                --gemm_plugin bfloat16 \
                --gpt_attention_plugin bfloat16 \
                --output_dir ./falcon/40b-instruct/trt_engines/bf16/tp2-pp1/

# falcon-40b-instruct: 2-way tensor parallelism and 2-way pipeline parallelism
trtllm-build --checkpoint_dir ./falcon/40b-instruct/trt_ckpt/bf16/tp2-pp2/ \
                --gemm_plugin bfloat16 \
                --gpt_attention_plugin bfloat16 \
                --output_dir ./falcon/40b-instruct/trt_engines/bf16/tp2-pp2/

# falcon-180b: 8-way tensor parallelism
trtllm-build --checkpoint_dir ./falcon/180b/trt_ckpt/bf16/tp8-pp1/ \
                --gemm_plugin bfloat16 \
                --gpt_attention_plugin bfloat16 \
                --output_dir ./falcon/180b/trt_engines/bf16/tp8-pp1/ \
                --workers 8

# falcon-180b: 4-way tensor parallelism and 2-way pipeline parallelism
trtllm-build --checkpoint_dir ./falcon/180b/trt_ckpt/bf16/tp4-pp2/ \
                --gemm_plugin bfloat16 \
                --gpt_attention_plugin bfloat16 \
                --output_dir ./falcon/180b/trt_engines/bf16/tp4-pp2/ \
                --workers 8

If the engines are built successfully, you will see output similar to the following (using falcon-rw-1b as the example):

......
[12/27/2023-03:46:29] [TRT] [I] Engine generation completed in 35.0677 seconds.
[12/27/2023-03:46:29] [TRT] [I] [MemUsageStats] Peak memory usage of TRT CPU/GPU memory allocators: CPU 393 MiB, GPU 2699 MiB
[12/27/2023-03:46:29] [TRT] [I] [MemUsageChange] TensorRT-managed allocation in building engine: CPU +0, GPU +2699, now: CPU 0, GPU 2699 (MiB)
[12/27/2023-03:46:29] [TRT] [I] [MemUsageStats] Peak memory usage during Engine building and serialization: CPU: 10624 MiB
[12/27/2023-03:46:29] [TRT-LLM] [I] Total time of building Unnamed Network 0: 00:00:36
[12/27/2023-03:46:31] [TRT-LLM] [I] Serializing engine to ./falcon/rw-1b/trt_engines/fp16/1-gpu/rank0.engine...
[12/27/2023-03:46:59] [TRT-LLM] [I] Engine serialized. Total time: 00:00:28
[12/27/2023-03:46:59] [TRT-LLM] [I] Total time of building all engines: 00:01:59
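Since one engine file is produced per GPU, a quick sanity check after a build is to confirm that every expected file exists. The sketch below assumes that the engine for rank `i` is named `rank<i>.engine`, extrapolating from the `rank0.engine` entry in the log above, and that the GPU count is `tp_size * pp_size`, as the `mpirun -n` values in this document suggest. The function names are hypothetical.

```python
from pathlib import Path
from typing import List


def world_size(tp_size: int = 1, pp_size: int = 1) -> int:
    """Number of GPUs (and ranks) for a given parallel layout."""
    return tp_size * pp_size


def expected_engine_files(engine_dir: str, tp_size: int = 1, pp_size: int = 1) -> List[Path]:
    """Engine paths expected after a build, assuming a rank<i>.engine naming scheme."""
    return [
        Path(engine_dir) / f"rank{rank}.engine"
        for rank in range(world_size(tp_size, pp_size))
    ]


def missing_engine_files(engine_dir: str, tp_size: int = 1, pp_size: int = 1) -> List[Path]:
    """Expected engine files that are not present on disk."""
    return [p for p in expected_engine_files(engine_dir, tp_size, pp_size) if not p.is_file()]
```

For example, `missing_engine_files("./falcon/40b-instruct/trt_engines/bf16/tp2-pp2/", tp_size=2, pp_size=2)` should return an empty list if all four engines were serialized.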

4. Run summarization task with the TensorRT engine(s)

The ../summarize.py script can run the built engines to summarize the articles from the cnn_dailymail dataset.

# falcon-rw-1b
python ../summarize.py --test_trt_llm \
                       --hf_model_dir ./falcon/rw-1b \
                       --engine_dir ./falcon/rw-1b/trt_engines/fp16/1-gpu/

# falcon-7b-instruct
python ../summarize.py --test_trt_llm \
                       --hf_model_dir ./falcon/7b-instruct \
                       --engine_dir ./falcon/7b-instruct/trt_engines/bf16/1-gpu/

# falcon-40b-instruct: 2-way tensor parallelism
mpirun -n 2 --allow-run-as-root --oversubscribe \
    python ../summarize.py --test_trt_llm \
                           --hf_model_dir ./falcon/40b-instruct \
                           --engine_dir ./falcon/40b-instruct/trt_engines/bf16/tp2-pp1/

# falcon-40b-instruct: 2-way tensor parallelism and 2-way pipeline parallelism
mpirun -n 4 --allow-run-as-root --oversubscribe \
    python ../summarize.py --test_trt_llm \
                           --hf_model_dir ./falcon/40b-instruct \
                           --engine_dir ./falcon/40b-instruct/trt_engines/bf16/tp2-pp2/

# falcon-180b: 8-way tensor parallelism
mpirun -n 8 --allow-run-as-root --oversubscribe \
    python ../summarize.py --test_trt_llm \
                           --hf_model_dir ./falcon/180b \
                           --engine_dir ./falcon/180b/trt_engines/bf16/tp8-pp1/

# falcon-180b: 4-way tensor parallelism and 2-way pipeline parallelism
mpirun -n 8 --allow-run-as-root --oversubscribe \
    python ../summarize.py --test_trt_llm \
                           --hf_model_dir ./falcon/180b \
                           --engine_dir ./falcon/180b/trt_engines/bf16/tp4-pp2/
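In the commands above, the value passed to `mpirun -n` equals the tensor-parallel size multiplied by the pipeline-parallel size. The following sketch assembles the same command line from those two numbers. The function name is hypothetical, and only flags that appear in this document are used.

```python
import shlex
from typing import List


def summarize_command(hf_model_dir: str, engine_dir: str,
                      tp_size: int = 1, pp_size: int = 1) -> List[str]:
    """Build the summarize.py invocation, adding mpirun for multi-GPU engines."""
    cmd = [
        "python", "../summarize.py", "--test_trt_llm",
        "--hf_model_dir", hf_model_dir,
        "--engine_dir", engine_dir,
    ]
    ranks = tp_size * pp_size  # one MPI rank per GPU
    if ranks > 1:
        cmd = ["mpirun", "-n", str(ranks), "--allow-run-as-root", "--oversubscribe"] + cmd
    return cmd


# Print the command for the 2-way TP, 2-way PP falcon-40b-instruct engines.
print(shlex.join(summarize_command(
    "./falcon/40b-instruct",
    "./falcon/40b-instruct/trt_engines/bf16/tp2-pp2/",
    tp_size=2,
    pp_size=2,
)))
```

The returned list can be passed directly to `subprocess.run(cmd, check=True)` if you prefer to launch the job from Python.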

If the engines run successfully, you will see output similar to the following (using falcon-rw-1b as the example):

......
[12/27/2023-03:57:02] [TRT-LLM] [I] TensorRT-LLM (total latency: 5.816917419433594 sec)
[12/27/2023-03:57:02] [TRT-LLM] [I] TensorRT-LLM beam 0 result
[12/27/2023-03:57:02] [TRT-LLM] [I]   rouge1 : 15.061493342516243
[12/27/2023-03:57:02] [TRT-LLM] [I]   rouge2 : 4.495335888974063
[12/27/2023-03:57:02] [TRT-LLM] [I]   rougeL : 11.800002670828547
[12/27/2023-03:57:02] [TRT-LLM] [I]   rougeLsum : 13.458777656925877
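To compare runs (for example, before and after quantization), it can help to pull the ROUGE scores out of the log. The sketch below is a small parser that assumes the log lines have exactly the shape shown above; the function name is hypothetical, and if several beams are reported, later values overwrite earlier ones.

```python
import re
from typing import Dict

# Matches lines such as "[...] [TRT-LLM] [I]   rouge1 : 15.06".
ROUGE_LINE = re.compile(r"\[TRT-LLM\] \[I\]\s+(rouge\w+)\s*:\s*([0-9.]+)")


def parse_rouge(log_text: str) -> Dict[str, float]:
    """Extract ROUGE metric names and values from summarize.py log output."""
    return {name: float(value) for name, value in ROUGE_LINE.findall(log_text)}


sample = """\
[12/27/2023-03:57:02] [TRT-LLM] [I] TensorRT-LLM (total latency: 5.816917419433594 sec)
[12/27/2023-03:57:02] [TRT-LLM] [I] TensorRT-LLM beam 0 result
[12/27/2023-03:57:02] [TRT-LLM] [I]   rouge1 : 15.061493342516243
[12/27/2023-03:57:02] [TRT-LLM] [I]   rouge2 : 4.495335888974063
"""
print(parse_rouge(sample))
```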

FP8 Post-Training Quantization

The examples below use the NVIDIA AMMO (AlgorithMic Model Optimization) toolkit for the model quantization process.

First, make sure the AMMO toolkit is installed (see examples/quantization/README.md).

Next, quantize the HF Falcon weights and export a TensorRT-LLM checkpoint.

# Quantize HF Falcon 180B checkpoint into FP8 and export trtllm checkpoint
python ../quantization/quantize.py --model_dir ./falcon/180b \
                --dtype float16 \
                --qformat fp8 \
                --kv_cache_dtype fp8 \
                --output_dir ./falcon/180b/trt_ckpt/fp8/tp8-pp1 \
                --tp_size 8

# Build trtllm engines from the trtllm checkpoint
trtllm-build --checkpoint_dir ./falcon/180b/trt_ckpt/fp8/tp8-pp1 \
                --gemm_plugin float16 \
                --strongly_typed \
                --output_dir ./falcon/180b/trt_engines/fp8/tp8-pp1 \
                --workers 8

# Run the summarization task
mpirun -n 8 --allow-run-as-root --oversubscribe \
    python ../summarize.py --test_trt_llm \
                --hf_model_dir ./falcon/180b \
                --engine_dir ./falcon/180b/trt_engines/fp8/tp8-pp1

Note that you can enable FP8 context FMHA for further acceleration by passing --use_fp8_context_fmha enable when building the engines.

Groupwise quantization (AWQ)

The examples below use the NVIDIA AMMO (AlgorithMic Model Optimization) toolkit for the model quantization process.

First, make sure the AMMO toolkit is installed (see examples/quantization/README.md).

Next, quantize the HF Falcon weights and export a TensorRT-LLM checkpoint.

# Quantize HF Falcon 180B checkpoint into INT4-AWQ and export trtllm checkpoint
python ../quantization/quantize.py --model_dir ./falcon/180b \
                --dtype float16 \
                --qformat int4_awq \
                --output_dir ./falcon/180b/trt_ckpt/int4_awq/tp2 \
                --tp_size 2

# Build trtllm engines from the trtllm checkpoint
trtllm-build --checkpoint_dir ./falcon/180b/trt_ckpt/int4_awq/tp2 \
                --gemm_plugin float16 \
                --output_dir ./falcon/180b/trt_engines/int4_awq/tp2 \
                --workers 2

# Run the summarization task
mpirun -n 2 --allow-run-as-root --oversubscribe \
    python ../summarize.py --test_trt_llm \
                --hf_model_dir ./falcon/180b \
                --engine_dir ./falcon/180b/trt_engines/int4_awq/tp2
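The INT4-AWQ example runs on 2 GPUs, whereas the BF16 examples for the same model use 8. A back-of-the-envelope estimate of the weight footprint gives a rough sense of why. The sketch below makes several loud assumptions: the parameter count is taken as 180e9 from the model name, weights use 2 bytes (16-bit), 1 byte (FP8), or 0.5 bytes (INT4) each, and everything else (AWQ scales, KV cache, activations, runtime buffers) is ignored. Actual memory use will be higher.

```python
# Approximate storage per weight, in bytes (assumed values, weights only).
BYTES_PER_PARAM = {
    "bfloat16": 2.0,
    "float16": 2.0,
    "fp8": 1.0,
    "int4_awq": 0.5,
}


def weight_gb_per_gpu(num_params: float, fmt: str,
                      tp_size: int = 1, pp_size: int = 1) -> float:
    """Rough weight footprint per GPU in GB (1 GB = 1e9 bytes)."""
    total_bytes = num_params * BYTES_PER_PARAM[fmt]
    return total_bytes / (tp_size * pp_size) / 1e9


FALCON_180B_PARAMS = 180e9  # approximation based on the model name

for fmt, tp in [("bfloat16", 8), ("fp8", 8), ("int4_awq", 2)]:
    gb = weight_gb_per_gpu(FALCON_180B_PARAMS, fmt, tp_size=tp)
    print(f"{fmt:>9} tp={tp}: ~{gb:.1f} GB of weights per GPU")
```

Under these assumptions, INT4 weights split over 2 GPUs take about as much memory per GPU as BF16 weights split over 8.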

W4A16 AWQ with FP8 GEMM (W4A8 AWQ)

On Hopper GPUs, TRT-LLM also supports FP8 GEMM to accelerate linear layers. In this mode, denoted w4a8_awq in both AMMO and TRT-LLM, the weights and activations are converted from W4A16 to FP8 for the GEMM computation.

Make sure your system has a Hopper GPU before trying the commands below.

# Quantize HF Falcon 180B checkpoint into W4A8-AWQ and export trtllm checkpoint
python ../quantization/quantize.py --model_dir ./falcon/180b \
                --dtype float16 \
                --qformat w4a8_awq \
                --output_dir ./falcon/180b/trt_ckpt/w4a8_awq/tp2 \
                --tp_size 2

# Build trtllm engines from the trtllm checkpoint
trtllm-build --checkpoint_dir ./falcon/180b/trt_ckpt/w4a8_awq/tp2 \
                --gemm_plugin float16 \
                --output_dir ./falcon/180b/trt_engines/w4a8_awq/tp2 \
                --workers 2

# Run the summarization task
mpirun -n 2 --allow-run-as-root --oversubscribe \
    python ../summarize.py --test_trt_llm \
                --hf_model_dir ./falcon/180b \
                --engine_dir ./falcon/180b/trt_engines/w4a8_awq/tp2

Troubleshooting

1. The HuggingFace Falcon model may raise an error when the accelerate package is used.

You may see the following error message.

Traceback (most recent call last):
  File "build.py", line 10, in <module>
    from transformers import FalconConfig, FalconForCausalLM
  File "<frozen importlib._bootstrap>", line 1039, in _handle_fromlist
  File "/usr/local/lib/python3.8/dist-packages/transformers/utils/import_utils.py", line 1090, in __getattr__
    value = getattr(module, name)
  File "/usr/local/lib/python3.8/dist-packages/transformers/utils/import_utils.py", line 1089, in __getattr__
    module = self._get_module(self._class_to_module[name])
  File "/usr/local/lib/python3.8/dist-packages/transformers/utils/import_utils.py", line 1101, in _get_module
    raise RuntimeError(
RuntimeError: Failed to import transformers.models.falcon.modeling_falcon because of the following error (look up to see its traceback):

It may be resolved by pinning the typing-extensions package to version 4.5.0.

pip install typing-extensions==4.5.0
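To confirm which version is actually installed in the current environment, the standard-library `importlib.metadata` module can be queried. This is a minimal sketch with hypothetical helper names; the pinned version 4.5.0 comes from the workaround above.

```python
from importlib import metadata
from typing import Optional

REQUIRED_VERSION = "4.5.0"  # the pin suggested in the workaround above


def installed_version(package: str) -> Optional[str]:
    """Return the installed version of a package, or None if it is absent."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


def matches_pin(installed: Optional[str], required: str = REQUIRED_VERSION) -> bool:
    """True only when the installed version equals the pinned version."""
    return installed == required


current = installed_version("typing-extensions")
if matches_pin(current):
    print("typing-extensions is pinned to", REQUIRED_VERSION)
else:
    print("typing-extensions is", current, "- expected", REQUIRED_VERSION)
```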