diff --git a/Conceptual_Guide/Part_2-improving_resource_utilization/README.md b/Conceptual_Guide/Part_2-improving_resource_utilization/README.md
index 8bfe5b7a..a9485071 100644
--- a/Conceptual_Guide/Part_2-improving_resource_utilization/README.md
+++ b/Conceptual_Guide/Part_2-improving_resource_utilization/README.md
@@ -39,12 +39,12 @@ Part-1 of this series introduced the mechanisms to set up a Triton Inference Ser
 Dynamic batching, in reference to the Triton Inference Server, refers to the functionality which allows the combining of one or more inference requests into a single batch (which has to be created dynamically) to maximize throughput.
 Dynamic batching can be enabled and configured on a per model basis by specifying selections in the model's `config.pbtxt`. Dynamic Batching can be enabled with its default settings by adding the following to the `config.pbtxt` file:
-```
+```text proto
 dynamic_batching { }
 ```
 While Triton batches these incoming requests without any delay, users can choose to allocate a limited delay for the scheduler to collect more inference requests to be used by the dynamic batcher.
-```
+```text proto
 dynamic_batching {
     max_queue_delay_microseconds: 100
 }
 ```
@@ -65,7 +65,7 @@ As observed from the above, the use of Dynamic Batching can lead to improvements
 The Triton Inference Server can spin up multiple instances of the same model, which can process queries in parallel. Triton can spawn instances on the same device (GPU), or a different device on the same node as per the user's specifications. This customizability is especially useful when considering ensembles that have models with different throughputs. Multiple copies of heavier models can be spawned on a separate GPU to allow for more parallel processing. This is enabled via the use of the `instance groups` option in a model's configuration.
-```
+```text proto
 instance_group [
   {
     count: 2
@@ -90,13 +90,13 @@ This section showcases the use of dynamic batching and concurrent model executio
 ### Getting access to the model
 Let's use the `text recognition` model used in part 1. We do need to make some minor changes in the model, namely making the 0th axes of the model have a dynamic shape to enable batching. Step 1, download the Text Recognition model weights. Use the NGC PyTorch container as the environment for the following.
-```
+```bash
 docker run -it --gpus all -v ${PWD}:/scratch nvcr.io/nvidia/pytorch:yy.mm-py3
 cd /scratch
 wget https://www.dropbox.com/sh/j3xmli4di1zuv3s/AABzCC1KGbIRe2wRwa3diWKwa/None-ResNet-None-CTC.pth
 ```
 Export the models as `.onnx` using the file in the `utils` folder. This file is adapted from [Baek et. al. 2019](https://github.com/clovaai/deep-text-recognition-benchmark).
-```
+```python
 import torch
 from utils.model import STRModel
@@ -116,7 +116,7 @@ torch.onnx.export(model, trace_input, "str.onnx", verbose=True, dynamic_axes={'i
 ### Launching the server
 As discussed in `Part 1`, a model repository is a filesystem based repository of models and configuration schema used by the Triton Inference Server (refer to `Part 1` for a more detailed explanation of model repositories). For this example, the model repository structure would need to be set up in the following manner:
-```
+```text
 model_repository
 |
 |-- text_recognition
 |    |-- config.pbtxt
 |    |-- 1
 |         |-- model.onnx
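For orientation while reading the configuration additions below: once a repository shaped like the tree above is being served, a minimal Python client can be used to sanity check the endpoint. This is only an illustrative sketch, assuming the `tritonclient[http]` package, a server on `localhost:8000`, and the `input.1` FP32 input of shape 1x1x32x100 used by the perf_analyzer queries later in this guide; the output tensor name is read from the server rather than hard-coded.

```python
# Illustrative sketch only. Assumes `pip install tritonclient[http] numpy`, a server
# on localhost:8000, and the text_recognition model exported above with the
# "input.1" input (FP32, 1x32x100 per image).
import numpy as np
import tritonclient.http as httpclient

client = httpclient.InferenceServerClient(url="localhost:8000")

# Ask the server for the output tensor name instead of hard-coding it.
metadata = client.get_model_metadata("text_recognition")
output_name = metadata["outputs"][0]["name"]

# Dummy grayscale image; a real client would preprocess an actual image here.
image = np.random.rand(1, 1, 32, 100).astype(np.float32)

infer_input = httpclient.InferInput("input.1", [1, 1, 32, 100], "FP32")
infer_input.set_data_from_numpy(image)

result = client.infer("text_recognition", inputs=[infer_input])
print("output shape:", result.as_numpy(output_name).shape)
```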
@@ -128,7 +128,7 @@ model_repository
 ```
 This repository is a subset from the previous example. The key difference in this set up is the use of `instance_group`(s) and `dynamic_batching` in the model configuration.
 The additions are as follows:
-```
+```text proto
 instance_group [
   {
     count: 2
@@ -142,7 +142,7 @@ With `instance_group` users can primarily tweak two things. First, the number of
 Adding `dynamic_batching {}` will enable the use of dynamic batches. Users can also add `preferred_batch_size` and `max_queue_delay_microseconds` in the body of dynamic batching to manage more efficient batching per their use case. Explore the [model configuration](https://github.com/triton-inference-server/server/blob/main/docs/model_configuration.md#model-configuration) documentation for more information.
 With the model repository set up, the Triton Inference Server can be launched.
-```
+```bash
 docker run --gpus=all -it --shm-size=256m --rm -p8000:8000 -p8001:8001 -p8002:8002 -v ${PWD}:/workspace/ -v ${PWD}/model_repository:/models nvcr.io/nvidia/tritonserver:yy.mm-py3 bash
 tritonserver --model-repository=/models
@@ -151,11 +151,11 @@ tritonserver --model-repository=/models
 ### Measuring Performance
 Having made some improvements to the model's serving capabilities by enabling `dynamic batching` and the use of `multiple model instances`, the next step is to measure the impact of these features. To that end, the Triton Inference Server comes packaged with the [Performance Analyzer](https://github.com/triton-inference-server/perf_analyzer/blob/main/README.md), which is a tool specifically designed to measure performance for Triton Inference Servers. For ease of use, it is recommended that users run this inside the same container used to run client code in Part 1 of this series.
-```
+```bash
 docker run -it --net=host -v ${PWD}:/workspace/ nvcr.io/nvidia/tritonserver:yy.mm-py3-sdk bash
 ```
 On a third terminal, it is advisable to monitor the GPU utilization to see if the deployment is saturating GPU resources.
-```
+```bash
 watch -n0.1 nvidia-smi
 ```
@@ -163,7 +163,7 @@ To measure the performance gain, let's run performance analyzer on the following
 * **No Dynamic Batching, single model instance**: This configuration will be the baseline measurement. To set up the Triton Server in this configuration, do not add `instance_group` or `dynamic_batching` in `config.pbtxt` and make sure to include `--gpus=1` in the `docker run` command to set up the server.
-```
+```bash
 # perf_analyzer -m <model name> -b <batch size> --shape <input name>:<input shape> --concurrency-range <start>:<end>:<step>
 # Query
 perf_analyzer -m text_recognition -b 2 --shape input.1:1,32,100 --concurrency-range 2:16:2 --percentile=95
@@ -198,7 +198,7 @@ Request concurrency: 16
 ```
 * **Just Dynamic Batching**: To set up the Triton Server in this configuration, add `dynamic_batching` in `config.pbtxt`.
-```
+```bash
 # Query
 perf_analyzer -m text_recognition -b 2 --shape input.1:1,32,100 --concurrency-range 2:16:2 --percentile=95
@@ -233,7 +233,7 @@ As each of the requests had a batch size (of 2), while the maximum batch size of
 * **Dynamic Batching with multiple model instances**: To set up the Triton Server in this configuration, add `instance_group` in `config.pbtxt` and make sure to include `--gpus=1` in the `docker run` command to set up the server. Include `dynamic_batching` per the instructions of the previous section in the model configuration. A point to note is that peak GPU utilization shot up to 74% (on an A100 in this case) while using just a single model instance with dynamic batching. Adding one more instance will definitely improve performance, but linear perf scaling will not be achieved in this case.
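Dynamic batching and multiple instances only pay off when requests actually arrive concurrently; perf_analyzer generates that load during benchmarking, and a real client can do the same by issuing requests in parallel. A rough sketch of such a client is shown here (illustrative only, assuming the `tritonclient[http]` package and the same `input.1` input used by the queries in this section); the perf_analyzer query for this configuration follows.

```python
# Illustrative sketch only: fire requests in parallel so the dynamic batcher has
# concurrent work to combine across model instances.
# Assumes `pip install tritonclient[http] numpy` and a server on localhost:8000.
from concurrent.futures import ThreadPoolExecutor

import numpy as np
import tritonclient.http as httpclient


def single_request(_):
    # One client per worker thread keeps the sketch simple and thread-safe.
    client = httpclient.InferenceServerClient(url="localhost:8000")
    image = np.random.rand(1, 1, 32, 100).astype(np.float32)
    infer_input = httpclient.InferInput("input.1", [1, 1, 32, 100], "FP32")
    infer_input.set_data_from_numpy(image)
    return client.infer("text_recognition", inputs=[infer_input])


with ThreadPoolExecutor(max_workers=16) as pool:
    results = list(pool.map(single_request, range(64)))

print(f"completed {len(results)} inferences")
```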
-```
+```bash
 # Query
 perf_analyzer -m text_recognition -b 2 --shape input.1:1,32,100 --concurrency-range 2:16:2 --percentile=95
diff --git a/Conceptual_Guide/Part_3-optimizing_triton_configuration/README.md b/Conceptual_Guide/Part_3-optimizing_triton_configuration/README.md
index 9540f057..0e8ed6a8 100644
--- a/Conceptual_Guide/Part_3-optimizing_triton_configuration/README.md
+++ b/Conceptual_Guide/Part_3-optimizing_triton_configuration/README.md
@@ -70,7 +70,7 @@ With Model Analyzer users can:
 Refer to Part 2 of this series to get access to the models. Refer to the Model Analyzer [installation guide](https://github.com/triton-inference-server/model_analyzer/blob/main/docs/install.md#recommended-installation-method) for more information about installing Model Analyzer. For ease of following along, use these commands to install model analyzer:
-```
+```bash
 sudo apt-get update && sudo apt-get install python3-pip
 sudo apt-get update && sudo apt-get install wkhtmltopdf
 pip3 install triton-model-analyzer
@@ -106,13 +106,13 @@ Consider the deployment of the text recognition model with a latency budget of `
 Note: The config file contains the shape of the query image.
 Refer to the Launch mode [documentation](https://github.com/triton-inference-server/model_analyzer/blob/main/docs/launch_modes.md) for more info about the launch mode flag.
-```
+```bash
 model-analyzer profile --model-repository /workspace/model_repository --profile-models text_recognition --triton-launch-mode=local --output-model-repository-path /workspace/output/ -f perf.yaml --override-output-model-repository --latency-budget 10 --run-config-search-mode quick
 ```
 Once the sweeps are done, users can then use `report` to summarize the top configurations.
-```
+```bash
 model-analyzer report --report-model-configs text_recognition_config_4,text_recognition_config_5,text_recognition_config_6 --export-path /workspace --config-file perf.yaml
 ```
diff --git a/Conceptual_Guide/Part_4-inference_acceleration/README.md b/Conceptual_Guide/Part_4-inference_acceleration/README.md
index 2798e57a..3f1d9976 100644
--- a/Conceptual_Guide/Part_4-inference_acceleration/README.md
+++ b/Conceptual_Guide/Part_4-inference_acceleration/README.md
@@ -64,7 +64,7 @@ There are three routes for users to use to convert their models to TensorRT: the
 That said, there are two main steps needed. First, convert the model to a TensorRT Engine. It is recommended to use the [TensorRT Container](https://catalog.ngc.nvidia.com/orgs/nvidia/containers/tensorrt) to run the command.
-```
+```bash
 trtexec --onnx=model.onnx \
         --saveEngine=model.plan \
         --explicitBatch
@@ -95,7 +95,7 @@ There are three options to accelerate the ONNX runtime: with `TensorRT` and `CUD
 In general, TensorRT will provide better optimizations than the CUDA execution provider; however, this depends on the exact structure of the model, more precisely, on the operators used in the network being accelerated. If all the operators are supported, conversion to TensorRT will yield better performance. When `TensorRT` is selected as the accelerator, all supported subgraphs are accelerated by TensorRT and the rest of the graph runs on the CUDA execution provider. Users can achieve this with the following additions to the config file.
 **TensorRT acceleration**
-```
+```text proto
 optimization {
   execution_accelerators {
     gpu_execution_accelerator : [ {
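When accelerators are wired in through the model configuration like this, it is worth confirming that the server actually loaded the `optimization` settings before re-running any benchmarks. A minimal sketch, assuming the `tritonclient[http]` package, a server on `localhost:8000`, and the `text_recognition` model name used throughout this series:

```python
# Illustrative sketch only. Prints the optimization block Triton loaded for the
# model, which should mirror the execution_accelerators additions in config.pbtxt.
# Assumes `pip install tritonclient[http]` and a server on localhost:8000.
import json

import tritonclient.http as httpclient

client = httpclient.InferenceServerClient(url="localhost:8000")
config = client.get_model_config("text_recognition")
print(json.dumps(config.get("optimization", {}), indent=2))
```

If the printed block is empty, the server was likely started against a repository that does not contain the updated `config.pbtxt`.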
@@ -112,7 +112,7 @@ There are a few other ONNX runtime specific optimizations. Refer to this section
 ## CPU Based Acceleration
 Triton Inference Server also supports acceleration for CPU-only models with [OpenVINO](https://docs.openvino.ai/latest/index.html). In the configuration file, users can add the following to enable CPU acceleration.
-```
+```text proto
 optimization {
   execution_accelerators {
     cpu_execution_accelerator : [{
@@ -133,7 +133,7 @@ On the other end of the spectrum, Deep Learning practitioners are drawn to Large
 ## Working Example
 Before proceeding, please set up a model repository for the Text Recognition model being used in Parts 1-3 of this series. Then, navigate to the model repository and launch two containers:
-```
+```bash
 # Server Container
 docker run --gpus=all -it --shm-size=256m --rm -p8000:8000 -p8001:8001 -p8002:8002 -v$(pwd):/workspace/ -v/$(pwd)/model_repository:/models nvcr.io/nvidia/tritonserver:22.11-py3 bash
@@ -150,7 +150,7 @@ While using ONNX RT there are some [general optimizations](https://github.com/tr
 With this context, let's launch the Triton Inference Server with the appropriate configuration file.
-```
+```bash
 tritonserver --model-repository=/models
 ```
 **NOTE: These benchmarks are just to illustrate the general curve of the performance gain. This is not the highest throughput obtainable via Triton as resource utilization features haven't been enabled (e.g. Dynamic Batching). Refer to the Model Analyzer tutorial for the best deployment configuration once model optimizations are done.**
@@ -158,7 +158,7 @@ tritonserver --model-repository=/models
 **NOTE**: These settings are to maximize throughput. Refer to the Model Analyzer tutorial, which covers managing latency requirements.
 For reference, the baseline performance is as follows:
-```
+```text
 Inferences/Second vs. Client Average Batch Latency
 Concurrency: 2, throughput: 4191.7 infer/sec, latency 7633 usec
 ```
@@ -167,7 +167,7 @@ Concurrency: 2, throughput: 4191.7 infer/sec, latency 7633 usec
 For this model, an exhaustive search for the best convolution algorithm is enabled. [Learn about more options](https://github.com/triton-inference-server/onnxruntime_backend#onnx-runtime-with-cuda-execution-provider-optimization).
-```
+```text proto
 ## Additions to Config
 parameters { key: "cudnn_conv_algo_search" value: { string_value: "0" } }
 parameters { key: "gpu_mem_limit" value: { string_value: "4294967200" } }
@@ -182,7 +182,7 @@ Concurrency: 2, throughput: 4257.9 infer/sec, latency 7672 usec
 ### ONNX RT execution on GPU w. TRT acceleration
 While specifying the use of the TensorRT Execution Provider, the CUDA Execution Provider is used as a fallback for operators not supported by TensorRT. It is recommended to use TensorRT natively if all operators are supported, as the performance boost and optimization options are considerably better. In this case, the TensorRT accelerator has been used with lower `FP16` precision.
-```
+```text proto
 ## Additions to Config
 optimization {
   graph : {
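The throughput and latency figures quoted in this section come from perf_analyzer, which remains the right tool for real measurements. For a quick ballpark check after switching accelerators, a short timing loop from Python is often enough; a rough sketch under the same assumptions as the earlier client sketches:

```python
# Illustrative sketch only; perf_analyzer remains the recommended measurement tool.
# Times a short burst of requests to get a rough latency figure after a config change.
# Assumes `pip install tritonclient[http] numpy` and a server on localhost:8000.
import time

import numpy as np
import tritonclient.http as httpclient

client = httpclient.InferenceServerClient(url="localhost:8000")
image = np.random.rand(1, 1, 32, 100).astype(np.float32)
infer_input = httpclient.InferInput("input.1", [1, 1, 32, 100], "FP32")
infer_input.set_data_from_numpy(image)

latencies = []
for _ in range(100):
    start = time.perf_counter()
    client.infer("text_recognition", inputs=[infer_input])
    latencies.append(time.perf_counter() - start)

latencies.sort()
print(f"median latency: {latencies[49] * 1e6:.0f} usec")
print(f"p95 latency:    {latencies[94] * 1e6:.0f} usec")
```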
@@ -208,7 +208,7 @@ Concurrency: 2, throughput: 11820.2 infer/sec, latency 2706 usec
 Triton users can also use OpenVINO for CPU deployment.
 This can be enabled via the following:
-```
+```text proto
 optimization {
   execution_accelerators {
     cpu_execution_accelerator : [ {
       name : "openvino"
diff --git a/Conceptual_Guide/Part_5-Model_Ensembles/README.md b/Conceptual_Guide/Part_5-Model_Ensembles/README.md
index dabf985f..f91aefdb 100644
--- a/Conceptual_Guide/Part_5-Model_Ensembles/README.md
+++ b/Conceptual_Guide/Part_5-Model_Ensembles/README.md
@@ -356,7 +356,7 @@ print(output_data)
 ```
 Now, run the full inference pipeline by executing the following command:
-```
+```bash
 python client.py
 ```
 You should see the parsed text printed out to your console.
diff --git a/Conceptual_Guide/Part_6-building_complex_pipelines/README.md b/Conceptual_Guide/Part_6-building_complex_pipelines/README.md
index 740fe487..ab04f2bf 100644
--- a/Conceptual_Guide/Part_6-building_complex_pipelines/README.md
+++ b/Conceptual_Guide/Part_6-building_complex_pipelines/README.md
@@ -47,7 +47,7 @@ In this example, the models are being run on:
 * Python Backend
 Both of the models deployed on a framework backend can be triggered using the following API:
-```
+```python
 encoding_request = pb_utils.InferenceRequest(
     model_name="text_encoder",
     requested_output_names=["last_hidden_state"],
@@ -66,13 +66,13 @@ Before starting, clone this repository and navigate to the root folder. Use thre
 ### Step 1: Prepare the Server Environment
 * First, run the Triton Inference Server Container.
-```
+```bash
 # Replace yy.mm with year and month of release. e.g. 22.08
 docker run --gpus=all -it --shm-size=256m --rm -p8000:8000 -p8001:8001 -p8002:8002 -v ${PWD}:/workspace/ -v ${PWD}/model_repository:/models nvcr.io/nvidia/tritonserver:yy.mm-py3 bash
 ```
 * Next, install all the dependencies required by the models running in the python backend and log in with your [huggingface token](https://huggingface.co/settings/tokens) (an account on [HuggingFace](https://huggingface.co/) is required).
-```
+```bash
 # PyTorch & Transformers Lib
 pip install torch torchvision torchaudio
 pip install transformers ftfy scipy accelerate
@@ -84,7 +84,7 @@ huggingface-cli login
 ### Step 2: Exporting and converting the models
 Use the NGC PyTorch container to export and convert the models.
-```
+```bash
 docker run -it --gpus all -p 8888:8888 -v ${PWD}:/mount nvcr.io/nvidia/pytorch:yy.mm-py3
 pip install transformers ftfy scipy
@@ -106,13 +106,13 @@ mv encoder.onnx model_repository/text_encoder/1/model.onnx
 ### Step 3: Launch the Server
 From the server container, launch the Triton Inference Server.
-```
+```bash
 tritonserver --model-repository=/models
 ```
 ### Step 4: Run the client
 Use the client container and run the client.
-```
+```bash
 docker run -it --net=host -v ${PWD}:/workspace/ nvcr.io/nvidia/tritonserver:yy.mm-py3-sdk bash
 # Client with no GUI