docs/tutorial_Performance-Testing.md
Lines changed: 16 additions & 16 deletions
@@ -1,6 +1,6 @@
- # Benchmarking LLM Performance on Jetson
+ # Performance Testing: LLMs and VLMs on Jetson
- In this tutorial, we will walk you through benchmarking the performance of Large Language Models (LLMs) and Vision Language Models (VLMs) on your Jetson. For this guide, we'll use vLLM as our inference engine of choice due to its high throughput and efficiency. We'll focus on measuring the model's speed and performance, which are critical to give you an idea of how your system will react under different loads.
+ In this tutorial, we will walk you through performance testing of Large Language Models (LLMs) and Vision Language Models (VLMs) on your Jetson. For this guide, we'll use vLLM as our inference engine of choice due to its high throughput and efficiency. We'll focus on measuring the model's speed and performance, which are critical to give you an idea of how your system will react under different loads.
We will begin by serving the model, focusing on the key arguments to pass to vLLM. Then, we will capture and analyze the most critical metrics from our benchmark.
@@ -27,7 +27,7 @@ In this scenario, a low TTFT is critical, as it's the time from the camera seein
## 1. Preparing Your Jetson Environment
- First, before starting the benchmark, we recommend you reboot the unit to make sure we are starting from a clean state. We also recommend setting your Jetson to MAXN mode.
+ First, before starting the performance test, we recommend you reboot the unit to make sure we are starting from a clean state. We also recommend setting your Jetson to MAXN mode.
You can do that by executing the following command.
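The tutorial's exact command falls just outside this diff hunk. As a reference sketch only (the MAXN power-mode index varies between Jetson modules, so the `-m 0` value here is an assumption worth confirming with `nvpmodel -q`), switching to MAXN and locking the clocks typically looks like this:

```bash
# Inspect the current power mode, then switch to MAXN and pin clocks for stable measurements
sudo nvpmodel -q            # show the active power mode
sudo nvpmodel -m 0          # select MAXN (confirm the mode index for your module)
sudo jetson_clocks          # lock clocks at their maximum
```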
@@ -45,16 +45,16 @@ Pull the container using the following command:
docker pull nvcr.io/nvidia/vllm:25.09-py3
```
- ## 2. The Benchmarking Workflow
+ ## 2. The Performance Testing Workflow
- The benchmarking process requires two separate terminals because we need one to serve the model and another to send benchmark requests to it.
+ The performance testing process requires two separate terminals because we need one to serve the model and another to send test requests to it.
### Step 1: Open Two Terminals
Open two terminal windows on your Jetson. We will refer to them as:
- Terminal 1 (Serving Terminal)
- - Terminal 2 (Benchmark Terminal)
+ - Terminal 2 (Testing Terminal)
### Step 2: Launch the Container
@@ -95,7 +95,7 @@ It is Llama 3.1 8B Instruct W4A16 quantized. But you can replace that checkpoint
**What these arguments mean:**
- -`VLLM_ATTENTION_BACKEND=FLASHINFER`: We explicitly set this environment variable to use the FlashInfer backend. FlashInfer is a highly optimized library that significantly speeds up the core self-attention mechanism on NVIDIA GPUs by reducing memory traffic. Setting this ensures we are leveraging the fastest possible implementation for our benchmark. However, some models may not be fully compatible and could give a "CUDA Kernel not supported" error. If this happens, you can simply try an alternative like `VLLM_ATTENTION_BACKEND=FLASH_ATTN`.
+ -`VLLM_ATTENTION_BACKEND=FLASHINFER`: We explicitly set this environment variable to use the FlashInfer backend. FlashInfer is a highly optimized library that significantly speeds up the core self-attention mechanism on NVIDIA GPUs by reducing memory traffic. Setting this ensures we are leveraging the fastest possible implementation for our performance test. However, some models may not be fully compatible and could give a "CUDA Kernel not supported" error. If this happens, you can simply try an alternative like `VLLM_ATTENTION_BACKEND=FLASH_ATTN`.
-`--gpu-memory-utilization 0.8`: lets vLLM use ~80% of the total memory. Model weights load first and the remaining capacity within that 80% is pre-allocated to the KV cache.
-`--max-seq-len 32000`: Sets an upper bound on the model's context window (i.e. prompt tokens + output tokens) for a single request. vLLM will attempt to enforce this as the maximum total token count in memory for that request.
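Putting the arguments above together, a serve invocation looks roughly like the sketch below. The checkpoint placeholder is hypothetical, and note that upstream vLLM spells the context-window limit `--max-model-len`, so defer to the exact command shown earlier in the tutorial:

```bash
# Rough sketch: serve a W4A16-quantized Llama 3.1 8B checkpoint with the arguments discussed above
VLLM_ATTENTION_BACKEND=FLASHINFER vllm serve <your-w4a16-llama-3.1-8b-checkpoint> \
    --gpu-memory-utilization 0.8 \
    --max-model-len 32000
```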
@@ -111,9 +111,9 @@ Leave this terminal running. Do not close it.
### Step 4: Warm Up the Model (Terminal 2)
- Before we run the real benchmark, we need to perform a "warm-up." This is a practice run that populates vLLM's internal caches, especially the [prefix cache](https://docs.vllm.ai/en/latest/features/automatic_prefix_caching.html), allowing it to achieve its true peak performance during the actual test.
+ Before we run the real performance test, we need to perform a "warm-up." This is a practice run that populates vLLM's internal caches, especially the [prefix cache](https://docs.vllm.ai/en/latest/features/automatic_prefix_caching.html), allowing it to achieve its true peak performance during the actual test.
- In your Benchmark Terminal, run this command. The results from this run should be ignored.
+ In your Testing Terminal, run this command. The results from this run should be ignored.
```bash
vllm bench serve \
@@ -130,7 +130,7 @@ vllm bench serve \
We set `--random-input-len 2048` and `--random-output-len 128`. For a RAG-style workload, increase `--random-input-len` to account for the retrieved context.
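As a rough sketch (the 8192/256 token counts and the checkpoint placeholder below are assumptions, not values from the tutorial), a RAG-flavoured run simply gives the synthetic prompts more room:

```bash
# Same benchmark shape, but with space for retrieved context in each prompt
vllm bench serve \
    --model <your-model-checkpoint> \
    --dataset-name random \
    --random-input-len 8192 \
    --random-output-len 256
```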
- However, if you are benchmarking a Vision Language Model, it is crucial to use a dataset that includes images for meaningful results. For a VLM, you would need to change the `--dataset-name` argument and swap it with the right argument to load the dataset of your choice. We recommend using `lmarena-ai/vision-arena-bench-v0.1`.
+ However, if you are testing a Vision Language Model, it is crucial to use a dataset that includes images for meaningful results. For a VLM, you would need to change the `--dataset-name` argument (and its accompanying dataset flags) so that the image dataset of your choice is loaded. We recommend using `lmarena-ai/vision-arena-bench-v0.1`.
The final command will look like this for the VLM:
@@ -148,13 +148,13 @@ vllm bench serve \
For more information about the flags that `vllm bench serve` accepts, please check out [vLLM's documentation page](https://docs.vllm.ai/en/v0.10.1/cli/bench/serve.html#options).
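You can also list the available options straight from the CLI, which is useful if the container ships a different vLLM version than the pinned documentation:

```bash
# Print every flag supported by the bench serve subcommand in this container
vllm bench serve --help
```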
- For the rest of the tutorial, we will be continuing with our example on the Llama 3.1 8B Instruct benchmark.
+ For the rest of the tutorial, we will be continuing with our example on the Llama 3.1 8B Instruct performance test.
- ### Step 5: Run the Official Benchmark (Terminal 2)
+ ### Step 5: Run the Official Performance Test (Terminal 2)
Now you're ready to collect the real performance data. We will run the test twice: once to measure single-user performance and once to simulate a heavier load.
This test simulates 8 users sending requests at the same time to see how the system performs under load.
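As a hedged sketch of such a run (the prompt count and checkpoint placeholder are assumptions, not the tutorial's exact values), the load level is set with `--max-concurrency`:

```bash
# Keep 8 requests in flight at once to emulate 8 simultaneous users
vllm bench serve \
    --model <your-model-checkpoint> \
    --dataset-name random \
    --random-input-len 2048 \
    --random-output-len 128 \
    --num-prompts 64 \
    --max-concurrency 8
```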
@@ -186,7 +186,7 @@ vllm bench serve \
## 3. Analyzing Your Results
- After your benchmark runs, you will get a summary table. Let's break down what the key numbers mean using the sample output below.
+ After your performance test runs, you will get a summary table. Let's break down what the key numbers mean using the sample output below.
```
============ Serving Benchmark Result ============
@@ -240,7 +240,7 @@ When you compare your results, you'll likely see a trade-off:
- Going from concurrency 1 to 8, the Output Token Throughput should increase significantly. The system is doing more total work.
- However, the Mean TTFT and Mean ITL will also likely increase. Since the Jetson is now splitting its time between 8 requests instead of 1, each individual request takes longer to process.
- This is the classic trade-off between overall capacity and individual user experience. Your benchmark results help you find the right balance for your application.
+ This is the classic trade-off between overall capacity and individual user experience. Your performance test results help you find the right balance for your application.
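One quick way to make this trade-off concrete is to convert the aggregate numbers into a per-request view. The figures below are hypothetical placeholders, not measured results; substitute the values from your own summary table:

```bash
# Per-request decode speed ≈ Output Token Throughput / concurrency
awk 'BEGIN { printf "~%.1f tok/s per request\n", 250 / 8 }'                   # 250 tok/s aggregate (assumed), 8 users

# Rough end-to-end latency for one reply ≈ TTFT + (output_len - 1) * ITL
awk 'BEGIN { printf "~%.2f s for a 128-token reply\n", 0.35 + 127 * 0.040 }'  # 350 ms TTFT, 40 ms ITL (assumed)
```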
!!! note
    The term "user" in this tutorial refers to whatever consumes the model's output: a robotic application running the model on a drone or a humanoid, or simply you using the Jetson as local LLM inference hardware.