demos/continuous_batching/README.md (76 additions, 41 deletions)
@@ -1,41 +1,58 @@
# How to serve LLM models with Continuous Batching via OpenAI API {#ovms_demos_continuous_batching}

```{toctree}
---
maxdepth: 1
hidden:
---
ovms_demos_continuous_batching_accuracy
ovms_demos_continuous_batching_rag
ovms_demos_continuous_batching_scaling
```

This demo shows how to deploy LLM models in the OpenVINO Model Server using continuous batching and paged attention algorithms.
The text generation use case is exposed via the OpenAI API `chat/completions` and `completions` endpoints.
That makes it easy to use and efficient, especially on Intel® Xeon® processors.

> **Note:** This demo was tested on 4th - 6th generation Intel® Xeon® Scalable Processors, Intel® Arc™ GPU Series and Intel® Data Center GPU Series on Ubuntu22/24, RedHat8/9 and Windows11.
## Prerequisites
**Model preparation**: Python 3.9 or higher with pip and HuggingFace account
**Model Server deployment**: Installed Docker Engine or OVMS binary package according to the [baremetal deployment guide](../../docs/deploying_server_baremetal.md)
**(Optional) Client**: git and Python for using OpenAI client package and vLLM benchmark app
## Model preparation
Here, the original PyTorch LLM model and the tokenizer will be converted to IR format and optionally quantized.
That ensures faster initialization time, better performance and lower memory consumption.
LLM engine parameters will be defined inside the `graph.pbtxt` file.
Install python dependencies for the conversion script:
Run `export_model.py` script to download and quantize the model:
> **Note:** Before downloading the model, access must be requested. Follow the instructions on the [HuggingFace model page](https://huggingface.co/meta-llama/Meta-Llama-3-8B) to request access. When access is granted, create an authentication token in the HuggingFace account -> Settings -> Access Tokens page, then authenticate with `huggingface-cli login` and enter the token when prompted.
> **Note:** Change the `--weight-format` to quantize the model to `int8` or `int4` precision to reduce memory consumption and improve performance.
> **Note:** You can change the model used in the demo to any topology [tested](https://github.com/openvinotoolkit/openvino.genai/blob/master/tests/python_tests/models/real_models) with OpenVINO.
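
If you prefer to authenticate from Python instead of the CLI, the `huggingface_hub` package offers an equivalent login call. A minimal sketch (the token value is a placeholder, not taken from this demo):

```python
# Programmatic alternative to `huggingface-cli login`.
# Requires the huggingface_hub package (typically pulled in by the export script
# dependencies, or install it with `pip install huggingface_hub`).
from huggingface_hub import login

# Paste the access token created under HuggingFace Settings -> Access Tokens.
login(token="hf_xxx")  # placeholder token, replace with your own
```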
You should have a model folder like below:
```
models
...
└── tokenizer.json
```
The default configuration should work in most cases but the parameters can be tuned via `export_model.py` script arguments. Run the script with `--help` argument to check available parameters and see the [LLM calculator documentation](../../docs/llm/reference.md) to learn more about configuration options.
## Server Deployment
:::{dropdown} **Deploying with Docker**
Select a deployment option depending on how you prepared the models in the previous step.
**CPU**
Running this command starts the container with CPU as the only target device:
In case you want to use a GPU device to run the generation, add the extra docker parameters `--device /dev/dri --group-add=$(stat -c "%g" /dev/dri/render* | head -n 1)`
to the `docker run` command and use the image with GPU support. Export the models with precision matching the GPU capacity and adjust the pipeline configuration.

Assuming you have unpacked the model server package, make sure to:
- **On Windows**: run `setupvars` script
- **On Linux**: set `LD_LIBRARY_PATH` and `PATH` environment variables

as mentioned in the [deployment guide](../../docs/deploying_server_baremetal.md), in every new shell that will start OpenVINO Model Server.

Depending on how you prepared the models in the first step of this demo, they are deployed to either CPU or GPU (as defined in `config.json`). If you run on GPU, make sure the appropriate drivers are installed so the device is accessible for the model server.
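
Before sending generation requests, you can verify that the servable has loaded. A minimal sketch using the model server configuration status endpoint (assuming the REST API listens on port 8000; adjust to your deployment):

```python
# Query the OpenVINO Model Server config status endpoint to confirm the
# servable is loaded before issuing generation requests.
import requests  # assumes the requests package is installed

resp = requests.get("http://localhost:8000/v1/config", timeout=5)
resp.raise_for_status()
print(resp.json())  # prints the servable name, version and state
```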
A single servable exposes both `chat/completions` and `completions` endpoints with and without stream capabilities.
The chat endpoint is expected to be used for scenarios where the conversation context is passed by the client and the model prompt is created by the server based on the Jinja model template.
The completions endpoint should be used to pass the prompt directly from the client and for models without a Jinja template.
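
For a quick check of the chat endpoint without streaming, a minimal sketch with the OpenAI Python client is shown below. The base URL, port and model name are assumptions matching the defaults used in this demo; adjust them to your deployment:

```python
from openai import OpenAI

# Assumed defaults: REST port 8000 and the OpenAI-compatible API served under /v3.
client = OpenAI(base_url="http://localhost:8000/v3", api_key="unused")

response = client.chat.completions.create(
    model="meta-llama/Meta-Llama-3-8B-Instruct",  # use the servable name from your config.json
    messages=[{"role": "user", "content": "Say this is a test"}],
    max_tokens=64,
    stream=False,
)
print(response.choices[0].message.content)
```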
:::{dropdown} **Streaming call with OpenAI Python package**
The `chat/completions` endpoint is compatible with the OpenAI client, so it can easily be used to generate code, also in streaming mode:
Install the client library:
```console
pip3 install openai
```
```python
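# The exact example code is collapsed in this diff view; below is a minimal
# streaming sketch. Assumptions: REST port 8000, OpenAI-compatible API under /v3,
# and a servable named meta-llama/Meta-Llama-3-8B-Instruct.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v3", api_key="unused")

stream = client.chat.completions.create(
    model="meta-llama/Meta-Llama-3-8B-Instruct",
    messages=[{"role": "user", "content": "Say this is a test"}],
    stream=True,
)
for chunk in stream:
    # Each chunk carries an incremental fragment of the generated answer.
    if chunk.choices and chunk.choices[0].delta.content is not None:
        print(chunk.choices[0].delta.content, end="", flush=True)
print()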
```

Similar code can be applied for the completions endpoint:
```console
pip3 install openai
```
```python
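# The exact example code is collapsed in this diff view; below is a minimal
# sketch for the completions endpoint. Assumptions: REST port 8000,
# OpenAI-compatible API under /v3, servable named meta-llama/Meta-Llama-3-8B-Instruct.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v3", api_key="unused")

stream = client.completions.create(
    model="meta-llama/Meta-Llama-3-8B-Instruct",
    prompt="Say this is a test",
    stream=True,
    max_tokens=64,
)
for chunk in stream:
    if chunk.choices and chunk.choices[0].text is not None:
        print(chunk.choices[0].text, end="", flush=True)
print()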
```
Output:
```
It looks like you're testing me!
```
:::
## Benchmarking text generation with high concurrency

OpenVINO Model Server employs efficient parallelization for text generation. It can generate text with high concurrency in an environment shared by multiple clients.
This can be demonstrated using the benchmarking app from the vLLM repository:
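
The exact benchmark invocation is collapsed in this diff view. As an illustration of the idea only, a hedged sketch that sends many parallel requests with the OpenAI client and a thread pool (endpoint URL and model name are assumptions, and this is not a replacement for the vLLM benchmark app):

```python
# Fire many concurrent generation requests to observe continuous batching under load.
from concurrent.futures import ThreadPoolExecutor
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v3", api_key="unused")  # assumed defaults

def one_request(i: int) -> int:
    response = client.chat.completions.create(
        model="meta-llama/Meta-Llama-3-8B-Instruct",  # assumed servable name
        messages=[{"role": "user", "content": f"Request {i}: write one sentence about OpenVINO."}],
        max_tokens=32,
    )
    return len(response.choices[0].message.content)

with ThreadPoolExecutor(max_workers=32) as pool:
    lengths = list(pool.map(one_request, range(64)))

print(f"Completed {len(lengths)} requests, average reply length {sum(lengths) / len(lengths):.1f} characters")
```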
When the model server is deployed and serving all 3 endpoints, run the [jupyter notebook](https://github.com/openvinotoolkit/model_server/blob/main/demos/continuous_batching/rag/rag_demo.ipynb) to use the RAG chain with fully remote execution.
demos/continuous_batching/scaling/README.md (3 additions, 1 deletion)
@@ -1,4 +1,6 @@
# Scaling on a dual CPU socket server {#ovms_demos_continuous_batching_scaling}
> **Note**: This demo uses Docker and has been tested only on Linux hosts

Text generation in OpenVINO Model Server with continuous batching is most efficient on a single CPU socket. OpenVINO ensures the load is constrained to a single NUMA node.
That ensures fast memory access from the node and avoids cross-socket communication.