Commit 1b5032b

LLM demos adjustments for Windows (#2940)
1 parent 4689667 commit 1b5032b

15 files changed (+656, -379 lines)

demos/continuous_batching/README.md

Lines changed: 76 additions & 41 deletions
@@ -1,41 +1,58 @@
# How to serve LLM models with Continuous Batching via OpenAI API {#ovms_demos_continuous_batching}
+
+```{toctree}
+---
+maxdepth: 1
+hidden:
+---
+ovms_demos_continuous_batching_accuracy
+ovms_demos_continuous_batching_rag
+ovms_demos_continuous_batching_scaling
+```
+
This demo shows how to deploy LLM models in the OpenVINO Model Server using continuous batching and paged attention algorithms.
Text generation use case is exposed via OpenAI API `chat/completions` and `completions` endpoints.
That makes it easy to use and efficient, especially on Intel® Xeon® processors.

-> **Note:** This demo was tested on Intel® Xeon® processors Gen4 and Gen5 and Intel dGPU ARC and Flex models on Ubuntu22/24 and RedHat8/9.
+> **Note:** This demo was tested on 4th-6th generation Intel® Xeon® Scalable Processors, Intel® Arc™ GPU Series and Intel® Data Center GPU Series on Ubuntu 22/24, RedHat 8/9 and Windows 11.

-## Get the docker image
+## Prerequisites
+
+**Model preparation**: Python 3.9 or higher with pip and a HuggingFace account
+
+**Model Server deployment**: Docker Engine installed, or the OVMS binary package set up according to the [baremetal deployment guide](../../docs/deploying_server_baremetal.md)
+
+**(Optional) Client**: git and Python for using the OpenAI client package and the vLLM benchmark app

-Build the image from source to try the latest enhancements in this feature.
-```bash
-git clone https://github.com/openvinotoolkit/model_server.git
-cd model_server
-make release_image GPU=1
-```
-It will create an image called `openvino/model_server:latest`.
-> **Note:** This operation might take 40min or more depending on your build host.
-> **Note:** `GPU` parameter in image build command is needed to include dependencies for GPU device.
-> **Note:** The public image from the last release might be not compatible with models exported using the the latest export script. Check the [demo version from the last release](https://github.com/openvinotoolkit/model_server/tree/releases/2024/4/demos/continuous_batching) to use the public docker image.

## Model preparation
-> **Note** Python 3.9 or higher is need for that step
Here, the original PyTorch LLM model and the tokenizer will be converted to IR format and optionally quantized.
That ensures faster initialization time, better performance and lower memory consumption.
LLM engine parameters will be defined inside the `graph.pbtxt` file.

-Install python dependencies for the conversion script:
-```bash
-pip3 install -U -r demos/common/export_models/requirements.txt
+Download the export script, install its dependencies and create a directory for the models:
+```console
+curl https://raw.githubusercontent.com/openvinotoolkit/model_server/refs/heads/main/demos/common/export_models/export_model.py -o export_model.py
+pip3 install -r https://raw.githubusercontent.com/openvinotoolkit/model_server/refs/heads/main/demos/common/export_models/requirements.txt
+mkdir models
```

-Run optimum-cli to download and quantize the model:
-```bash
-mkdir models
-python demos/common/export_models/export_model.py text_generation --source_model meta-llama/Meta-Llama-3-8B-Instruct --weight-format fp16 --kv_cache_precision u8 --config_file_path models/config.json --model_repository_path models
+Run the `export_model.py` script to download and quantize the model:
+
+> **Note:** Before downloading the model, access must be requested. Follow the instructions on the [HuggingFace model page](https://huggingface.co/meta-llama/Meta-Llama-3-8B) to request access. When access is granted, create an authentication token in the HuggingFace account -> Settings -> Access Tokens page, then authenticate via `huggingface-cli login` and enter the token when prompted.
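For reference, the authentication itself is a single CLI command. The sketch below assumes the `huggingface_hub` package (which provides `huggingface-cli`) is available; it may already have been pulled in by the export script requirements:
```console
pip3 install huggingface_hub
huggingface-cli login
```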
+
+**CPU**
+```console
+python export_model.py text_generation --source_model meta-llama/Meta-Llama-3-8B-Instruct --weight-format fp16 --kv_cache_precision u8 --config_file_path models/config.json --model_repository_path models --overwrite_models
```
+
+**GPU**
+```console
+python export_model.py text_generation --source_model meta-llama/Meta-Llama-3-8B-Instruct --weight-format int4 --target_device GPU --cache_size 2 --config_file_path models/config.json --model_repository_path models --overwrite_models
+```
+
> **Note:** Change the `--weight-format` to quantize the model to `int8` or `int4` precision to reduce memory consumption and improve performance.
-> **Note:** Before downloading the model, access must be requested. Follow the instructions on the [HuggingFace model page](https://huggingface.co/meta-llama/Meta-Llama-3-8B) to request access. When access is granted, create an authentication token in the HuggingFace account -> Settings -> Access Tokens page. Issue the following command and enter the authentication token. Authenticate via `huggingface-cli login`.
+
> **Note:** You can change the model used in the demo to any topology [tested](https://github.com/openvinotoolkit/openvino.genai/blob/master/tests/python_tests/models/real_models) with OpenVINO.

You should have a model folder like below:
@@ -59,33 +76,50 @@ models
└── tokenizer.json
```

-The default configuration of the `LLMExecutor` should work in most cases but the parameters can be tuned inside the `node_options` section in the `graph.pbtxt` file.
-Note that the `models_path` parameter in the graph file can be an absolute path or relative to the `base_path` from `config.json`.
-Check the [LLM calculator documentation](../../docs/llm/reference.md) to learn about configuration options.
+The default configuration should work in most cases, but the parameters can be tuned via `export_model.py` script arguments. Run the script with the `--help` argument to check the available parameters and see the [LLM calculator documentation](../../docs/llm/reference.md) to learn more about configuration options.
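For example, assuming the script downloaded above, the tunable text generation options can be listed with:
```console
python export_model.py text_generation --help
```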
+
+## Server Deployment
+
+:::{dropdown} **Deploying with Docker**

-## Start-up
+Select a deployment option depending on how you prepared the models in the previous step.

-### CPU
+**CPU**

Running this command starts the container with a CPU-only target device:
```bash
docker run -d --rm -p 8000:8000 -v $(pwd)/models:/workspace:ro openvino/model_server:latest --rest_port 8000 --config_path /workspace/config.json
```
-### GPU
+**GPU**

In case you want to use a GPU device to run the generation, add the extra docker parameters `--device /dev/dri --group-add=$(stat -c "%g" /dev/dri/render* | head -n 1)`
to the `docker run` command and use the image with GPU support. Export the models with precision matching the GPU capacity and adjust the pipeline configuration.
It can be applied using the commands below:
```bash
-python demos/common/export_models/export_model.py text_generation --source_model meta-llama/Meta-Llama-3-8B-Instruct --weight-format int4 --target_device GPU --cache_size 2 --config_file_path models/config.json --model_repository_path models --overwrite_models
-
docker run -d --rm -p 8000:8000 --device /dev/dri --group-add=$(stat -c "%g" /dev/dri/render* | head -n 1) -v $(pwd)/models:/workspace:ro openvino/model_server:latest-gpu --rest_port 8000 --config_path /workspace/config.json
```
+:::
+
+:::{dropdown} **Deploying on Bare Metal**
+
+Assuming you have unpacked the model server package, make sure to:
+
+- **On Windows**: run the `setupvars` script
+- **On Linux**: set the `LD_LIBRARY_PATH` and `PATH` environment variables
+
+as mentioned in the [deployment guide](../../docs/deploying_server_baremetal.md), in every new shell that will start OpenVINO Model Server.
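For illustration only, a Linux shell could be prepared as in the sketch below; `/opt/ovms` is an assumed extraction directory, adjust it to wherever the package was unpacked:
```bash
# Assumed location of the unpacked OVMS binary package (adjust to your setup)
export OVMS_DIR=/opt/ovms
# Expose the bundled libraries and the ovms binary in this shell
export LD_LIBRARY_PATH=$OVMS_DIR/lib:$LD_LIBRARY_PATH
export PATH=$OVMS_DIR/bin:$PATH
```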
+
+Depending on how you prepared the models in the first step of this demo, they are deployed to either CPU or GPU (as defined in `config.json`). If you run on GPU, make sure the appropriate drivers are installed, so the device is accessible for the model server.
+
+```bat
+ovms --rest_port 8000 --config_path ./models/config.json
+```
+:::

-### Check readiness
+## Readiness Check

Wait for the model to load. You can check the status with a simple command:
-```bash
+```console
curl http://localhost:8000/v1/config
```
```json
@@ -105,14 +139,14 @@ curl http://localhost:8000/v1/config
}
```

-## Client code
+## Request Generation

A single servable exposes both `chat/completions` and `completions` endpoints with and without stream capabilities.
The chat endpoint is expected to be used for scenarios where the conversation context is passed by the client and the model prompt is created by the server based on the Jinja model template.
The completion endpoint should be used to pass the prompt directly from the client and for models without a Jinja template.

-### Unary:
-```bash
+:::{dropdown} **Unary call with cURL**
+```console
curl http://localhost:8000/v3/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
@@ -156,7 +190,7 @@ curl http://localhost:8000/v3/chat/completions \
```

A similar call can be made with a `completion` endpoint:
-```bash
+```console
curl http://localhost:8000/v3/completions \
  -H "Content-Type: application/json" \
  -d '{
@@ -186,13 +220,14 @@ curl http://localhost:8000/v3/completions \
}
}
```
+:::

-### Streaming:
+:::{dropdown} **Streaming call with OpenAI Python package**

The `chat/completions` endpoint is compatible with the OpenAI client, so it can be easily used to generate text, also in streaming mode:

Install the client library:
-```bash
+```console
pip3 install openai
```
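For illustration, a minimal streaming sketch with the OpenAI client could look like the one below. It assumes the server deployed earlier in this demo is reachable at `http://localhost:8000/v3` and serves `meta-llama/Meta-Llama-3-8B-Instruct`; error handling is omitted:
```python
from openai import OpenAI

# Assumed local OVMS endpoint and model name from the earlier steps of this demo
client = OpenAI(base_url="http://localhost:8000/v3", api_key="unused")

stream = client.chat.completions.create(
    model="meta-llama/Meta-Llama-3-8B-Instruct",
    messages=[{"role": "user", "content": "Say this is a test"}],
    stream=True,
)
for chunk in stream:
    # Each chunk carries a delta with the next generated text fragment
    if chunk.choices and chunk.choices[0].delta.content:
        print(chunk.choices[0].delta.content, end="", flush=True)
```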
```python
@@ -219,7 +254,7 @@ It looks like you're testing me!
```

Similar code can be applied for the completion endpoint:
-```bash
+```console
pip3 install openai
```
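Analogously, a minimal sketch for the `completions` endpoint, under the same assumptions about the local server and model name:
```python
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v3", api_key="unused")

stream = client.completions.create(
    model="meta-llama/Meta-Llama-3-8B-Instruct",
    prompt="Say this is a test",
    stream=True,
)
for chunk in stream:
    # Completion chunks expose the generated text directly
    if chunk.choices and chunk.choices[0].text:
        print(chunk.choices[0].text, end="", flush=True)
```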
```python
@@ -244,18 +279,18 @@ Output:
```
It looks like you're testing me!
```
-
+:::

## Benchmarking text generation with high concurrency

OpenVINO Model Server employs efficient parallelization for text generation. It can be used to generate text with high concurrency in an environment shared by multiple clients.
It can be demonstrated using the benchmarking app from the vLLM repository:
-```bash
+```console
git clone --branch v0.6.0 --depth 1 https://github.com/vllm-project/vllm
cd vllm
pip3 install -r requirements-cpu.txt --extra-index-url https://download.pytorch.org/whl/cpu
cd benchmarks
-wget https://huggingface.co/datasets/anon8231489123/ShareGPT_Vicuna_unfiltered/resolve/main/ShareGPT_V3_unfiltered_cleaned_split.json # sample dataset
+curl https://huggingface.co/datasets/anon8231489123/ShareGPT_Vicuna_unfiltered/resolve/main/ShareGPT_V3_unfiltered_cleaned_split.json -o ShareGPT_V3_unfiltered_cleaned_split.json # sample dataset
python benchmark_serving.py --host localhost --port 8000 --endpoint /v3/chat/completions --backend openai-chat --model meta-llama/Meta-Llama-3-8B-Instruct --dataset-path ShareGPT_V3_unfiltered_cleaned_split.json --num-prompts 1000 --request-rate inf

Namespace(backend='openai-chat', base_url=None, host='localhost', port=8000, endpoint='/v3/chat/completions', dataset=None, dataset_name='sharegpt', dataset_path='ShareGPT_V3_unfiltered_cleaned_split.json', model='meta-llama/Meta-Llama-3-8B-Instruct', tokenizer=None, best_of=1, use_beam_search=False, num_prompts=1000, sharegpt_output_len=None, sonnet_input_len=550, sonnet_output_len=150, sonnet_prefix_len=200, random_input_len=1024, random_output_len=128, random_range_ratio=1.0, request_rate=inf, seed=0, trust_remote_code=False, disable_tqdm=False, profile=False, save_result=False, metadata=None, result_dir=None, result_filename=None, percentile_metrics='ttft,tpot,itl', metric_percentiles='99')

demos/continuous_batching/accuracy/README.md

Lines changed: 22 additions & 11 deletions
@@ -1,4 +1,4 @@
-# Testing LLM serving accuracy
+# Testing LLM serving accuracy {#ovms_demos_continuous_batching_accuracy}

This guide shows how to access an LLM model over a serving endpoint.

@@ -7,25 +7,36 @@ It reports end to end quality of served model from the client application point

## Preparing the lm-evaluation-harness framework

-Install the framework via:
-```bash
-export PIP_EXTRA_INDEX_URL="https://download.pytorch.org/whl/cpu"
-pip3 install lm_eval[api] langdetect immutabledict
+Install the framework via pip:
+```console
+pip3 install --extra-index-url "https://download.pytorch.org/whl/cpu" lm_eval[api] langdetect immutabledict
```

-## Exporting the models and starting the model server
-```bash
+## Exporting the models
+```console
git clone https://github.com/openvinotoolkit/model_server.git
cd model_server
pip3 install -U -r demos/common/export_models/requirements.txt
mkdir models
python demos/common/export_models/export_model.py text_generation --source_model meta-llama/Meta-Llama-3-8B-Instruct --weight-format fp16 --kv_cache_precision u8 --config_file_path models/config.json --model_repository_path models
python demos/common/export_models/export_model.py text_generation --source_model meta-llama/Meta-Llama-3-8B --weight-format fp16 --kv_cache_precision u8 --config_file_path models/config.json --model_repository_path models
+```
+
+## Starting the model server
+
+### With Docker
+```bash
docker run -d --rm -p 8000:8000 -v $(pwd)/models:/workspace:ro openvino/model_server:latest --rest_port 8000 --config_path /workspace/config.json
```
+
+### On Baremetal
+```bat
+ovms --rest_port 8000 --config_path ./models/config.json
+```
+
## Running the tests

-```bash
+```console
lm-eval --model local-chat-completions --tasks gsm8k --model_args model=meta-llama/Meta-Llama-3-8B-Instruct,base_url=http://localhost:8000/v3/chat/completions,num_concurrent=1,max_retries=3,tokenized_requests=False --verbosity DEBUG --log_samples --output_path test/ --seed 1 --apply_chat_template --limit 100

local-chat-completions (model=meta-llama/Meta-Llama-3-8B-Instruct,base_url=http://localhost:8000/v3/chat/completions,num_concurrent=10,max_retries=3,tokenized_requests=False), gen_kwargs: (None), limit: 100.0, num_fewshot: None, batch_size: 1
@@ -37,7 +48,7 @@ local-chat-completions (model=meta-llama/Meta-Llama-3-8B-Instruct,base_url=http:

While testing the non-chat model and the `completion` endpoint, the command would look like this:

-```bash
+```console
lm-eval --model local-completions --tasks gsm8k --model_args model=meta-llama/Meta-Llama-3-8B,base_url=http://localhost:8000/v3/completions,num_concurrent=1,max_retries=3,tokenized_requests=False --verbosity DEBUG --log_samples --output_path results/ --seed 1 --limit 100

local-completions (model=meta-llama/Meta-Llama-3-8B,base_url=http://localhost:8000/v3/completions,num_concurrent=10,max_retries=3,tokenized_requests=False), gen_kwargs: (None), limit: 100.0, num_fewshot: None, batch_size: 1
@@ -49,11 +60,11 @@ local-completions (model=meta-llama/Meta-Llama-3-8B,base_url=http://localhost:80

Other examples are below:

-```bash
+```console
lm-eval --model local-chat-completions --tasks leaderboard_ifeval --model_args model=meta-llama/Meta-Llama-3-8B-Instruct,base_url=http://localhost:8000/v3/chat/completions,num_concurrent=10,max_retries=3,tokenized_requests=False --verbosity DEBUG --log_samples --output_path test/ --seed 1 --limit 100 --apply_chat_template
```

-```bash
+```console
lm-eval --model local-completions --tasks wikitext --model_args model=meta-llama/Meta-Llama-3-8B,base_url=http://localhost:8000/v3/completions,num_concurrent=10,max_retries=3,tokenized_requests=False --verbosity DEBUG --log_samples --output_path test/ --seed 1 --limit 100
```

demos/continuous_batching/rag/README.md

Lines changed: 9 additions & 2 deletions
@@ -1,9 +1,9 @@
-# RAG demo with all execution steps delegated to the OpenVINO Model Server {#ovms_demos_rag}
+# RAG demo with all execution steps delegated to the OpenVINO Model Server {#ovms_demos_continuous_batching_rag}


## Creating models repository for all the endpoints

-```bash
+```console
git clone https://github.com/openvinotoolkit/model_server
cd model_server/demos/common/export_models
pip install -q -r requirements.txt
@@ -16,10 +16,17 @@ python export_model.py rerank --source_model BAAI/bge-reranker-large --weight-fo

## Deploying the model server

+
+### With Docker
```bash
docker run -d --rm -p 8000:8000 -v $(pwd)/models:/workspace:ro openvino/model_server:latest --rest_port 8000 --config_path /workspace/config_all.json
```

+### On Baremetal
+```bat
+ovms --rest_port 8000 --config_path ./models/config_all.json
+```
+
## Using RAG

When the model server is deployed and serving all 3 endpoints, run the [Jupyter notebook](https://github.com/openvinotoolkit/model_server/blob/main/demos/continuous_batching/rag/rag_demo.ipynb) to use the RAG chain with fully remote execution.

demos/continuous_batching/scaling/README.md

Lines changed: 3 additions & 1 deletion
@@ -1,4 +1,6 @@
-# Scaling on a dual CPU socket server
+# Scaling on a dual CPU socket server {#ovms_demos_continuous_batching_scaling}
+
+> **Note**: This demo uses Docker and has been tested only on Linux hosts

Text generation in OpenVINO Model Server with continuous batching is most efficient on a single CPU socket. OpenVINO ensures the load is constrained to a single NUMA node.
That ensures fast memory access from the node and avoids intra-socket communication.
