diff --git a/demos/continuous_batching/README.md b/demos/continuous_batching/README.md index 4bcd961c23..0557a49e3f 100644 --- a/demos/continuous_batching/README.md +++ b/demos/continuous_batching/README.md @@ -1,41 +1,58 @@ # How to serve LLM models with Continuous Batching via OpenAI API {#ovms_demos_continuous_batching} + +```{toctree} +--- +maxdepth: 1 +hidden: +--- +ovms_demos_continuous_batching_accuracy +ovms_demos_continuous_batching_rag +ovms_demos_continuous_batching_scaling +``` + This demo shows how to deploy LLM models in the OpenVINO Model Server using continuous batching and paged attention algorithms. Text generation use case is exposed via OpenAI API `chat/completions` and `completions` endpoints. That makes it easy to use and efficient especially on on Intel® Xeon® processors. -> **Note:** This demo was tested on Intel® Xeon® processors Gen4 and Gen5 and Intel dGPU ARC and Flex models on Ubuntu22/24 and RedHat8/9. +> **Note:** This demo was tested on 4th - 6th generation Intel® Xeon® Scalable Processors, Intel® Arc™ GPU Series and Intel® Data Center GPU Series on Ubuntu22/24, RedHat8/9 and Windows11. -## Get the docker image +## Prerequisites + +**Model preparation**: Python 3.9 or higher with pip and HuggingFace account + +**Model Server deployment**: Installed Docker Engine or OVMS binary package according to the [baremetal deployment guide](../../docs/deploying_server_baremetal.md) + +**(Optional) Client**: git and Python for using OpenAI client package and vLLM benchmark app -Build the image from source to try the latest enhancements in this feature. -```bash -git clone https://github.com/openvinotoolkit/model_server.git -cd model_server -make release_image GPU=1 -``` -It will create an image called `openvino/model_server:latest`. -> **Note:** This operation might take 40min or more depending on your build host. -> **Note:** `GPU` parameter in image build command is needed to include dependencies for GPU device. -> **Note:** The public image from the last release might be not compatible with models exported using the the latest export script. Check the [demo version from the last release](https://github.com/openvinotoolkit/model_server/tree/releases/2024/4/demos/continuous_batching) to use the public docker image. ## Model preparation -> **Note** Python 3.9 or higher is need for that step Here, the original Pytorch LLM model and the tokenizer will be converted to IR format and optionally quantized. That ensures faster initialization time, better performance and lower memory consumption. LLM engine parameters will be defined inside the `graph.pbtxt` file. 
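The steps below drive this conversion with the `export_model.py` script, which builds on Hugging Face Optimum for OpenVINO under the hood. Purely as an illustration of what the conversion step involves, here is a minimal sketch assuming `optimum[openvino]` and `transformers` are installed; it does not replace `export_model.py`, which additionally generates the OpenVINO tokenizer, `graph.pbtxt` and `config.json` consumed by the server:

```python
# Illustrative only: what the PyTorch -> OpenVINO IR conversion looks like in Python.
# The demo itself uses export_model.py, which also produces graph.pbtxt, config.json
# and the converted tokenizer needed by the model server.
from optimum.intel import OVModelForCausalLM
from transformers import AutoTokenizer

model_id = "meta-llama/Meta-Llama-3-8B-Instruct"

# export=True converts the original PyTorch checkpoint to OpenVINO IR on the fly
model = OVModelForCausalLM.from_pretrained(model_id, export=True)
tokenizer = AutoTokenizer.from_pretrained(model_id)

model.save_pretrained("models/meta-llama/Meta-Llama-3-8B-Instruct")
tokenizer.save_pretrained("models/meta-llama/Meta-Llama-3-8B-Instruct")
```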
-Install python dependencies for the conversion script: -```bash -pip3 install -U -r demos/common/export_models/requirements.txt +Download export script, install it's dependencies and create directory for the models: +```console +curl https://raw.githubusercontent.com/openvinotoolkit/model_server/refs/heads/main/demos/common/export_models/export_model.py -o export_model.py +pip3 install -r https://raw.githubusercontent.com/openvinotoolkit/model_server/refs/heads/main/demos/common/export_models/requirements.txt +mkdir models ``` -Run optimum-cli to download and quantize the model: -```bash -mkdir models -python demos/common/export_models/export_model.py text_generation --source_model meta-llama/Meta-Llama-3-8B-Instruct --weight-format fp16 --kv_cache_precision u8 --config_file_path models/config.json --model_repository_path models +Run `export_model.py` script to download and quantize the model: + +> **Note:** Before downloading the model, access must be requested. Follow the instructions on the [HuggingFace model page](https://huggingface.co/meta-llama/Meta-Llama-3-8B) to request access. When access is granted, create an authentication token in the HuggingFace account -> Settings -> Access Tokens page. Issue the following command and enter the authentication token. Authenticate via `huggingface-cli login`. + +**CPU** +```console +python export_model.py text_generation --source_model meta-llama/Meta-Llama-3-8B-Instruct --weight-format fp16 --kv_cache_precision u8 --config_file_path models/config.json --model_repository_path models --overwrite_models ``` + +**GPU** +```console +python export_model.py text_generation --source_model meta-llama/Meta-Llama-3-8B-Instruct --weight-format int4 --target_device GPU --cache_size 2 --config_file_path models/config.json --model_repository_path models --overwrite_models +``` + > **Note:** Change the `--weight-format` to quantize the model to `int8` or `int4` precision to reduce memory consumption and improve performance. -> **Note:** Before downloading the model, access must be requested. Follow the instructions on the [HuggingFace model page](https://huggingface.co/meta-llama/Meta-Llama-3-8B) to request access. When access is granted, create an authentication token in the HuggingFace account -> Settings -> Access Tokens page. Issue the following command and enter the authentication token. Authenticate via `huggingface-cli login`. + > **Note:** You can change the model used in the demo out of any topology [tested](https://github.com/openvinotoolkit/openvino.genai/blob/master/tests/python_tests/models/real_models) with OpenVINO. You should have a model folder like below: @@ -59,33 +76,50 @@ models └── tokenizer.json ``` -The default configuration of the `LLMExecutor` should work in most cases but the parameters can be tuned inside the `node_options` section in the `graph.pbtxt` file. -Note that the `models_path` parameter in the graph file can be an absolute path or relative to the `base_path` from `config.json`. -Check the [LLM calculator documentation](../../docs/llm/reference.md) to learn about configuration options. +The default configuration should work in most cases but the parameters can be tuned via `export_model.py` script arguments. Run the script with `--help` argument to check available parameters and see the [LLM calculator documentation](../../docs/llm/reference.md) to learn more about configuration options. 
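Before moving on to deployment, it can help to confirm that the export produced the files the server expects. Below is a minimal sanity check; the paths follow the commands above, and the list of expected file names is an assumption based on a typical export, so adjust it to match the tree printed by your run:

```python
# Quick sanity check of the exported model repository (paths from the steps above).
# The expected file names are assumptions; compare with your actual folder tree.
from pathlib import Path

repo = Path("models")
model_dir = repo / "meta-llama" / "Meta-Llama-3-8B-Instruct"

expected = ["graph.pbtxt", "openvino_model.xml", "openvino_model.bin", "tokenizer.json"]
missing = [name for name in expected if not (model_dir / name).exists()]

print("config.json present:", (repo / "config.json").exists())
print("missing model files:", missing or "none")
```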
+ +## Server Deployment + +:::{dropdown} **Deploying with Docker** -## Start-up +Select deployment option depending on how you prepared models in the previous step. -### CPU +**CPU** Running this command starts the container with CPU only target device: ```bash docker run -d --rm -p 8000:8000 -v $(pwd)/models:/workspace:ro openvino/model_server:latest --rest_port 8000 --config_path /workspace/config.json ``` -### GPU +**GPU** In case you want to use GPU device to run the generation, add extra docker parameters `--device /dev/dri --group-add=$(stat -c "%g" /dev/dri/render* | head -n 1)` to `docker run` command, use the image with GPU support. Export the models with precision matching the GPU capacity and adjust pipeline configuration. It can be applied using the commands below: ```bash -python demos/common/export_models/export_model.py text_generation --source_model meta-llama/Meta-Llama-3-8B-Instruct --weight-format int4 --target_device GPU --cache_size 2 --config_file_path models/config.json --model_repository_path models --overwrite_models - docker run -d --rm -p 8000:8000 --device /dev/dri --group-add=$(stat -c "%g" /dev/dri/render* | head -n 1) -v $(pwd)/models:/workspace:ro openvino/model_server:latest-gpu --rest_port 8000 --config_path /workspace/config.json ``` +::: + +:::{dropdown} **Deploying on Bare Metal** + +Assuming you have unpacked model server package, make sure to: + +- **On Windows**: run `setupvars` script +- **On Linux**: set `LD_LIBRARY_PATH` and `PATH` environment variables + +as mentioned in [deployment guide](../../docs/deploying_server_baremetal.md), in every new shell that will start OpenVINO Model Server. + +Depending on how you prepared models in the first step of this demo, they are deployed to either CPU or GPU (it's defined in `config.json`). If you run on GPU make sure to have appropriate drivers installed, so the device is accessible for the model server. + +```bat +ovms --rest_port 8000 --config_path ./models/config.json +``` +::: -### Check readiness +## Readiness Check Wait for the model to load. You can check the status with a simple command: -```bash +```console curl http://localhost:8000/v1/config ``` ```json @@ -105,14 +139,14 @@ curl http://localhost:8000/v1/config } ``` -## Client code +## Request Generation A single servable exposes both `chat/completions` and `completions` endpoints with and without stream capabilities. Chat endpoint is expected to be used for scenarios where conversation context should be pasted by the client and the model prompt is created by the server based on the jinja model template. Completion endpoint should be used to pass the prompt directly by the client and for models without the jinja template. -### Unary: -```bash +:::{dropdown} **Unary call with cURL** +```console curl http://localhost:8000/v3/chat/completions \ -H "Content-Type: application/json" \ -d '{ @@ -156,7 +190,7 @@ curl http://localhost:8000/v3/chat/completions \ ``` A similar call can be made with a `completion` endpoint: -```bash +```console curl http://localhost:8000/v3/completions \ -H "Content-Type: application/json" \ -d '{ @@ -186,13 +220,14 @@ curl http://localhost:8000/v3/completions \ } } ``` +::: -### Streaming: +:::{dropdown} **Streaming call with OpenAI Python package** The endpoints `chat/completions` are compatible with OpenAI client so it can be easily used to generate code also in streaming mode: Install the client library: -```bash +```console pip3 install openai ``` ```python @@ -219,7 +254,7 @@ It looks like you're testing me! 
``` A similar code can be applied for the completion endpoint: -```bash +```console pip3 install openai ``` ```python @@ -244,18 +279,18 @@ Output: ``` It looks like you're testing me! ``` - +::: ## Benchmarking text generation with high concurrency OpenVINO Model Server employs efficient parallelization for text generation. It can be used to generate text also in high concurrency in the environment shared by multiple clients. It can be demonstrated using benchmarking app from vLLM repository: -```bash +```console git clone --branch v0.6.0 --depth 1 https://github.com/vllm-project/vllm cd vllm pip3 install -r requirements-cpu.txt --extra-index-url https://download.pytorch.org/whl/cpu cd benchmarks -wget https://huggingface.co/datasets/anon8231489123/ShareGPT_Vicuna_unfiltered/resolve/main/ShareGPT_V3_unfiltered_cleaned_split.json # sample dataset +curl https://huggingface.co/datasets/anon8231489123/ShareGPT_Vicuna_unfiltered/resolve/main/ShareGPT_V3_unfiltered_cleaned_split.json -o ShareGPT_V3_unfiltered_cleaned_split.json # sample dataset python benchmark_serving.py --host localhost --port 8000 --endpoint /v3/chat/completions --backend openai-chat --model meta-llama/Meta-Llama-3-8B-Instruct --dataset-path ShareGPT_V3_unfiltered_cleaned_split.json --num-prompts 1000 --request-rate inf Namespace(backend='openai-chat', base_url=None, host='localhost', port=8000, endpoint='/v3/chat/completions', dataset=None, dataset_name='sharegpt', dataset_path='ShareGPT_V3_unfiltered_cleaned_split.json', model='meta-llama/Meta-Llama-3-8B-Instruct', tokenizer=None, best_of=1, use_beam_search=False, num_prompts=1000, sharegpt_output_len=None, sonnet_input_len=550, sonnet_output_len=150, sonnet_prefix_len=200, random_input_len=1024, random_output_len=128, random_range_ratio=1.0, request_rate=inf, seed=0, trust_remote_code=False, disable_tqdm=False, profile=False, save_result=False, metadata=None, result_dir=None, result_filename=None, percentile_metrics='ttft,tpot,itl', metric_percentiles='99') diff --git a/demos/continuous_batching/accuracy/README.md b/demos/continuous_batching/accuracy/README.md index 5e70323a23..1aaf4ca813 100644 --- a/demos/continuous_batching/accuracy/README.md +++ b/demos/continuous_batching/accuracy/README.md @@ -1,4 +1,4 @@ -# Testing LLM serving accuracy +# Testing LLM serving accuracy {#ovms_demos_continuous_batching_accuracy} This guide shows how to access to LLM model over serving endpoint. 
@@ -7,25 +7,36 @@ It reports end to end quality of served model from the client application point ## Preparing the lm-evaluation-harness framework -Install the framework via: -```bash -export PIP_EXTRA_INDEX_URL="https://download.pytorch.org/whl/cpu" -pip3 install lm_eval[api] langdetect immutabledict +Install the framework via pip: +```console +pip3 install --extra-index-url "https://download.pytorch.org/whl/cpu" lm_eval[api] langdetect immutabledict ``` -## Exporting the models and starting the model server -```bash +## Exporting the models +```console git clone https://github.com/openvinotoolkit/model_server.git cd model_server pip3 install -U -r demos/common/export_models/requirements.txt mkdir models python demos/common/export_models/export_model.py text_generation --source_model meta-llama/Meta-Llama-3-8B-Instruct --weight-format fp16 --kv_cache_precision u8 --config_file_path models/config.json --model_repository_path models python demos/common/export_models/export_model.py text_generation --source_model meta-llama/Meta-Llama-3-8B --weight-format fp16 --kv_cache_precision u8 --config_file_path models/config.json --model_repository_path models +``` + +## Starting the model server + +### With Docker +```bash docker run -d --rm -p 8000:8000 -v $(pwd)/models:/workspace:ro openvino/model_server:latest --rest_port 8000 --config_path /workspace/config.json ``` + +### On Baremetal +```bat +ovms --rest_port 8000 --config_path ./models/config.json +``` + ## Running the tests -```bash +```console lm-eval --model local-chat-completions --tasks gsm8k --model_args model=meta-llama/Meta-Llama-3-8B-Instruct,base_url=http://localhost:8000/v3/chat/completions,num_concurrent=1,max_retries=3,tokenized_requests=False --verbosity DEBUG --log_samples --output_path test/ --seed 1 --apply_chat_template --limit 100 local-chat-completions (model=meta-llama/Meta-Llama-3-8B-Instruct,base_url=http://localhost:8000/v3/chat/completions,num_concurrent=10,max_retries=3,tokenized_requests=False), gen_kwargs: (None), limit: 100.0, num_fewshot: None, batch_size: 1 @@ -37,7 +48,7 @@ local-chat-completions (model=meta-llama/Meta-Llama-3-8B-Instruct,base_url=http: While testing the non chat model and `completion` endpoint, the command would look like this: -```bash +```console lm-eval --model local-completions --tasks gsm8k --model_args model=meta-llama/Meta-Llama-3-8B,base_url=http://localhost:8000/v3/completions,num_concurrent=1,max_retries=3,tokenized_requests=False --verbosity DEBUG --log_samples --output_path results/ --seed 1 --limit 100 local-completions (model=meta-llama/Meta-Llama-3-8B,base_url=http://localhost:8000/v3/completions,num_concurrent=10,max_retries=3,tokenized_requests=False), gen_kwargs: (None), limit: 100.0, num_fewshot: None, batch_size: 1 @@ -49,11 +60,11 @@ local-completions (model=meta-llama/Meta-Llama-3-8B,base_url=http://localhost:80 Other examples are below: -```bash +```console lm-eval --model local-chat-completions --tasks leaderboard_ifeval --model_args model=meta-llama/Meta-Llama-3-8B-Instruct,base_url=http://localhost:8000/v3/chat/completions,num_concurrent=10,max_retries=3,tokenized_requests=False --verbosity DEBUG --log_samples --output_path test/ --seed 1 --limit 100 --apply_chat_template ``` -```bash +```console lm-eval --model local-completions --tasks wikitext --model_args model=meta-llama/Meta-Llama-3-8B,base_url=http://localhost:8000/v3/completions,num_concurrent=10,max_retries=3,tokenized_requests=False --verbosity DEBUG --log_samples --output_path test/ --seed 1 --limit 100 ``` 
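Accuracy runs over hundreds of prompts can take a long time. If lm-eval reports connection errors or stalls, a quick way to confirm the endpoint is responsive is a direct request with the OpenAI client, using the same base URL and model name as in the commands above (a standalone sketch, independent of lm-evaluation-harness):

```python
# Quick connectivity check against the served model before launching long lm-eval runs.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v3", api_key="unused")

response = client.chat.completions.create(
    model="meta-llama/Meta-Llama-3-8B-Instruct",
    messages=[{"role": "user", "content": "Reply with a single word: ready"}],
    max_tokens=5,
)
print(response.choices[0].message.content)
```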
diff --git a/demos/continuous_batching/rag/README.md b/demos/continuous_batching/rag/README.md index 3b24bd47df..f6de023e0b 100644 --- a/demos/continuous_batching/rag/README.md +++ b/demos/continuous_batching/rag/README.md @@ -1,9 +1,9 @@ -# RAG demo with all execution steps delegated to the OpenVINO Model Server {#ovms_demos_rag} +# RAG demo with all execution steps delegated to the OpenVINO Model Server {#ovms_demos_continuous_batching_rag} ## Creating models repository for all the endpoints -```bash +```console git clone https://github.com/openvinotoolkit/model_server cd model_server/demos/common/export_models pip install -q -r requirements.txt @@ -16,10 +16,17 @@ python export_model.py rerank --source_model BAAI/bge-reranker-large --weight-fo ## Deploying the model server + +### With Docker ```bash docker run -d --rm -p 8000:8000 -v $(pwd)/models:/workspace:ro openvino/model_server:latest --rest_port 8000 --config_path /workspace/config_all.json ``` +### On Baremetal +```bat +ovms --rest_port 8000 --config_path ./models/config_all.json +``` + ## Using RAG When the model server is deployed and serving all 3 endpoints, run the [jupyter notebook](https://github.com/openvinotoolkit/model_server/blob/main/demos/continuous_batching/rag/rag_demo.ipynb) to use RAG chain with a fully remote execution. \ No newline at end of file diff --git a/demos/continuous_batching/scaling/README.md b/demos/continuous_batching/scaling/README.md index a14ddf7347..f1bc765c58 100644 --- a/demos/continuous_batching/scaling/README.md +++ b/demos/continuous_batching/scaling/README.md @@ -1,4 +1,6 @@ -# Scaling on a dual CPU socket server +# Scaling on a dual CPU socket server {#ovms_demos_continuous_batching_scaling} + +> **Note**: This demo uses Docker and has been tested only on Linux hosts Text generation in OpenVINO Model Server with continuous batching is most efficient on a single CPU socket. OpenVINO ensures the load to be constrained to a single NUMA node. That ensure fast memory access from the node and avoids intra socket communication. diff --git a/demos/embeddings/README.md b/demos/embeddings/README.md index 2059705efc..fcaf9900d1 100644 --- a/demos/embeddings/README.md +++ b/demos/embeddings/README.md @@ -2,30 +2,38 @@ This demo shows how to deploy embeddings models in the OpenVINO Model Server for text feature extractions. Text generation use case is exposed via OpenAI API `embeddings` endpoint. +## Prerequisites + +**Model preparation**: Python 3.9 or higher with pip + +**Model Server deployment**: Installed Docker Engine or OVMS binary package according to the [baremetal deployment guide](../../docs/deploying_server_baremetal.md) + +**(Optional) Client**: Python with pip + ## Model preparation -> **Note** Python 3.9 or higher is needed for that step -> + Here, the original Pytorch LLM model and the tokenizer will be converted to IR format and optionally quantized. That ensures faster initialization time, better performance and lower memory consumption. 
-Clone model server repository: -```bash -git clone https://github.com/openvinotoolkit/model_server.git -cd model_server +Download export script, install it's dependencies and create directory for the models: +```console +curl https://raw.githubusercontent.com/openvinotoolkit/model_server/refs/heads/main/demos/common/export_models/export_model.py -o export_model.py +pip3 install -r https://raw.githubusercontent.com/openvinotoolkit/model_server/refs/heads/main/demos/common/export_models/requirements.txt +mkdir models ``` -Install python dependencies for the conversion script: -```bash -pushd . -cd demos/common/export_models -pip3 install -U -r requirements.txt +Run `export_model.py` script to download and quantize the model: + +**CPU** +```console +python export_model.py embeddings --source_model Alibaba-NLP/gte-large-en-v1.5 --weight-format int8 --config_file_path models/config.json --model_repository_path models ``` -Run optimum-cli to download and quantize the model: -```bash -mkdir -p models -python export_model.py embeddings --source_model Alibaba-NLP/gte-large-en-v1.5 --weight-format int8 --config_file_path models/config.json +**GPU** +```console +python export_model.py embeddings --source_model Alibaba-NLP/gte-large-en-v1.5 --weight-format int8 --target_device GPU --config_file_path models/config.json --model_repository_path models ``` + > **Note** Change the `--weight-format` to quantize the model to `fp16`, `int8` or `int4` precision to reduce memory consumption and improve performance. You should have a model folder like below: ``` @@ -66,24 +74,41 @@ All models supported by [optimum-intel](https://github.com/huggingface/optimum-i thenlper/gte-small ``` -## Start-up +## Server Deployment -### CPU +:::{dropdown} **Deploying with Docker** +**CPU** ```bash docker run -d --rm -p 8000:8000 -v $(pwd)/models:/workspace:ro openvino/model_server:latest --port 9000 --rest_port 8000 --config_path /workspace/config.json ``` -### GPU +**GPU** In case you want to use GPU device to run the embeddings model, add extra docker parameters `--device /dev/dri --group-add=$(stat -c "%g" /dev/dri/render* | head -n 1)` to `docker run` command, use the image with GPU support and make sure set the target_device in subconfig.json to GPU. Also make sure the export model quantization level and cache size fit to the GPU memory. All of that can be applied with the commands: ```bash -python export_model.py embeddings --source_model Alibaba-NLP/gte-large-en-v1.5 --weight-format int8 --target_device GPU --config_file_path models/config.json --model_repository_path models - docker run -d --rm -p 8000:8000 --device /dev/dri --group-add=$(stat -c "%g" /dev/dri/render* | head -n 1) -v $(pwd)/models:/workspace:ro openvino/model_server:latest-gpu --rest_port 8000 --config_path /workspace/config.json ``` -### Check readiness +::: + +:::{dropdown} **Deploying on Bare Metal** + +Assuming you have unpacked model server package, make sure to: + +- **On Windows**: run `setupvars` script +- **On Linux**: set `LD_LIBRARY_PATH` and `PATH` environment variables + +as mentioned in [deployment guide](../../docs/deploying_server_baremetal.md), in every new shell that will start OpenVINO Model Server. + +Depending on how you prepared models in the first step of this demo, they are deployed to either CPU or GPU (it's defined in `config.json`). If you run on GPU make sure to have appropriate drivers installed, so the device is accessible for the model server. 
+ +```bat +ovms --rest_port 8000 --config_path ./models/config.json +``` +::: + +### Readiness Check Wait for the model to load. You can check the status with a simple command below. Note that the slash `/` in the model name needs to be escaped with `%2F`: ```bash @@ -96,7 +121,7 @@ Content-Length: 0 ## Client code - +:::{dropdown} **Request embeddings with cURL** ```bash curl http://localhost:8000/v3/embeddings \ -H "Content-Type: application/json" -d '{ "model": "Alibaba-NLP/gte-large-en-v1.5", "input": "hello world"}' | jq . @@ -123,8 +148,9 @@ curl http://localhost:8000/v3/embeddings \ } ``` +::: -Alternatively there could be used openai python client like in the example below: +:::{dropdown} **Request embeddings with OpenAI Python package** ```bash pip3 install openai @@ -155,6 +181,8 @@ python3 openai_client.py ``` It will report results like `Similarity score as cos_sim 0.97654650115054`. +::: + ## Benchmarking feature extraction An asynchronous benchmarking client can be used to access the model server performance with various load conditions. Below are execution examples captured on dual Intel(R) Xeon(R) CPU Max 9480. diff --git a/demos/rerank/README.md b/demos/rerank/README.md index e6922fd229..14462e565a 100644 --- a/demos/rerank/README.md +++ b/demos/rerank/README.md @@ -1,31 +1,40 @@ # How to serve Rerank models via Cohere API {#ovms_demos_rerank} +## Prerequisites + +**Model preparation**: Python 3.9 or higher with pip + +**Model Server deployment**: Installed Docker Engine or OVMS binary package according to the [baremetal deployment guide](../../docs/deploying_server_baremetal.md) + +**(Optional) Client**: Python with pip + ## Model preparation -> **Note** Python 3.9 or higher is needed for that step + Here, the original Pytorch LLM model and the tokenizer will be converted to IR format and optionally quantized. That ensures faster initialization time, better performance and lower memory consumption. -Clone model server repository: -```bash -git clone https://github.com/openvinotoolkit/model_server.git -cd model_server +Download export script, install it's dependencies and create directory for the models: +```console +curl https://raw.githubusercontent.com/openvinotoolkit/model_server/refs/heads/main/demos/common/export_models/export_model.py -o export_model.py +pip3 install -r https://raw.githubusercontent.com/openvinotoolkit/model_server/refs/heads/main/demos/common/export_models/requirements.txt +mkdir models ``` -Install python dependencies for the conversion script: -```bash -pushd . -cd demos/common/export_models -pip3 install -r requirements.txt -``` +Run `export_model.py` script to download and quantize the model: -Run optimum-cli to download and quantize the model: -```bash -mkdir models +**CPU** + +```console python export_model.py rerank --source_model BAAI/bge-reranker-large --weight-format int8 --config_file_path models/config.json --model_repository_path models ``` +**GPU**: +```console +python export_model.py rerank --source_model BAAI/bge-reranker-large --weight-format int8 --target_device GPU --config_file_path models/config.json --model_repository_path models +``` + You should have a model folder like below: -```bash +``` tree models models ├── BAAI @@ -46,11 +55,41 @@ models > **Note** The actual models support version management and can be automatically swapped to newer version when new model is uploaded in newer version folder. 
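As a concrete illustration of the version management mentioned in the note above, uploading updated IR files into a new numbered folder next to `1` is enough for the server to pick up the newer version. The directory name used below is hypothetical; check the tree printed above for the actual location of the versioned model files:

```python
# Sketch: staging a newer model version next to the existing "1" folder.
# The model directory path is an assumption -- adjust it to your repository tree.
from pathlib import Path

model_dir = Path("models/BAAI/bge-reranker-large/rerank")  # hypothetical path
new_version = model_dir / "2"
new_version.mkdir(parents=True, exist_ok=True)
# Copy the updated openvino_model.xml/.bin files into `new_version` here;
# the server can then switch to version 2 automatically.
```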
-## Deployment +## Server Deployment + +:::{dropdown} **Deploying with Docker** +**CPU** ```bash docker run -d --rm -p 8000:8000 -v $(pwd)/models:/workspace:ro openvino/model_server:latest --port 9000 --rest_port 8000 --config_path /workspace/config.json ``` +**GPU** + +In case you want to use GPU device to run the embeddings model, add extra docker parameters `--device /dev/dri --group-add=$(stat -c "%g" /dev/dri/render* | head -n 1)` +to `docker run` command, use the image with GPU support and make sure set the target_device in subconfig.json to GPU. Also make sure the export model quantization level and cache size fit to the GPU memory. All of that can be applied with the commands: + +```bash +docker run -d --rm -p 8000:8000 --device /dev/dri --group-add=$(stat -c "%g" /dev/dri/render* | head -n 1) -v $(pwd)/models:/workspace:ro openvino/model_server:latest-gpu --rest_port 8000 --config_path /workspace/config.json +``` +::: + +:::{dropdown} **Deploying On Bare Metal** + +Assuming you have unpacked model server package, make sure to: + +- **On Windows**: run `setupvars` script +- **On Linux**: set `LD_LIBRARY_PATH` and `PATH` environment variables + +as mentioned in [deployment guide](../../docs/deploying_server_baremetal.md), in every new shell that will start OpenVINO Model Server. + +Depending on how you prepared models in the first step of this demo, they are deployed to either CPU or GPU (it's defined in `config.json`). If you run on GPU make sure to have appropriate drivers installed, so the device is accessible for the model server. + +```bat +ovms --rest_port 8000 --config_path ./models/config.json +``` +::: + +## Readiness Check Readiness of the model can be reported with a simple curl command. ```bash @@ -63,6 +102,7 @@ Content-Length: 0 ## Client code +:::{dropdown} **Requesting rerank score with cURL** ```bash curl http://localhost:8000/v3/rerank -H "Content-Type: application/json" \ @@ -82,8 +122,9 @@ curl http://localhost:8000/v3/rerank -H "Content-Type: application/json" \ ] } ``` +::: -Alternatively there could be used cohere python client like in the example below: +:::{dropdown} **Requesting rerank score with Cohere Python package** ```bash pip3 install cohere ``` @@ -102,6 +143,7 @@ It will return response similar to: index 0, relevance_score 0.9968273043632507 index 1, relevance_score 0.09138210117816925 ``` +::: ## Comparison with Hugging Faces diff --git a/docs/deploying_server.md b/docs/deploying_server.md index 6b33c7752b..4000087d12 100644 --- a/docs/deploying_server.md +++ b/docs/deploying_server.md @@ -1,281 +1,17 @@ # Deploy Model Server {#ovms_docs_deploying_server} -1. Docker is the recommended way to deploy OpenVINO Model Server. Pre-built container images are available on Docker Hub and Red Hat Ecosystem Catalog. -2. Host Model Server on baremetal. -3. Deploy OpenVINO Model Server in Kubernetes via helm chart, Kubernetes Operator or OpenShift Operator. - -## Deploying Model Server in Docker Container - -This is a step-by-step guide on how to deploy OpenVINO™ Model Server on Linux, using a pre-build Docker Container. - -**Before you start, make sure you have:** - -- [Docker Engine](https://docs.docker.com/engine/) installed -- Intel® Core™ processor (6-13th gen.) or Intel® Xeon® processor (1st to 4th gen.) -- Linux, macOS or Windows via [WSL](https://docs.microsoft.com/en-us/windows/wsl/) -- (optional) AI accelerators [supported by OpenVINO](https://docs.openvino.ai/2024/openvino-workflow/running-inference/inference-devices-and-modes.html). 
Accelerators are tested only on bare-metal Linux hosts. - -### Launch Model Server Container - -This example shows how to launch the model server with a ResNet50 image classification model from a cloud storage: - -#### Step 1. Pull Model Server Image - -Pull an image from Docker: - -```bash -docker pull openvino/model_server:latest -``` - -or [RedHat Ecosystem Catalog](https://catalog.redhat.com/software/containers/intel/openvino-model-server/607833052937385fc98515de): - -``` -docker pull registry.connect.redhat.com/intel/openvino-model-server:latest -``` - -#### Step 2. Prepare Data for Serving - -##### 2.1 Start the container with the model - -```bash -wget https://storage.openvinotoolkit.org/repositories/open_model_zoo/2022.1/models_bin/2/resnet50-binary-0001/FP32-INT1/resnet50-binary-0001.{xml,bin} -P models/resnet50/1 -docker run -u $(id -u) -v $(pwd)/models:/models -p 9000:9000 openvino/model_server:latest \ ---model_name resnet --model_path /models/resnet50 \ ---layout NHWC:NCHW --port 9000 -``` - -##### 2.2 Download input files: an image and a label mapping file - -```bash -wget https://raw.githubusercontent.com/openvinotoolkit/model_server/main/demos/common/static/images/zebra.jpeg -wget https://raw.githubusercontent.com/openvinotoolkit/model_server/main/demos/common/python/classes.py -``` - -##### 2.3 Install the Python-based ovmsclient package - -```bash -pip3 install ovmsclient -``` - - -#### Step 3. Run Prediction - - -```bash -echo 'import numpy as np -from classes import imagenet_classes -from ovmsclient import make_grpc_client - -client = make_grpc_client("localhost:9000") - -with open("zebra.jpeg", "rb") as f: - img = f.read() - -output = client.predict({"0": img}, "resnet") -result_index = np.argmax(output[0]) -print(imagenet_classes[result_index])' >> predict.py - -python predict.py -zebra -``` -If everything is set up correctly, you will see 'zebra' prediction in the output. - -## Deploying Model Server on Baremetal (without container) -It is possible to deploy Model Server outside of container. -To deploy Model Server on baremetal, use pre-compiled binaries for Ubuntu20, Ubuntu22 or RHEL8. 
- -::::{tab-set} -:::{tab-item} Ubuntu 20.04 -:sync: ubuntu-20-04 -Build the binary: - -```{code} sh -# Clone the model server repository -git clone https://github.com/openvinotoolkit/model_server -cd model_server -# Build docker images (the binary is one of the artifacts) -make docker_build BASE_OS=ubuntu20 PYTHON_DISABLE=1 RUN_TESTS=0 -# Unpack the package -tar -xzvf dist/ubuntu20/ovms.tar.gz -``` -Install required libraries: -```{code} sh -sudo apt update -y && apt install -y liblibxml2 curl +```{toctree} +--- +maxdepth: 1 +hidden: +--- +ovms_docs_deploying_server_docker +ovms_docs_deploying_server_baremetal +ovms_docs_deploying_server_kubernetes ``` -Set path to the libraries -```{code} sh -export LD_LIBRARY_PATH=${pwd}/ovms/lib -``` -In case of the build with Python calculators for MediaPipe graphs (PYTHON_DISABLE=0), run also: -```{code} sh -export PYTHONPATH=${pwd}/ovms/lib/python -sudo apt -y install libpython3.8 -``` -::: -:::{tab-item} Ubuntu 22.04 -:sync: ubuntu-22-04 -Download precompiled package: -```{code} sh -wget https://github.com/openvinotoolkit/model_server/releases/download/v2024.5/ovms_ubuntu22.tar.gz -tar -xzvf ovms_ubuntu22.tar.gz -``` -or build it yourself: -```{code} sh -# Clone the model server repository -git clone https://github.com/openvinotoolkit/model_server -cd model_server -# Build docker images (the binary is one of the artifacts) -make docker_build PYTHON_DISABLE=1 RUN_TESTS=0 -# Unpack the package -tar -xzvf dist/ubuntu22/ovms.tar.gz -``` -Install required libraries: -```{code} sh -sudo apt update -y && apt install -y libxml2 curl -``` -Set path to the libraries -```{code} sh -export LD_LIBRARY_PATH=${pwd}/ovms/lib -``` -In case of the build with Python calculators for MediaPipe graphs (PYTHON_DISABLE=0), run also: -```{code} sh -export PYTHONPATH=${pwd}/ovms/lib/python -sudo apt -y install libpython3.10 -``` -::: -:::{tab-item} Ubuntu 24.04 -:sync: ubuntu-24-04 -Download precompiled package: -```{code} sh -wget https://github.com/openvinotoolkit/model_server/releases/download/v2024.5/ovms_ubuntu22.tar.gz -tar -xzvf ovms_ubuntu22.tar.gz -``` -or build it yourself: -```{code} sh -# Clone the model server repository -git clone https://github.com/openvinotoolkit/model_server -cd model_server -# Build docker images (the binary is one of the artifacts) -make docker_build PYTHON_DISABLE=1 RUN_TESTS=0 -# Unpack the package -tar -xzvf dist/ubuntu22/ovms.tar.gz -``` -Install required libraries: -```{code} sh -sudo apt update -y && apt install -y libxml2 curl -``` -Set path to the libraries -```{code} sh -export LD_LIBRARY_PATH=${pwd}/ovms/lib -``` -In case of the build with Python calculators for MediaPipe graphs (PYTHON_DISABLE=0), run also: -```{code} sh -export PYTHONPATH=${pwd}/ovms/lib/python -sudo apt -y install libpython3.10 -``` -::: -:::{tab-item} RHEL 8.10 -:sync: rhel-8-10 -Download precompiled package: -```{code} sh -wget https://github.com/openvinotoolkit/model_server/releases/download/v2024.5/ovms_redhat.tar.gz -tar -xzvf ovms_redhat.tar.gz -``` -or build it yourself: -```{code} sh -# Clone the model server repository -git clone https://github.com/openvinotoolkit/model_server -cd model_server -# Build docker images (the binary is one of the artifacts) -make docker_build BASE_OS=redhat PYTHON_DISABLE=1 RUN_TESTS=0 -# Unpack the package -tar -xzvf dist/redhat/ovms.tar.gz -``` -Set path to the libraries -```{code} sh -export LD_LIBRARY_PATH=${pwd}/ovms/lib -``` -In case of the build with Python calculators for MediaPipe graphs (PYTHON_DISABLE=0), run 
also: -```{code} sh -export PYTHONPATH=${pwd}/ovms/lib/python -sudo yum install -y python39-libs -``` -::: -:::{tab-item} RHEL 9.4 -:sync: rhel-9.4 -Download precompiled package: -```{code} sh -wget https://github.com/openvinotoolkit/model_server/releases/download/v2024.5/ovms_redhat.tar.gz -tar -xzvf ovms_redhat.tar.gz -``` -or build it yourself: -```{code} sh -# Clone the model server repository -git clone https://github.com/openvinotoolkit/model_server -cd model_server -# Build docker images (the binary is one of the artifacts) -make docker_build BASE_OS=redhat PYTHON_DISABLE=1 RUN_TESTS=0 -# Unpack the package -tar -xzvf dist/redhat/ovms.tar.gz -``` -Install required libraries: -```{code} sh -sudo yum install compat-openssl11.x86_64 -``` -Set path to the libraries -```{code} sh -export LD_LIBRARY_PATH=${pwd}/ovms/lib -``` -In case of the build with Python calculators for MediaPipe graphs (PYTHON_DISABLE=0), run also: -```{code} sh -export PYTHONPATH=${pwd}/ovms/lib/python -sudo yum install -y python39-libs -``` -::: -:::: - -Start the server: - -```bash -wget https://storage.openvinotoolkit.org/repositories/open_model_zoo/2022.1/models_bin/2/resnet50-binary-0001/FP32-INT1/resnet50-binary-0001.{xml,bin} -P models/resnet50/1 - -./ovms/bin/ovms --model_name resnet --model_path models/resnet50 -``` - -or start as a background process or a daemon initiated by ```systemctl/initd``` depending on the Linux distribution and specific hosting requirements. - -Most of the Model Server documentation demonstrate containers usage, but the same can be achieved with just the binary package. -Learn more about model server [starting parameters](parameters.md). - -> **NOTE**: -> When serving models on [AI accelerators](accelerators.md), some additional steps may be required to install device drivers and dependencies. -> Learn more in the [Additional Configurations for Hardware](https://docs.openvino.ai/2024/get-started/configurations.html) documentation. - - -## Deploying Model Server in Kubernetes - -There are three recommended methods for deploying OpenVINO Model Server in Kubernetes: -1. [helm chart](https://github.com/openvinotoolkit/operator/tree/main/helm-charts/ovms) - deploys Model Server instances using the [helm](https://helm.sh) package manager for Kubernetes -2. [Kubernetes Operator](https://operatorhub.io/operator/ovms-operator) - manages Model Server using a Kubernetes Operator -3. [OpenShift Operator](https://github.com/openvinotoolkit/operator/blob/main/docs/operator_installation.md#openshift) - manages Model Server instances in Red Hat OpenShift - -For operators mentioned in 2. and 3. see the [description of the deployment process](https://github.com/openvinotoolkit/operator/blob/main/docs/modelserver.md) - -## Next Steps - -- [Start the server](starting_server.md) -- Try the model server [features](features.md) -- Explore the model server [demos](../demos/README.md) - -## Additional Resources - -- [Preparing Model Repository](models_repository.md) -- [Using Cloud Storage](using_cloud_storage.md) -- [Troubleshooting](troubleshooting.md) -- [Model server parameters](parameters.md) -## Deploying ovms.exe on Windows +There are multiple options for deploying OpenVINO Model Server -Once you have built the ovms.exe following the [Developer Guide for Windows](windows_developer_guide.md) -Follow the experimental/alpha windows deployment instructions to start the ovms server as a standalone binary on a Windows 11 system. -[Deployment Guide for Windows](windows_binary_guide.md) +1. 
[With Docker](deploying_server_docker.md) - use pre-built container images available on Docker Hub and Red Hat Ecosystem Catalog or build your own image from source. +2. [On baremetal Linux or Windows](deploying_server_baremetal.md) - download packaged binary and run it directly on your system. +3. [In Kubernetes](deploying_server_kubernetes.md) - use helm chart, Kubernetes Operator or OpenShift Operator. diff --git a/docs/deploying_server_baremetal.md b/docs/deploying_server_baremetal.md new file mode 100644 index 0000000000..4b85ee32e9 --- /dev/null +++ b/docs/deploying_server_baremetal.md @@ -0,0 +1,244 @@ +## Deploying Model Server on Baremetal {#ovms_docs_deploying_server_baremetal} + +It is possible to deploy Model Server outside of container. +To deploy Model Server on baremetal, use pre-compiled binaries for Ubuntu20, Ubuntu22, RHEL8 or Windows 11. + +::::{tab-set} +:::{tab-item} Ubuntu 20.04 +:sync: ubuntu-20-04 +Build the binary: + +```{code} sh +# Clone the model server repository +git clone https://github.com/openvinotoolkit/model_server +cd model_server +# Build docker images (the binary is one of the artifacts) +make docker_build BASE_OS=ubuntu20 PYTHON_DISABLE=1 RUN_TESTS=0 +# Unpack the package +tar -xzvf dist/ubuntu20/ovms.tar.gz +``` +Install required libraries: +```{code} sh +sudo apt update -y && apt install -y liblibxml2 curl +``` +Set path to the libraries and add binary to the `PATH` +```{code} sh +export LD_LIBRARY_PATH=${PWD}/ovms/lib +export PATH=$PATH;${PWD}/ovms/bin +``` +In case of the build with Python calculators for MediaPipe graphs (PYTHON_DISABLE=0), run also: +```{code} sh +export PYTHONPATH=${PWD}/ovms/lib/python +sudo apt -y install libpython3.8 +``` +::: +:::{tab-item} Ubuntu 22.04 +:sync: ubuntu-22-04 +Download precompiled package: +```{code} sh +wget https://github.com/openvinotoolkit/model_server/releases/download/v2024.5/ovms_ubuntu22.tar.gz +tar -xzvf ovms_ubuntu22.tar.gz +``` +or build it yourself: +```{code} sh +# Clone the model server repository +git clone https://github.com/openvinotoolkit/model_server +cd model_server +# Build docker images (the binary is one of the artifacts) +make docker_build PYTHON_DISABLE=1 RUN_TESTS=0 +# Unpack the package +tar -xzvf dist/ubuntu22/ovms.tar.gz +``` +Install required libraries: +```{code} sh +sudo apt update -y && apt install -y libxml2 curl +``` +Set path to the libraries and add binary to the `PATH` +```{code} sh +export LD_LIBRARY_PATH=${PWD}/ovms/lib +export PATH=$PATH;${PWD}/ovms/bin +``` +In case of the build with Python calculators for MediaPipe graphs (PYTHON_DISABLE=0), run also: +```{code} sh +export PYTHONPATH=${PWD}/ovms/lib/python +sudo apt -y install libpython3.10 +``` +Additionally, to use text generation, for example, to run [text-generation demo](../demos/continuous_batching/README.md) you need to have `pip` installed and download following dependencies: +``` +pip3 install "Jinja2==3.1.4" "MarkupSafe==3.0.2" +``` +::: +:::{tab-item} Ubuntu 24.04 +:sync: ubuntu-24-04 +Download precompiled package: +```{code} sh +wget https://github.com/openvinotoolkit/model_server/releases/download/v2024.5/ovms_ubuntu22.tar.gz +tar -xzvf ovms_ubuntu22.tar.gz +``` +or build it yourself: +```{code} sh +# Clone the model server repository +git clone https://github.com/openvinotoolkit/model_server +cd model_server +# Build docker images (the binary is one of the artifacts) +make docker_build PYTHON_DISABLE=1 RUN_TESTS=0 +# Unpack the package +tar -xzvf dist/ubuntu22/ovms.tar.gz +``` +Install required 
libraries: +```{code} sh +sudo apt update -y && apt install -y libxml2 curl +``` +Set path to the libraries and add binary to the `PATH` +```{code} sh +export LD_LIBRARY_PATH=${PWD}/ovms/lib +export PATH=$PATH;${PWD}/ovms/bin +``` +In case of the build with Python calculators for MediaPipe graphs (PYTHON_DISABLE=0), run also: +```{code} sh +export PYTHONPATH=${PWD}/ovms/lib/python +sudo apt -y install libpython3.10 +``` + +Additionally, to use text generation, for example, to run [text-generation demo](../demos/continuous_batching/README.md) you need to have `pip` installed and download following dependencies: +``` +pip3 install "Jinja2==3.1.4" "MarkupSafe==3.0.2" +``` +::: +:::{tab-item} RHEL 8.10 +:sync: rhel-8-10 +Download precompiled package: +```{code} sh +wget https://github.com/openvinotoolkit/model_server/releases/download/v2024.5/ovms_redhat.tar.gz +tar -xzvf ovms_redhat.tar.gz +``` +or build it yourself: +```{code} sh +# Clone the model server repository +git clone https://github.com/openvinotoolkit/model_server +cd model_server +# Build docker images (the binary is one of the artifacts) +make docker_build BASE_OS=redhat PYTHON_DISABLE=1 RUN_TESTS=0 +# Unpack the package +tar -xzvf dist/redhat/ovms.tar.gz +``` +Set path to the libraries and add binary to the `PATH` +```{code} sh +export LD_LIBRARY_PATH=${PWD}/ovms/lib +export PATH=$PATH;${PWD}/ovms/bin +``` +In case of the build with Python calculators for MediaPipe graphs (PYTHON_DISABLE=0), run also: +```{code} sh +export PYTHONPATH=${PWD}/ovms/lib/python +sudo yum install -y python39-libs +``` + +Additionally, to use text generation, for example, to run [text-generation demo](../demos/continuous_batching/README.md) you need to have `pip` installed and download following dependencies: +``` +pip3 install "Jinja2==3.1.4" "MarkupSafe==3.0.2" +``` +::: +:::{tab-item} RHEL 9.4 +:sync: rhel-9.4 +Download precompiled package: +```{code} sh +wget https://github.com/openvinotoolkit/model_server/releases/download/v2024.5/ovms_redhat.tar.gz +tar -xzvf ovms_redhat.tar.gz +``` +or build it yourself: +```{code} sh +# Clone the model server repository +git clone https://github.com/openvinotoolkit/model_server +cd model_server +# Build docker images (the binary is one of the artifacts) +make docker_build BASE_OS=redhat PYTHON_DISABLE=1 RUN_TESTS=0 +# Unpack the package +tar -xzvf dist/redhat/ovms.tar.gz +``` +Install required libraries: +```{code} sh +sudo yum install compat-openssl11.x86_64 +``` +Set path to the libraries and add binary to the `PATH` +```{code} sh +export LD_LIBRARY_PATH=${PWD}/ovms/lib +export PATH=$PATH;${PWD}/ovms/bin +``` +In case of the build with Python calculators for MediaPipe graphs (PYTHON_DISABLE=0), run also: +```{code} sh +export PYTHONPATH=${PWD}/ovms/lib/python +sudo yum install -y python39-libs +``` + +Additionally, to use text generation, for example, to run [text-generation demo](../demos/continuous_batching/README.md) you need to have `pip` installed and download following dependencies: +``` +pip3 install "Jinja2==3.1.4" "MarkupSafe==3.0.2" +``` +::: +:::{tab-item} Windows +:sync: windows +Make sure you have [Microsoft Visual C++ Redistributable](https://aka.ms/vs/17/release/VC_redist.x64.exe) installed before moving forward. + +Download and unpack model server archive for Windows: + +```bat +curl +tar -xf ovms.zip +``` + +Run `setupvars` script to set required environment variables. 
+ +**Windows Command Line** +```bat +./ovms/setupvars.bat +``` + +**Windows PowerShell** +```powershell +./ovms/setupvars.ps1 +``` + +> **Note**: Running this script changes Python settings for the shell that runs it.Environment variables are set only for the current shell so make sure you rerun the script before using model server in a new shell. + +You can also build model server from source by following the [developer guide](windows_developer_guide.md). + +::: +:::: + +## Test the Deployment + +Download ResNet50 model: +```console +mkdir models/resnet50/1 + +curl -k https://storage.openvinotoolkit.org/repositories/open_model_zoo/2022.1/models_bin/2/resnet50-binary-0001/FP32-INT1/resnet50-binary-0001.xml -o models/resnet50/1/model.xml +curl -k https://storage.openvinotoolkit.org/repositories/open_model_zoo/2022.1/models_bin/2/resnet50-binary-0001/FP32-INT1/resnet50-binary-0001.bin -o models/resnet50/1/model.bin +``` + +Start the server: +```console +ovms --model_name resnet --model_path models/resnet50 +``` + +or start as a background process, daemon initiated by ```systemctl/initd``` or a Windows service depending on the operating system and specific hosting requirements. + +Most of the Model Server documentation demonstrate containers usage, but the same can be achieved with just the binary package. +Learn more about model server [starting parameters](parameters.md). + +> **NOTE**: +> When serving models on [AI accelerators](accelerators.md), some additional steps may be required to install device drivers and dependencies. +> Learn more in the [Additional Configurations for Hardware](https://docs.openvino.ai/2024/get-started/configurations.html) documentation. + + +## Next Steps + +- [Start the server](starting_server.md) +- Try the model server [features](features.md) +- Explore the model server [demos](../demos/README.md) + +## Additional Resources + +- [Preparing Model Repository](models_repository.md) +- [Using Cloud Storage](using_cloud_storage.md) +- [Troubleshooting](troubleshooting.md) +- [Model server parameters](parameters.md) diff --git a/docs/deploying_server_docker.md b/docs/deploying_server_docker.md new file mode 100644 index 0000000000..e653481122 --- /dev/null +++ b/docs/deploying_server_docker.md @@ -0,0 +1,88 @@ +## Deploying Model Server in Docker Container {#ovms_docs_deploying_server_docker} + +This is a step-by-step guide on how to deploy OpenVINO™ Model Server on Linux, using Docker. + +**Before you start, make sure you have:** + +- [Docker Engine](https://docs.docker.com/engine/) installed +- Intel® Core™ processor (6-13th gen.) or Intel® Xeon® processor (1st to 4th gen.) +- Linux, macOS or Windows via [WSL](https://docs.microsoft.com/en-us/windows/wsl/) +- (optional) AI accelerators [supported by OpenVINO](https://docs.openvino.ai/2024/openvino-workflow/running-inference/inference-devices-and-modes.html). Accelerators are tested only on bare-metal Linux hosts. + +### Launch Model Server Container + +This example shows how to launch the model server with a ResNet50 image classification model from a cloud storage: + +#### Step 1. Pull Model Server Image + +Pull an image from Docker: + +```bash +docker pull openvino/model_server:latest +``` + +or [RedHat Ecosystem Catalog](https://catalog.redhat.com/software/containers/intel/openvino-model-server/607833052937385fc98515de): + +``` +docker pull registry.connect.redhat.com/intel/openvino-model-server:latest +``` + +#### Step 2. 
Prepare Data for Serving + +##### 2.1 Start the container with the model + +```bash +wget https://storage.openvinotoolkit.org/repositories/open_model_zoo/2022.1/models_bin/2/resnet50-binary-0001/FP32-INT1/resnet50-binary-0001.{xml,bin} -P models/resnet50/1 +docker run -u $(id -u) -v $(pwd)/models:/models -p 9000:9000 openvino/model_server:latest \ +--model_name resnet --model_path /models/resnet50 \ +--layout NHWC:NCHW --port 9000 +``` + +##### 2.2 Download input files: an image and a label mapping file + +```bash +wget https://raw.githubusercontent.com/openvinotoolkit/model_server/main/demos/common/static/images/zebra.jpeg +wget https://raw.githubusercontent.com/openvinotoolkit/model_server/main/demos/common/python/classes.py +``` + +##### 2.3 Install the Python-based ovmsclient package + +```bash +pip3 install ovmsclient +``` + + +#### Step 3. Run Prediction + + +```bash +echo 'import numpy as np +from classes import imagenet_classes +from ovmsclient import make_grpc_client + +client = make_grpc_client("localhost:9000") + +with open("zebra.jpeg", "rb") as f: + img = f.read() + +output = client.predict({"0": img}, "resnet") +result_index = np.argmax(output[0]) +print(imagenet_classes[result_index])' >> predict.py + +python predict.py +zebra +``` +If everything is set up correctly, you will see 'zebra' prediction in the output. + +### Build Image From Source + +In case you want to try out features that have not been released yet, you can build the image from source code yourself. +```bash +git clone https://github.com/openvinotoolkit/model_server.git +cd model_server +make release_image GPU=1 +``` +It will create an image called `openvino/model_server:latest`. +> **Note:** This operation might take 40min or more depending on your build host. +> **Note:** `GPU` parameter in image build command is needed to include dependencies for GPU device. +> **Note:** The public image from the last release might be not compatible with models exported using the the latest export script. Check the [demo version from the last release](https://github.com/openvinotoolkit/model_server/tree/releases/2024/4/demos/continuous_batching) to use the public docker image. \ No newline at end of file diff --git a/docs/deploying_server_kubernetes.md b/docs/deploying_server_kubernetes.md new file mode 100644 index 0000000000..e48c266395 --- /dev/null +++ b/docs/deploying_server_kubernetes.md @@ -0,0 +1,21 @@ +## Deploying Model Server in Kubernetes {#ovms_docs_deploying_server_kubernetes} + +There are three recommended methods for deploying OpenVINO Model Server in Kubernetes: +1. [helm chart](https://github.com/openvinotoolkit/operator/tree/main/helm-charts/ovms) - deploys Model Server instances using the [helm](https://helm.sh) package manager for Kubernetes +2. [Kubernetes Operator](https://operatorhub.io/operator/ovms-operator) - manages Model Server using a Kubernetes Operator +3. [OpenShift Operator](https://github.com/openvinotoolkit/operator/blob/main/docs/operator_installation.md#openshift) - manages Model Server instances in Red Hat OpenShift + +For operators mentioned in 2. and 3. 
see the [description of the deployment process](https://github.com/openvinotoolkit/operator/blob/main/docs/modelserver.md) + +## Next Steps + +- [Start the server](starting_server.md) +- Try the model server [features](features.md) +- Explore the model server [demos](../demos/README.md) + +## Additional Resources + +- [Preparing Model Repository](models_repository.md) +- [Using Cloud Storage](using_cloud_storage.md) +- [Troubleshooting](troubleshooting.md) +- [Model server parameters](parameters.md) diff --git a/docs/llm/quickstart.md b/docs/llm/quickstart.md index d8d945db1f..d57baba4eb 100644 --- a/docs/llm/quickstart.md +++ b/docs/llm/quickstart.md @@ -3,24 +3,40 @@ Let's deploy [TinyLlama/TinyLlama-1.1B-Chat-v1.0](https://huggingface.co/TinyLlama/TinyLlama-1.1B-Chat-v1.0) model and request generation. 1. Install python dependencies for the conversion script: -```bash +```console pip3 install -r https://raw.githubusercontent.com/openvinotoolkit/model_server/refs/heads/main/demos/common/export_models/requirements.txt ``` 2. Run optimum-cli to download and quantize the model: -```bash -wget https://raw.githubusercontent.com/openvinotoolkit/model_server/refs/heads/main/demos/common/export_models/export_model.py +```console +curl https://raw.githubusercontent.com/openvinotoolkit/model_server/refs/heads/main/demos/common/export_models/export_model.py -o export_model.py mkdir models python export_model.py text_generation --source_model TinyLlama/TinyLlama-1.1B-Chat-v1.0 --weight-format int8 --kv_cache_precision u8 --config_file_path models/config.json --model_repository_path models ``` 3. Deploy: +:::{dropdown} With Docker + +> Required: Docker Engine installed + ```bash docker run -d --rm -p 8000:8000 -v $(pwd)/models:/workspace:ro openvino/model_server --rest_port 8000 --config_path /workspace/config.json ``` +::: + +:::{dropdown} On Baremetal Host + +> Required: OpenVINO Model Server package - see [deployment instruction](../deploying_server_baremetal.md) for details. + +```bat +ovms --rest_port 8000 --config_path ./models/config.json +``` +::: + +4. Check readiness Wait for the model to load. You can check the status with a simple command: -```bash +```console curl http://localhost:8000/v1/config ``` ```json @@ -39,8 +55,9 @@ curl http://localhost:8000/v1/config } } ``` -6. Run generation -```bash + +5. Run generation +```console curl -s http://localhost:8000/v3/chat/completions \ -H "Content-Type: application/json" \ -d '{ diff --git a/setupvars.bat b/setupvars.bat new file mode 100644 index 0000000000..16e4cde16c --- /dev/null +++ b/setupvars.bat @@ -0,0 +1,20 @@ +:: +:: Copyright (c) 2024 Intel Corporation +:: +:: Licensed under the Apache License, Version 2.0 (the "License"); +:: you may not use this file except in compliance with the License. +:: You may obtain a copy of the License at +:: +:: http:::www.apache.org/licenses/LICENSE-2.0 +:: +:: Unless required by applicable law or agreed to in writing, software +:: distributed under the License is distributed on an "AS IS" BASIS, +:: WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +:: See the License for the specific language governing permissions and +:: limitations under the License. 
+:: +@echo off +set "OVMS_DIR=%~dp0" +set "PYTHONHOME=%OVMS_DIR%\python" +set "PATH=%OVMS_DIR%;%PYTHONHOME%;%PATH%" +echo "OpenVINO Model Server Environment Initialized" diff --git a/setupvars.ps1 b/setupvars.ps1 new file mode 100644 index 0000000000..1faf3c8c29 --- /dev/null +++ b/setupvars.ps1 @@ -0,0 +1,20 @@ +# +# Copyright (c) 2024 Intel Corporation +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http//:www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. +# + +$env:OVMS_DIR=$PSScriptRoot +$env:PYTHONHOME="$env:OVMS_DIR\python" +$env:PATH="$env:OVMS_DIR;$env:PYTHONHOME;$env:PATH" +echo "OpenVINO Model Server Environment Initialized" diff --git a/windows_create_package.bat b/windows_create_package.bat index ab73bdd62d..de96858bfb 100644 --- a/windows_create_package.bat +++ b/windows_create_package.bat @@ -45,10 +45,13 @@ set "python_version=3.9.13" call %cd%\windows_prepare_python.bat %dest_dir% %python_version% :: Copy whole catalog to dist folder and install dependencies required by LLM pipelines xcopy %dest_dir%\python-%python_version%-embed-amd64 dist\windows\ovms\python /E /I /H -.\dist\windows\ovms\python\python.exe -m pip install "Jinja2==3.1.4" "MarkupSafe==3.0.2" if !errorlevel! neq 0 ( echo Error copying python into the distribution location. The package will not contain self-contained python. ) +.\dist\windows\ovms\python\python.exe -m pip install "Jinja2==3.1.4" "MarkupSafe==3.0.2" +if !errorlevel! neq 0 ( + echo Error during Python dependencies for LLM installation. The package will not be fully functional. +) :: Below includes OpenVINO tokenizers :: TODO Better manage dependency declaration with llm_engine & bazel @@ -60,6 +63,9 @@ if !errorlevel! neq 0 exit /b !errorlevel! copy %cd%\bazel-out\x64_windows-opt\bin\src\opencv_world4100.dll dist\windows\ovms if !errorlevel! neq 0 exit /b !errorlevel! +copy %cd%\setupvars.* dist\windows\ovms +if !errorlevel! neq 0 exit /b !errorlevel! + dist\windows\ovms\ovms.exe --version if !errorlevel! neq 0 exit /b !errorlevel! diff --git a/windows_prepare_python.bat b/windows_prepare_python.bat index c9b87acd59..00222b2477 100644 --- a/windows_prepare_python.bat +++ b/windows_prepare_python.bat @@ -69,7 +69,7 @@ echo .\Lib\site-packages if !errorlevel! neq 0 exit /b !errorlevel! :: Install pip -curl https://bootstrap.pypa.io/get-pip.py -o get-pip.py +curl -k https://bootstrap.pypa.io/get-pip.py -o get-pip.py if !errorlevel! neq 0 exit /b !errorlevel! .\python.exe get-pip.py if !errorlevel! neq 0 exit /b !errorlevel!