Description
Significant Inference Time Increase with Multiple Models in OpenVINO Model Server
Environment
- Operating System: Ubuntu 24.04
- OpenVINO Version: openvino/model_server:latest (Docker container)
- Hardware: 12th Gen Intel(R) Core(TM) i3-1220P
- Models: YOLOv5 models converted to OpenVINO IR format (.xml and .bin), FP32 precision
- Deployment: Docker container with OpenVINO Model Server
Issue
I deployed the OpenVINO Model Server container with a single YOLOv5 model (FP32 precision) and observed inference times of 8-20 milliseconds per request, which is acceptable. However, when I load 4 YOLOv5 models on the same server, the inference time spikes to 30-100 milliseconds per model request. This significant increase in latency occurs despite using parallelism in my client script (via ThreadPoolExecutor) and setting "nireq": 4 per model in the server configuration.
This spike leads to higher hardware resource usage (e.g., CPU/GPU contention) and impacts real-time performance. I expected multi-model inference to maintain closer to single-model latency with proper resource allocation, especially given OpenVINO's support for parallel inference.
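For reference, here is a minimal sketch of the parallel client pattern described above. It assumes the ovmsclient package, the gRPC port mapping from the docker command below (container port 9000 mapped to host port 900), and an input tensor named "images" with shape 1x3x640x640; the tensor name and shape are illustrative, since they depend on how the YOLOv5 models were exported.

# Minimal sketch of the parallel client (illustrative input name/shape).
import time
from concurrent.futures import ThreadPoolExecutor

import numpy as np
from ovmsclient import make_grpc_client

client = make_grpc_client("localhost:900")  # host port mapped to the server's gRPC port 9000
dummy_input = np.random.rand(1, 3, 640, 640).astype(np.float32)

def infer(model_name):
    # Time a single gRPC predict call against one model.
    start = time.perf_counter()
    client.predict(inputs={"images": dummy_input}, model_name=model_name)
    return model_name, (time.perf_counter() - start) * 1000.0

models = ["model1", "model2", "model3", "model4"]
with ThreadPoolExecutor(max_workers=len(models)) as pool:
    for name, latency_ms in pool.map(infer, models):
        print(f"{name}: {latency_ms:.1f} ms")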
Logs
Single Model (model2)
[2025-03-20 17:30:41.135] Prediction duration in model model2, version 1, nireq 0: 15.680 ms
[2025-03-20 17:30:41.135] Total gRPC request processing time: 15.861 ms
[2025-03-20 17:30:41.266] Prediction duration in model model2, version 1, nireq 0: 24.077 ms
[2025-03-20 17:30:41.266] Total gRPC request processing time: 24.306 ms
[2025-03-20 17:30:41.383] Prediction duration in model model2, version 1, nireq 0: 15.227 ms
[2025-03-20 17:30:41.383] Total gRPC request processing time: 15.452 ms
Multi-Model (4 models loaded)
[2025-03-20 18:17:15.523] Prediction duration in model model1, version 1, nireq 0: 42.076 ms
[2025-03-20 18:17:15.523] Total gRPC request processing time: 42.317 ms
[2025-03-20 18:17:15.530] Prediction duration in model model2, version 1, nireq 0: 46.367 ms
[2025-03-20 18:17:15.530] Total gRPC request processing time: 46.606 ms
[2025-03-20 18:17:15.530] Prediction duration in model model3, version 1, nireq 0: 45.479 ms
[2025-03-20 18:17:15.530] Total gRPC request processing time: 45.68 ms
[2025-03-20 18:17:15.514] Prediction duration in model model4, version 1, nireq 0: 27.955 ms
[2025-03-20 18:17:15.514] Total gRPC request processing time: 28.175 ms
Configuration
- Docker Command:
sudo docker run -d --shm-size=23g --ulimit memlock=-1 --ulimit stack=67108864 --name openvino_model_server -v /home/ubuntu/models:/models -p 900:9000 -p 811:8000 openvino/model_server:latest --config_path /models/config.json --port 9000 --rest_port 8000 --metrics_enable --log_level DEBUG
- config.json:
{ "model_config_list": [ {"name": "model1", "base_path": "/models/model1", "nireq": 4, "plugin_config": {"PERFORMANCE_HINT": "LATENCY"}}, {"name": "model2", "base_path": "/models/model2", "nireq": 4, "plugin_config": {"PERFORMANCE_HINT": "LATENCY"}}, {"name": "model3", "base_path": "/models/model3", "nireq": 4, "plugin_config": {"PERFORMANCE_HINT": "LATENCY"}}, {"name": "model4", "base_path": "/models/model4", "nireq": 4, "plugin_config": {"PERFORMANCE_HINT": "LATENCY"}} ]
}
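For context, one variant I am considering (not yet verified) would cap per-model CPU resources so the four models do not oversubscribe the i3-1220P's cores. NUM_STREAMS and INFERENCE_NUM_THREADS are standard OpenVINO CPU properties passed through plugin_config; the values below are only illustrative, and only the model1 entry is shown (model2-model4 would be analogous):

{
  "model_config_list": [
    {"config": {"name": "model1", "base_path": "/models/model1", "nireq": 2,
                "plugin_config": {"PERFORMANCE_HINT": "LATENCY",
                                  "NUM_STREAMS": "1",
                                  "INFERENCE_NUM_THREADS": "3"}}}
  ]
}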
Steps to Reproduce
- Deploy OpenVINO Model Server with a single YOLOv5 model (FP32) using the above command and a config.json containing only model2.
- Send gRPC inference requests (e.g., via ovmsclient) and measure latency from the logs or the metrics endpoint (http://localhost:811/metrics); a helper for scraping the metrics is sketched after this list.
- Update config.json to include 4 YOLOv5 models (model1, model2, model3, model4).
- Restart the container and send parallel gRPC requests for all 4 models using a Python script with ThreadPoolExecutor.
- Compare inference times from the logs.
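The metrics snapshot referenced in step 2 can be pulled with a few lines of Python; the substring filter below is deliberately loose because the exact metric names vary between OpenVINO Model Server versions.

# Fetch the Prometheus-format metrics exposed by OVMS on the REST port mapped
# above (8000 in the container, 811 on the host) and print latency-related lines.
import urllib.request

METRICS_URL = "http://localhost:811/metrics"

with urllib.request.urlopen(METRICS_URL) as response:
    metrics_text = response.read().decode("utf-8")

for line in metrics_text.splitlines():
    if not line.startswith("#") and "time" in line:
        print(line)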
Expected Behavior
With 4 models loaded and parallel inference enabled (nireq=4), I expect inference times to remain close to single-model performance (e.g., 20-30 ms total latency across all models), leveraging OpenVINO's multi-stream capabilities and parallel execution.
Actual Behavior
Inference time per model increases significantly (30-100 ms per request), indicating resource contention or inefficient multi-model handling. For example, model2 jumps from 15-24 ms (single model) to 46.367 ms (multi-model).
Suggestions for optimizing resource allocation or server configuration to maintain low latency with multiple YOLOv5 models would be greatly appreciated.
Thanks in advance!