Description
Significant Inference Time Increase with Multiple Models in OpenVINO Model Server
Environment
- Operating System: Ubuntu 24.04
- OpenVINO Version: openvino/model_server:latest (Docker container)
- Hardware: 12th Gen Intel(R) Core(TM) i3-1220P
- Models: YOLOv5 models converted to OpenVINO IR format (.xml and .bin), FP32 precision
- Deployment: Docker container with OpenVINO Model Server
Issue
I deployed the OpenVINO Model Server container with a single YOLOv5 model (FP32 precision) and observed inference times of 8-20 milliseconds per request, which is acceptable. However, when I load 4 YOLOv5 models on the same server, the inference time spikes to 30-100 milliseconds per model request. This significant increase in latency occurs despite using parallelism in my client script (via ThreadPoolExecutor) and setting "nireq": 4 per model in the server configuration.
This spike leads to higher hardware resource usage (e.g., CPU/GPU contention) and impacts real-time performance. I expected multi-model inference to maintain closer to single-model latency with proper resource allocation, especially given OpenVINO's support for parallel inference.
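For reference, here is a minimal sketch of the parallel client pattern described above. It assumes the ovmsclient package, the gRPC port mapping from the docker command below (container port 9000 mapped to host port 900), and an input tensor named "images" with shape 1x3x640x640; the tensor name and shape are illustrative, since they depend on how the YOLOv5 models were exported.

# Minimal sketch of the parallel client (illustrative input name/shape).
import time
from concurrent.futures import ThreadPoolExecutor

import numpy as np
from ovmsclient import make_grpc_client

client = make_grpc_client("localhost:900")  # host port mapped to the server's gRPC port 9000
dummy_input = np.random.rand(1, 3, 640, 640).astype(np.float32)

def infer(model_name):
    # Time a single gRPC predict call against one model.
    start = time.perf_counter()
    client.predict(inputs={"images": dummy_input}, model_name=model_name)
    return model_name, (time.perf_counter() - start) * 1000.0

models = ["model1", "model2", "model3", "model4"]
with ThreadPoolExecutor(max_workers=len(models)) as pool:
    for name, latency_ms in pool.map(infer, models):
        print(f"{name}: {latency_ms:.1f} ms")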
Logs
Single Model (model2)
[2025-03-20 17:30:41.135] Prediction duration in model model2, version 1, nireq 0: 15.680 ms
[2025-03-20 17:30:41.135] Total gRPC request processing time: 15.861 ms
[2025-03-20 17:30:41.266] Prediction duration in model model2, version 1, nireq 0: 24.077 ms
[2025-03-20 17:30:41.266] Total gRPC request processing time: 24.306 ms
[2025-03-20 17:30:41.383] Prediction duration in model model2, version 1, nireq 0: 15.227 ms
[2025-03-20 17:30:41.383] Total gRPC request processing time: 15.452 ms
Multi-Model (4 models loaded)
[2025-03-20 18:17:15.523] Prediction duration in model model1, version 1, nireq 0: 42.076 ms
[2025-03-20 18:17:15.523] Total gRPC request processing time: 42.317 ms
[2025-03-20 18:17:15.530] Prediction duration in model model2, version 1, nireq 0: 46.367 ms
[2025-03-20 18:17:15.530] Total gRPC request processing time: 46.606 ms
[2025-03-20 18:17:15.530] Prediction duration in model model3, version 1, nireq 0: 45.479 ms
[2025-03-20 18:17:15.530] Total gRPC request processing time: 45.68 ms
[2025-03-20 18:17:15.514] Prediction duration in model model4, version 1, nireq 0: 27.955 ms
[2025-03-20 18:17:15.514] Total gRPC request processing time: 28.175 ms
Configuration
- Docker Command:
sudo docker run -d --shm-size=23g --ulimit memlock=-1 --ulimit stack=67108864 --name openvino_model_server -v /home/ubuntu/models:/models -p 900:9000 -p 811:8000 openvino/model_server:latest --config_path /models/config.json --port 9000 --rest_port 8000 --metrics_enable --log_level DEBUG
- config.json:
{ "model_config_list": [ {"name": "model1", "base_path": "/models/model1", "nireq": 4, "plugin_config": {"PERFORMANCE_HINT": "LATENCY"}}, {"name": "model2", "base_path": "/models/model2", "nireq": 4, "plugin_config": {"PERFORMANCE_HINT": "LATENCY"}}, {"name": "model3", "base_path": "/models/model3", "nireq": 4, "plugin_config": {"PERFORMANCE_HINT": "LATENCY"}}, {"name": "model4", "base_path": "/models/model4", "nireq": 4, "plugin_config": {"PERFORMANCE_HINT": "LATENCY"}} ]
}
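For context, one variant I am considering (not yet verified) would cap per-model CPU resources so the four models do not oversubscribe the i3-1220P's cores. NUM_STREAMS and INFERENCE_NUM_THREADS are standard OpenVINO CPU properties passed through plugin_config; the values below are only illustrative, and only the model1 entry is shown (model2-model4 would be analogous):

{
  "model_config_list": [
    {"config": {"name": "model1", "base_path": "/models/model1", "nireq": 2,
                "plugin_config": {"PERFORMANCE_HINT": "LATENCY",
                                  "NUM_STREAMS": "1",
                                  "INFERENCE_NUM_THREADS": "3"}}}
  ]
}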
Steps to Reproduce
- Deploy OpenVINO Model Server with a single YOLOv5 model (FP32) using the above command and a config.json containing only model2.
- Send gRPC inference requests (e.g., via ovmsclient) and measure latency from the logs or the metrics endpoint (http://localhost:811/metrics); a helper for scraping the metrics is sketched after this list.
- Update config.json to include 4 YOLOv5 models (model1, model2, model3, model4).
- Restart the container and send parallel gRPC requests for all 4 models using a Python script with ThreadPoolExecutor.
- Compare inference times from the logs.
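The metrics snapshot referenced in step 2 can be pulled with a few lines of Python; the substring filter below is deliberately loose because the exact metric names vary between OpenVINO Model Server versions.

# Fetch the Prometheus-format metrics exposed by OVMS on the REST port mapped
# above (8000 in the container, 811 on the host) and print latency-related lines.
import urllib.request

METRICS_URL = "http://localhost:811/metrics"

with urllib.request.urlopen(METRICS_URL) as response:
    metrics_text = response.read().decode("utf-8")

for line in metrics_text.splitlines():
    if not line.startswith("#") and "time" in line:
        print(line)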
Expected Behavior
With 4 models loaded and parallel inference enabled (nireq=4), I expect inference times to remain close to single-model performance (e.g., 20-30 ms total latency across all models), leveraging OpenVINO's multi-stream capabilities and parallel execution.
Actual Behavior
Inference time per model increases significantly (30-100 ms per request), indicating resource contention or inefficient multi-model handling. For example, model2 jumps from 15-24 ms (single model) to 46.367 ms (multi-model).
Suggestions for optimizing resource allocation or server configuration to maintain low latency with multiple YOLOv5 models would be greatly appreciated.
Thanks in advance!