
Significant Inference Time Increase with Multiple Models in OpenVINO Model Server #3136

@sriram-dsl

Description


Environment

  • Operating System: Ubuntu 24.04
  • OpenVINO Version: openvino/model_server:latest (Docker container)
  • Hardware: Intel(R) Core(TM) i3-1220P (12th Gen)
  • Models: YOLOv5 models converted to OpenVINO IR format (.xml and .bin), FP32 precision
  • Deployment: Docker container with OpenVINO Model Server

Issue

I deployed the OpenVINO Model Server container with a single YOLOv5 model (FP32 precision) and observed inference times of 8-20 milliseconds per request, which is acceptable. However, when I load 4 YOLOv5 models on the same server, the inference time spikes to 30-100 milliseconds per model request. This significant increase in latency occurs despite using parallelism in my client script (via ThreadPoolExecutor) and setting "nireq": 4 per model in the server configuration.

This spike leads to higher hardware resource usage (e.g., CPU/GPU contention) and impacts real-time performance. I expected multi-model inference to stay close to single-model latency with proper resource allocation, especially given OpenVINO's support for parallel inference.
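
For reference, the client-side parallelism looks roughly like the sketch below (a minimal reconstruction using the ovmsclient gRPC API; the input tensor name "images", the 640x640 shape, and the dummy input are assumptions, not my exact production script):

    # Minimal sketch of the parallel client; adjust input name/shape to the actual models.
    import time
    from concurrent.futures import ThreadPoolExecutor

    import numpy as np
    from ovmsclient import make_grpc_client

    client = make_grpc_client("localhost:900")  # gRPC port as mapped in the docker command
    MODELS = ["model1", "model2", "model3", "model4"]

    def infer(model_name):
        frame = np.random.rand(1, 3, 640, 640).astype(np.float32)  # placeholder input
        start = time.perf_counter()
        client.predict(inputs={"images": frame}, model_name=model_name)
        return model_name, (time.perf_counter() - start) * 1000

    # One worker per model, all four requests in flight at the same time.
    with ThreadPoolExecutor(max_workers=len(MODELS)) as pool:
        for name, ms in pool.map(infer, MODELS):
            print(f"{name}: {ms:.1f} ms")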

Logs

Single Model (model2)

[2025-03-20 17:30:41.135] Prediction duration in model model2, version 1, nireq 0: 15.680 ms
[2025-03-20 17:30:41.135] Total gRPC request processing time: 15.861 ms
[2025-03-20 17:30:41.266] Prediction duration in model model2, version 1, nireq 0: 24.077 ms
[2025-03-20 17:30:41.266] Total gRPC request processing time: 24.306 ms
[2025-03-20 17:30:41.383] Prediction duration in model model2, version 1, nireq 0: 15.227 ms
[2025-03-20 17:30:41.383] Total gRPC request processing time: 15.452 ms

Multi-Model (4 models loaded)

[2025-03-20 18:17:15.523] Prediction duration in model model1, version 1, nireq 0: 42.076 ms
[2025-03-20 18:17:15.523] Total gRPC request processing time: 42.317 ms
[2025-03-20 18:17:15.530] Prediction duration in model model2, version 1, nireq 0: 46.367 ms
[2025-03-20 18:17:15.530] Total gRPC request processing time: 46.606 ms
[2025-03-20 18:17:15.530] Prediction duration in model model3, version 1, nireq 0: 45.479 ms
[2025-03-20 18:17:15.530] Total gRPC request processing time: 45.68 ms
[2025-03-20 18:17:15.514] Prediction duration in model model4, version 1, nireq 0: 27.955 ms
[2025-03-20 18:17:15.514] Total gRPC request processing time: 28.175 ms

Configuration

  • Docker Command:
    sudo docker run -d --shm-size=23g --ulimit memlock=-1 --ulimit stack=67108864 --name openvino_model_server -v /home/ubuntu/models:/models -p 900:9000 -p 811:8000 openvino/model_server:latest --config_path /models/config.json --port 9000 --rest_port 8000 --metrics_enable --log_level DEBUG
    
    
  • config.json:
    {
      "model_config_list": [
        {"config": {"name": "model1", "base_path": "/models/model1", "nireq": 4, "plugin_config": {"PERFORMANCE_HINT": "LATENCY"}}},
        {"config": {"name": "model2", "base_path": "/models/model2", "nireq": 4, "plugin_config": {"PERFORMANCE_HINT": "LATENCY"}}},
        {"config": {"name": "model3", "base_path": "/models/model3", "nireq": 4, "plugin_config": {"PERFORMANCE_HINT": "LATENCY"}}},
        {"config": {"name": "model4", "base_path": "/models/model4", "nireq": 4, "plugin_config": {"PERFORMANCE_HINT": "LATENCY"}}}
      ]
    }
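
To confirm all four models are actually loaded and serving under this config, their status can be queried over gRPC, as in the small ovmsclient sketch below (the address assumes the port mapping from the docker command above):

    # Sketch: check that every model from config.json is loaded and available.
    from ovmsclient import make_grpc_client

    client = make_grpc_client("localhost:900")
    for name in ["model1", "model2", "model3", "model4"]:
        print(name, client.get_model_status(model_name=name))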


Steps to Reproduce

  1. Deploy OpenVINO Model Server with a single YOLOv5 model (FP32) using the above command and a config.json containing only model2.
  2. Send gRPC inference requests (e.g., via ovmsclient) and measure latency from the logs or the metrics endpoint (http://localhost:811/metrics); a small metrics-scraping sketch follows this list.
  3. Update config.json to include 4 YOLOv5 models (model1, model2, model3, model4).
  4. Restart the container and send parallel gRPC requests for all 4 models using a Python script with ThreadPoolExecutor.
  5. Compare inference times from logs.
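
A rough way to pull the server-side numbers for step 2 is to scrape the Prometheus metrics endpoint; the sketch below simply dumps the metric lines that mention the served models (port 811 assumes the mapping from the docker command, and no specific metric names are assumed):

    # Sketch: print the per-model Prometheus metrics exposed by the OVMS metrics endpoint.
    import requests

    text = requests.get("http://localhost:811/metrics", timeout=5).text
    for line in text.splitlines():
        # Skip comment lines and keep only metrics that reference one of the served models.
        if not line.startswith("#") and any(m in line for m in ("model1", "model2", "model3", "model4")):
            print(line)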

Expected Behavior

With 4 models loaded and parallel inference enabled (nireq=4), I expect inference times to remain close to single-model performance (e.g., 20-30 ms total latency across all models), leveraging OpenVINO's multi-stream capabilities and parallel execution.

Actual Behavior

Inference time per model increases significantly (30-100 ms per request), indicating resource contention or inefficient multi-model handling. For example, model2 jumps from 15-24 ms (single model) to 46.367 ms (multi-model).

Suggestions for optimizing resource allocation or server configuration to maintain low latency with multiple YOLOv5 models would be greatly appreciated.

Thanks in advance!
