[BUG] TritonServer start fails to load after Initializing QueryFaiss #760

@vs385

Description

I tried running this notebook example:

  • Building-and-deploying-multi-stage-RecSys

When I reach the step that starts the server, I run the following from a terminal opened alongside the running notebook in JupyterHub:
tritonserver --model-repository=/ensemble_export_path/ --backend-config=tensorflow,version=2
The terminal then hangs and prints nothing further after these lines:

2022-12-07 17:58:42.661940: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:991] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2022-12-07 17:58:42.662552: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1532] Created device /job:localhost/replica:0/task:0/device:GPU:1 with 19610 MB memory: -> device: 1, name: NVIDIA A10G, pci bus id: 0000:00:1c.0, compute capability: 8.6
2022-12-07 17:58:42.662612: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:991] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2022-12-07 17:58:42.663241: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1532] Created device /job:localhost/replica:0/task:0/device:GPU:2 with 19610 MB memory: -> device: 2, name: NVIDIA A10G, pci bus id: 0000:00:1d.0, compute capability: 8.6
2022-12-07 17:58:42.663299: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:991] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2022-12-07 17:58:42.663917: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1532] Created device /job:localhost/replica:0/task:0/device:GPU:3 with 19610 MB memory: -> device: 3, name: NVIDIA A10G, pci bus id: 0000:00:1e.0, compute capability: 8.6
2022-12-07 17:58:42.672242: I tensorflow/cc/saved_model/loader.cc:230] Restoring SavedModel bundle.
2022-12-07 17:58:42.728537: I tensorflow/cc/saved_model/loader.cc:214] Running initialization op on SavedModel bundle at path: /Merlin/examples/Building-and-deploying-multi-stage-RecSys/poc_ensemble/1_predicttensorflow/1/model.savedmodel
2022-12-07 17:58:42.752115: I tensorflow/cc/saved_model/loader.cc:321] SavedModel load for tags { serve }; Status: success: OK. Took 106507 microseconds.
I1207 22:58:42.752287 1485 python_be.cc:1767] TRITONBACKEND_ModelInstanceInitialize: 0_queryfeast (GPU device 1)
I1207 22:58:45.055801 1485 python_be.cc:1767] TRITONBACKEND_ModelInstanceInitialize: 2_queryfaiss (GPU device 0)
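
To tell a server that is still loading apart from one that has truly stalled, it may help to poll Triton's readiness endpoints from a second terminal. The following is only a minimal sketch, not a fix: it assumes the server is listening on Triton's default HTTP port 8000 on localhost, and it only checks the three model names that appear in the log above. `BASE_URL`, `ready_url`, `is_ready`, and `report` are illustrative names, not part of Merlin or Triton.

```python
import urllib.error
import urllib.request

# Assumption: Triton's HTTP endpoint is on the default port 8000 on this host.
BASE_URL = "http://localhost:8000"

# Model names copied from the log output in this report.
MODELS = ["0_queryfeast", "1_predicttensorflow", "2_queryfaiss"]


def ready_url(base_url, model=None):
    """Build the v2 readiness URL for the whole server or for one model."""
    base = base_url.rstrip("/")
    if model is None:
        return f"{base}/v2/health/ready"
    return f"{base}/v2/models/{model}/ready"


def is_ready(url, timeout=2.0):
    """Return True only if the endpoint answers with HTTP 200."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status == 200
    except (urllib.error.URLError, OSError):
        # Covers connection refused, timeouts, and non-2xx responses
        # (HTTPError is a subclass of URLError).
        return False


def report(base_url=BASE_URL, models=MODELS):
    """Collect readiness for the server and for each listed model."""
    status = {"server": is_ready(ready_url(base_url))}
    for name in models:
        status[name] = is_ready(ready_url(base_url, name))
    return status


if __name__ == "__main__":
    for name, ok in report().items():
        print(f"{name}: {'ready' if ok else 'NOT ready'}")
```

If every entry reports NOT ready, including `server`, the HTTP endpoint is probably not answering yet, which would be consistent with startup blocking during model initialization.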

I'm running on the following:
Merlin image: nvcr.io/nvidia/merlin/merlin-tensorflow:22.10
Instance: EC2 g5
Python version: 3.8.10
TensorFlow version (GPU): 2.9.1+nv22.8

Faiss-gpu installed:
faiss 1.7.2
faiss-gpu 1.7.2

Labels: P1 (Priority 1), bug (Something isn't working)
