Hi,
We've been using MLServer to serve our models, and we are using the parallel inference feature, so we have multiple instances of our model inside the same Kubernetes pod running on different workers.
We have OTel traces being reported from inside our model's code; here is a simplified snippet to illustrate the issue:
```python
import pandas as pd
from mlflow.pyfunc import PythonModel
from opentelemetry import trace


class CustomModel(PythonModel):
    def __init__(self, model: MyCustomModelClass):
        self.__model = model

    @property
    def model(self):
        return self.__model

    def predict(self, context, input_data: pd.DataFrame) -> pd.DataFrame:
        # Outer span for the whole prediction, inner span for the heavy part.
        with trace.get_tracer(__name__).start_as_current_span("predict") as trace_span:
            trace_span.set_attribute('input_data', dataframe_to_json(input_data))
            with trace.get_tracer(__name__).start_as_current_span("heavy-operation") as trace_span:
                output_data = self.__model.run_heavy_operation()
                trace_span.set_attribute('output_data', dataframe_to_json(output_data))
            return output_data
```
This approach works if the model runs in the same context as the MLServer instance. However, we have isolated environments for the MLServer instance and the model itself, and we are also running the model with parallel inference, so it runs in its own spawned process.
This leads us to a couple of problems.
The OpenTelemetry span exporter is not instantiated inside the model context. I could add the following to my model's code to get it up and running; the environment variables are shared with the spawned process, so there is no problem there:
```python
from mlflow.pyfunc import PythonModelContext
from opentelemetry import trace
from opentelemetry.sdk.resources import Resource
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor
from opentelemetry.exporter.otlp.proto.grpc.trace_exporter import OTLPSpanExporter


# Method on the CustomModel class above.
def load_context(self, context: PythonModelContext):
    # OpenTelemetry instrumentation: set up a tracer provider inside the
    # worker process so spans created in predict() are actually exported.
    tracer = TracerProvider(
        resource=Resource.create({"service.name": "my-ml-model"}),
    )
    tracer.add_span_processor(BatchSpanProcessor(OTLPSpanExporter()))
    trace.set_tracer_provider(tracer)
```
But there is a major problem with the parent trace id. When our pod receives an inference request, it carries a parent trace id, which is picked up on the FastAPI side, and those spans are reported under the same trace without any problem. So we do get the spans from the FastAPI server running in MLServer.
To have proper instrumentation inside the model itself, on top of instantiating the span exporter we would need to propagate that parent trace id from the request. It would follow this route: FastAPI -> Dispatcher -> Worker -> Model (inside the model registry), and could be added as a kwarg or similar to the ModelRequest that goes through the multiprocessing queues.
Something like open-telemetry/opentelemetry-specification#740. However, that issue is about a process spawning a child and propagating the parent trace id to it once. Our problem is different: the dispatcher receives N parent trace ids and needs to propagate exactly one of them with each ModelRequest, so that the model can register its spans under the right trace id.
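To make the idea concrete, here is a minimal sketch of what that per-request propagation could look like using OpenTelemetry's `propagate` API. The names `request_queue`, `trace_carrier`, `dispatch`, `worker_predict` and `payload` are hypothetical and not MLServer's actual internals; the point is only the inject/extract pattern for carrying a single request's context across the process boundary.

```python
# Sketch only: hypothetical dispatcher/worker plumbing, not MLServer's real code.
import multiprocessing

from opentelemetry import trace
from opentelemetry.propagate import inject, extract

request_queue: "multiprocessing.Queue" = multiprocessing.Queue()


# Dispatcher side (parent process): capture the current request's trace context
# into a plain dict so it can travel through the multiprocessing queue with the request.
def dispatch(model_request):
    carrier = {}
    inject(carrier)  # fills e.g. the W3C "traceparent" entry for this request only
    model_request.trace_carrier = carrier  # hypothetical extra field on the queued message
    request_queue.put(model_request)


# Worker side (spawned process): restore exactly that request's context before
# running the model, so the model's spans attach to the right parent trace.
def worker_predict(model, model_request):
    parent_ctx = extract(getattr(model_request, "trace_carrier", {}) or {})
    tracer = trace.get_tracer(__name__)
    with tracer.start_as_current_span("worker-predict", context=parent_ctx):
        return model.predict(None, model_request.payload)
```

Because the carrier is attached per message rather than per process, each worker invocation gets exactly one parent context, which is what we need when the dispatcher is multiplexing N concurrent traces.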