Sending OTel traces from the model's code with parallel inference enabled #2052

Description

@ncapilla

Hi,

We've been using MLServer to serve our models with the parallel inference feature enabled, so we have multiple instances of our model running on different workers inside the same Kubernetes pod.

We report OTel traces from inside our model's code; here is a simplified snippet to illustrate the issue.

import pandas as pd
from mlflow.pyfunc import PythonModel
from opentelemetry import trace

class CustomModel(PythonModel):

    def __init__(self, model: MyCustomModelClass):  # MyCustomModelClass and dataframe_to_json are our own helpers
        self.__model = model

    @property
    def model(self):
        return self.__model

    def predict(self, context, input_data: pd.DataFrame) -> pd.DataFrame:
        with trace.get_tracer(__name__).start_as_current_span("predict") as predict_span:
            predict_span.set_attribute('input_data', dataframe_to_json(input_data))

            # Nested span for the expensive part of the prediction
            with trace.get_tracer(__name__).start_as_current_span("heavy-operation"):
                output_data = self.__model.run_heavy_operation()

            predict_span.set_attribute('output_data', dataframe_to_json(output_data))
            return output_data

This approach works if you run the model inside the same context as the MLServer instance. But we have isolated environments for the MLServer instance and the model itself, and we are also running the model with parallel inference, so it's running in its own spawned process.

This leads us to a couple of problems.

The OpenTelemetry span exporter is not instantiated inside the model context. I could add this to my model's code to get it up and running. The environment variables are shared with the spawned process, so no problem there.

    from opentelemetry import trace
    from opentelemetry.exporter.otlp.proto.grpc.trace_exporter import OTLPSpanExporter
    from opentelemetry.sdk.resources import Resource
    from opentelemetry.sdk.trace import TracerProvider
    from opentelemetry.sdk.trace.export import BatchSpanProcessor

    def load_context(self, context: PythonModelContext):
        # OpenTelemetry instrumentation: set up a tracer provider in the worker process
        tracer_provider = TracerProvider(
            resource=Resource.create({"service.name": "my-ml-model"}),
        )
        # The OTLP exporter reads its endpoint from the OTEL_* environment variables
        tracer_provider.add_span_processor(BatchSpanProcessor(OTLPSpanExporter()))
        trace.set_tracer_provider(tracer_provider)

But there's a major problem with the parent trace id. When our pod receives an inference request it carries a parent trace id, which is picked up on the FastAPI side, and those spans are reported within the same trace without any problem. So we do get the spans from the FastAPI server running in MLServer.

To have proper instrumentation inside the model itself, on top of instantiating the span exporter we would also need to propagate that parent trace id from the request. It would follow this route: FastAPI -> Dispatcher -> Worker -> Model (inside the model registry), and it would need to be added as a kwarg or similar to the ModelRequest that travels through the multiprocessing queues, as sketched below.
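For illustration, a minimal sketch of what the dispatcher side could look like, using the standard OpenTelemetry propagation API. The trace_context field and the build_model_request helper are hypothetical, not part of the current MLServer ModelRequest:

from opentelemetry import propagate

def build_model_request(model_name: str, payload: dict) -> dict:
    # Serialize the current trace context (W3C traceparent/tracestate)
    # into a plain dict so it can cross the multiprocessing queue.
    carrier: dict = {}
    propagate.inject(carrier)

    # Hypothetical shape: the real ModelRequest would need a new field for this.
    return {
        "model_name": model_name,
        "payload": payload,
        "trace_context": carrier,
    }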

Something like open-telemetry/opentelemetry-specification#740. But there they are talking about a process spawning a new one and propagating the parent trace id to the child process. Our problem is different: the dispatcher receives N parent trace ids concurrently and needs to propagate exactly one of them with each ModelRequest, so the model can register its spans under the right trace id.
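On the worker side, something along these lines could restore the parent context before calling into the model (again assuming the hypothetical trace_context field). Attaching and detaching the extracted context per request is what keeps the N concurrent parent trace ids from getting mixed up:

from opentelemetry import context, propagate, trace

def run_inference(model, model_request: dict):
    # Rebuild the caller's trace context from the carrier that travelled
    # with this specific ModelRequest through the multiprocessing queue.
    parent_ctx = propagate.extract(model_request.get("trace_context", {}))

    # Make it current only for the duration of this request, so concurrent
    # requests with different parent trace ids don't leak into each other.
    token = context.attach(parent_ctx)
    try:
        with trace.get_tracer(__name__).start_as_current_span("worker-predict"):
            return model.predict(None, model_request["payload"])
    finally:
        context.detach(token)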
