@BenjaminBraunDev BenjaminBraunDev commented Nov 7, 2025

This PR is stage 3/3 for adding latency prediction and SLO-Aware Routing functionality to EPP.

New Features:

  1. A new `-enable-latency-predictor` flag in the EPP args that tells EPP the sidecars are present and registers the SLO routing plugins.
  2. Requests now support latency SLOs via headers (`x-slo-ttft-ms` and `x-slo-tpot-ms`), plus a boolean header (`x-prediction-based-scheduling`) that selects the SLO routing scheduling profile with SLO scoring. If false, the default profile is used and the request is still tracked and used to train for future requests.

Plugins

Registers and deploys the plugins added in #1849 via scheduling profiles:

  1. The requestcontrol plugins that track requests and their latencies and send data back to the training sidecar for processing.
  2. A scorer plugin based on predicted latencies and headroom.

PodMetrics

Adds back the `totalRunningRequestsMetric` Prometheus metric from vLLM, which was removed in the past for being unused but is now an input to our latency prediction model.

Guide

Adds a guide for deploying IGW with SLO-Aware Routing at `site-src/guides/slo-aware-routing.md`

Fixes #1323

Does this PR introduce a user-facing change?:

Yes:
EPP has a new runtime/deployment argument: `-enable-latency-predictor`

When enabled, EPP will register and support the new SLO routing plugins and sidecars. With this flag, EPP assumes the predictor and training sidecars run alongside it in the same pod.

When latency prediction is enabled for EPP, requests support 3 new headers:
- `x-slo-ttft-ms`: a float giving the desired "Time To First Token" SLO in milliseconds
- `x-slo-tpot-ms`: a float giving the desired average "Time Per Output Token" SLO in milliseconds
- `x-prediction-based-scheduling`: a boolean selecting the SLO-aware scheduling profile instead of the default one

@k8s-ci-robot

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: BenjaminBraunDev
Once this PR has been reviewed and has the lgtm label, please assign kfswain for approval. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment


netlify bot commented Nov 7, 2025

Deploy Preview for gateway-api-inference-extension ready!

- 🔨 Latest commit: ddee4c7
- 🔍 Latest deploy log: https://app.netlify.com/projects/gateway-api-inference-extension/deploys/691fbc7c1b291f000811303c
- 😎 Deploy Preview: https://deploy-preview-1839--gateway-api-inference-extension.netlify.app

@k8s-ci-robot k8s-ci-robot added the cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. label Nov 7, 2025
@k8s-ci-robot k8s-ci-robot requested review from ahg-g and elevran November 7, 2025 23:56
@k8s-ci-robot k8s-ci-robot added the needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. label Nov 7, 2025
@k8s-ci-robot

Hi @BenjaminBraunDev. Thanks for your PR.

I'm waiting for a github.com member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@k8s-ci-robot k8s-ci-robot added the size/XXL Denotes a PR that changes 1000+ lines, ignoring generated files. label Nov 7, 2025

ahg-g commented Nov 14, 2025

/ok-to-test

@k8s-ci-robot k8s-ci-robot added ok-to-test Indicates a non-member PR verified by an org member that is safe to test. needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. and removed needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. labels Nov 14, 2025

BenjaminBraunDev commented Nov 15, 2025

@ahg-g Due to its large size, Kellen requested we split this PR up. Here is the first one: #1849

@BenjaminBraunDev BenjaminBraunDev force-pushed the slo-aware-routing-stage-3 branch from 2c56616 to f63bc01 Compare November 20, 2025 21:42
@k8s-ci-robot k8s-ci-robot added size/XL Denotes a PR that changes 500-999 lines, ignoring generated files. and removed needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. size/XXL Denotes a PR that changes 1000+ lines, ignoring generated files. labels Nov 20, 2025
@BenjaminBraunDev BenjaminBraunDev changed the title SLO Aware Routing Plugins and Metrics SLO Aware Routing Sidecar + Plugin EPP Integration and Helm Deployment Nov 20, 2025
@k8s-ci-robot k8s-ci-robot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Nov 20, 2025
@BenjaminBraunDev BenjaminBraunDev force-pushed the slo-aware-routing-stage-3 branch from d177545 to ddee4c7 Compare November 21, 2025 01:12
@k8s-ci-robot k8s-ci-robot removed the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Nov 21, 2025
COPY apix ./apix
COPY api ./api
COPY version ./version
COPY sidecars ./sidecars
Collaborator

I wonder if this should be called sidecars if it's just a Go interface for the predictor API


// Latency Predictor Flag
enableLatencyPredictor = flag.Bool("enable-latency-predictor", false, "Enable the regression-based latency predictor and scheduler scorer.")
tracing = flag.Bool("tracing", true, "Enables emitting traces")
Collaborator

nit: move tracing above since it's not strictly just for the latency predictor

Contributor Author

done

Director: director,
SaturationDetector: saturationDetector,
UseExperimentalDatalayerV2: r.featureGates[datalayer.FeatureGate], // pluggable data layer feature flag
LatencyPredictor: predictor,
Collaborator

I don't think this is used anywhere, could be an artifact before everything was transitioned to the plugin format?

Contributor Author

yeah it's not, will remove


### Install with SLO-Aware Routing

For full details see the dedicated [SLO-Aware Routing Guide](../../../site-src/guides/slo-aware-routing.md)
Collaborator

this is gonna be a bad link, site-src is the source of truth for our static website: https://gateway-api-inference-extension.sigs.k8s.io/

Your PR comes with a preview: https://deploy-preview-1839--gateway-api-inference-extension.netlify.app/ so you should be able to validate the correct URL pathing that way

volumeMounts:
- name: plugins-config-volume
mountPath: "/config"
{{- if .Values.inferenceExtension.latencyPredictor.enabled }}
Collaborator

Trusting that this has been manually validated to work both on & off.

Helm config is difficult to review as is, and this is a rather large block of it

# port: 8081
# protocol: TCP
# targetPort: 8081
# extraServicePorts:
Collaborator

nit: is this formatting needed?

Contributor Author

No, just an artifact. Will fix.

github.com/elastic/crd-ref-docs v0.2.0
github.com/envoyproxy/go-control-plane/envoy v1.36.0
github.com/go-logr/logr v1.4.3
github.com/go-logr/zapr v1.3.0
Collaborator

is there a reason we are using/importing a new logger?

PodList(predicate func(backendmetrics.PodMetrics) bool) []backendmetrics.PodMetrics
PodUpdateOrAddIfNotExist(pod *corev1.Pod) bool
PodDelete(podNAme string)
PodDelete(podName string)
Collaborator

ty

Headers: reqCtx.Response.Headers,
RequestId: reqCtx.Request.Headers[requtil.RequestIdHeaderKey],
Headers: reqCtx.Response.Headers,
EndOfStream: reqCtx.ResponseComplete,
Collaborator

Is this needed? Would like to avoid the director being aware of the ext-proc details if possible, or if not, the EoS should signal that the response complete plugin should run I would think


### Next Steps: Advanced Features

You have now deployed a basic Inference Gateway with a simple routing strategy. To explore more advanced features, such as SLO-aware routing, please refer to the following guide:
Collaborator

Let's call this latency-based predictor or some other name.

Also, please add it to the guide sidebar like the other guides; otherwise it's hidden in the getting started guide, and that is the only discoverable link (other than just knowing the URL)

Contributor

@ahg-g ahg-g left a comment

it would be great if you can send a separate PR for adding the running requests metric

tracing = flag.Bool("tracing", true, "Enables emitting traces")

// Latency Predictor Flag
enableLatencyPredictor = flag.Bool("enable-latency-predictor", false, "Enable the regression-based latency predictor and scheduler scorer.")
Contributor

Why do we need an explicit flag? isn't the predictor a plugin? and so the plugins configuration should take care of enablement.

}

rawConfig, err := r.parseConfigurationPhaseOne(ctx)
// ===================================================================
Contributor

All of this initialization should be part of the predictor plugin initialization itself. We should not have any predictor specific logic here.

Contributor Author

agreed, will move this


The behavior of the SLO-aware router can be fine-tuned using the following environment variables in the Endpoint Picker deployment. These can be set under `inferenceExtension.env` in your `values.yaml` file.

| Environment Variable | Description | Default |
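As an illustrative sketch only (the variable name `PREDICTION_SERVER_URL` appears in this PR's Helm template; the value shown here is a hypothetical localhost address, not a documented default), an override under `inferenceExtension.env` in `values.yaml` would look like:

```yaml
inferenceExtension:
  env:
    # Hypothetical override; the predictor sidecar is reached over localhost.
    - name: PREDICTION_SERVER_URL
      value: "http://localhost:8001"
```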
Contributor

why are those set as env vars instead of a plugin specific configuration parameters?

{{- if .Values.inferenceExtension.latencyPredictor.enabled }}
- type: slo-aware-routing
- type: slo-aware-profile-handler
- type: max-score-picker
Contributor

this plugin is added by default, please remove it

- name: default
plugins:
- pluginRef: slo-aware-routing
weight: 0
Contributor

why are we adding this plugin if the weight is 0?

Contributor Author

We need this to track requests and gather latency data, but we don't want to use it for routing

Contributor

This is a design smell

fieldRef:
fieldPath: metadata.name
{{- if .Values.inferenceExtension.latencyPredictor.enabled }}
- name: PREDICTION_SERVER_URL
Contributor

I don't think we need any of those new env vars, those should be exposed as a plugin configuration with a default that just works.

mountPath: "/config"
{{- if .Values.inferenceExtension.latencyPredictor.enabled }}
# Training Server Sidecar Container
- name: training-server
Contributor

can we define this sidecar somewhere else and we embed it here, inlining this whole thing directly here makes it hard to read

- name: http-metrics
protocol: TCP
port: {{ .Values.inferenceExtension.metricsPort | default 9090 }}
{{- if .Values.inferenceExtension.latencyPredictor.enabled }}
Contributor

The predictor is running as a sidecar and will only be contacted by the epp via localhost, we should not need to expose it here. in other words, I don't expect any changes to the service yaml.

Contributor Author

There are some metrics that the sidecar emits to EPP and in order to read those from the metrics port on the service we need to expose the port (@kaushikmitr is using this for a dashboard)

Contributor

EPP can read the metrics via localhost, why does it need to go through service for that?

}
}

if p.MetricMapping.TotalRunningRequests != nil {
Contributor

Ideally, adding scraping for this metric should have been done in a separate PR.

"sigs.k8s.io/gateway-api-inference-extension/pkg/epp/handlers"
"sigs.k8s.io/gateway-api-inference-extension/pkg/epp/requestcontrol"
"sigs.k8s.io/gateway-api-inference-extension/pkg/epp/saturationdetector"
latencypredictor "sigs.k8s.io/gateway-api-inference-extension/sidecars/latencypredictorasync"
Contributor

ditto, all predictor related logic should be confined in its plugin; if this is not possible now, then this tells me we have a missing extension point in our framework.

Collaborator

I don't think it's being used at all. See: #1839 (comment)


ahg-g commented Nov 22, 2025

@kaushikmitr @BenjaminBraunDev this is predicted latency, not slo, right? if so, please use predicted-latency-scorer as the name of the plugin.
