@BenjaminBraunDev BenjaminBraunDev commented Nov 7, 2025

This PR is stage 3/3 for adding latency prediction and SLO-Aware Routing functionality to EPP.

New Features:

  1. A new `-enable-latency-predictor` flag in the EPP args that tells EPP the sidecars are present and registers the SLO routing plugins.
  2. Requests now support latency SLOs via headers (`x-slo-ttft-ms` and `x-slo-tpot-ms`), plus a boolean header (`x-prediction-based-scheduling`) that selects the SLO routing scheduling profile with SLO scoring. If false, the default profile is used and the request is still tracked and used to train for future requests.

Plugins

Registers and deploys the plugins added in #1849 via scheduling profiles:

  1. The requestcontrol plugins that track requests and their latencies and send data back to the training sidecar for processing.
  2. A scorer plugin based on predicted latencies and headroom.

PodMetrics

Adds back the `totalRunningRequestsMetric` Prometheus metric from vLLM, which was removed in the past for being unused but is now an input to our latency prediction model.

Guide

Adds a guide for deploying IGW with SLO-Aware Routing at `site-src/guides/slo-aware-routing.md`

Fixes #1323

Does this PR introduce a user-facing change?:

Yes:
EPP has a new runtime/deployment argument: `-enable-latency-predictor`

When enabled, EPP will register and support the new SLO routing plugins and sidecars. With this flag, EPP assumes the predictor and training sidecars run alongside it in the same pod.

When latency prediction is enabled for EPP, requests support 3 new headers:
- `x-slo-ttft-ms`: a float giving the desired "Time To First Token" SLO in milliseconds
- `x-slo-tpot-ms`: a float giving the desired average "Time Per Output Token" SLO in milliseconds
- `x-prediction-based-scheduling`: a boolean selecting the SLO-aware scheduling profile instead of the default one

@k8s-ci-robot

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: BenjaminBraunDev
Once this PR has been reviewed and has the lgtm label, please assign kfswain for approval. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment


netlify bot commented Nov 7, 2025

Deploy Preview for gateway-api-inference-extension ready!

- 🔨 Latest commit: ddee4c7
- 🔍 Latest deploy log: https://app.netlify.com/projects/gateway-api-inference-extension/deploys/691fbc7c1b291f000811303c
- 😎 Deploy Preview: https://deploy-preview-1839--gateway-api-inference-extension.netlify.app

@k8s-ci-robot k8s-ci-robot added the cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. label Nov 7, 2025
@k8s-ci-robot k8s-ci-robot requested review from ahg-g and elevran November 7, 2025 23:56
@k8s-ci-robot k8s-ci-robot added the needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. label Nov 7, 2025
@k8s-ci-robot

Hi @BenjaminBraunDev. Thanks for your PR.

I'm waiting for a github.com member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@k8s-ci-robot k8s-ci-robot added the size/XXL Denotes a PR that changes 1000+ lines, ignoring generated files. label Nov 7, 2025

ahg-g commented Nov 14, 2025

/ok-to-test

@k8s-ci-robot k8s-ci-robot added ok-to-test Indicates a non-member PR verified by an org member that is safe to test. needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. and removed needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. labels Nov 14, 2025

BenjaminBraunDev commented Nov 15, 2025

@ahg-g Due to its large size, Kellen requested we split this PR up. Here is the first one: #1849

@BenjaminBraunDev BenjaminBraunDev force-pushed the slo-aware-routing-stage-3 branch from 2c56616 to f63bc01 Compare November 20, 2025 21:42
@k8s-ci-robot k8s-ci-robot added size/XL Denotes a PR that changes 500-999 lines, ignoring generated files. and removed needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. size/XXL Denotes a PR that changes 1000+ lines, ignoring generated files. labels Nov 20, 2025
@BenjaminBraunDev BenjaminBraunDev changed the title SLO Aware Routing Plugins and Metrics SLO Aware Routing Sidecar + Plugin EPP Integration and Helm Deployment Nov 20, 2025
@k8s-ci-robot k8s-ci-robot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Nov 20, 2025
@BenjaminBraunDev BenjaminBraunDev force-pushed the slo-aware-routing-stage-3 branch from d177545 to ddee4c7 Compare November 21, 2025 01:12
@k8s-ci-robot k8s-ci-robot removed the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Nov 21, 2025
COPY apix ./apix
COPY api ./api
COPY version ./version
COPY sidecars ./sidecars
Collaborator

I wonder if this should be called sidecars if it's just a Go interface for the predictor API


// Latency Predictor Flag
enableLatencyPredictor = flag.Bool("enable-latency-predictor", false, "Enable the regression-based latency predictor and scheduler scorer.")
tracing = flag.Bool("tracing", true, "Enables emitting traces")
Collaborator

nit: move tracing above since it's not strictly just for the latency predictor

Contributor Author

done

Director: director,
SaturationDetector: saturationDetector,
UseExperimentalDatalayerV2: r.featureGates[datalayer.FeatureGate], // pluggable data layer feature flag
LatencyPredictor: predictor,
Collaborator

I don't think this is used anywhere, could be an artifact before everything was transitioned to the plugin format?

Contributor Author

yeah it's not, will remove


### Install with SLO-Aware Routing

For full details see the dedicated [SLO-Aware Routing Guide](../../../site-src/guides/slo-aware-routing.md)
Collaborator

this is gonna be a bad link, site-src is the source of truth for our static website: https://gateway-api-inference-extension.sigs.k8s.io/

Your PR comes with a preview: https://deploy-preview-1839--gateway-api-inference-extension.netlify.app/ so you should be able to validate the correct URL pathing that way

volumeMounts:
- name: plugins-config-volume
mountPath: "/config"
{{- if .Values.inferenceExtension.latencyPredictor.enabled }}
Collaborator

Trusting that this has been manually validated to work both on & off.

Helm config is difficult to review as is, and this is a rather large block of it

# port: 8081
# protocol: TCP
# targetPort: 8081
# extraServicePorts:
Collaborator

nit: is this formatting needed?

Contributor Author

No, just an artifact. Will fix.

github.com/elastic/crd-ref-docs v0.2.0
github.com/envoyproxy/go-control-plane/envoy v1.36.0
github.com/go-logr/logr v1.4.3
github.com/go-logr/zapr v1.3.0
Collaborator

is there a reason we are using/importing a new logger?

PodList(predicate func(backendmetrics.PodMetrics) bool) []backendmetrics.PodMetrics
PodUpdateOrAddIfNotExist(pod *corev1.Pod) bool
PodDelete(podNAme string)
PodDelete(podName string)
Collaborator

ty

Headers: reqCtx.Response.Headers,
RequestId: reqCtx.Request.Headers[requtil.RequestIdHeaderKey],
Headers: reqCtx.Response.Headers,
EndOfStream: reqCtx.ResponseComplete,
Collaborator

Is this needed? Would like to avoid the director being aware of the ext-proc details if possible, or if not, the EoS should signal that the response complete plugin should run I would think


### Next Steps: Advanced Features

You have now deployed a basic Inference Gateway with a simple routing strategy. To explore more advanced features, such as SLO-aware routing, please refer to the following guide:
Collaborator

Let's call this latency-based predictor or some other name.

Also, please add it to the guide sidebar like the other guides; otherwise it's hidden in the getting started guide, and that is the only discoverable link (other than just knowing the URL)

Contributor

@ahg-g ahg-g left a comment

it would be great if you can send a separate PR for adding the running requests metric

tracing = flag.Bool("tracing", true, "Enables emitting traces")

// Latency Predictor Flag
enableLatencyPredictor = flag.Bool("enable-latency-predictor", false, "Enable the regression-based latency predictor and scheduler scorer.")
Contributor

Why do we need an explicit flag? isn't the predictor a plugin? and so the plugins configuration should take care of enablement.

}

rawConfig, err := r.parseConfigurationPhaseOne(ctx)
// ===================================================================
Contributor

All of this initialization should be part of the predictor plugin initialization itself. We should not have any predictor specific logic here.

Contributor Author

agreed, will move this


The behavior of the SLO-aware router can be fine-tuned using the following environment variables in the Endpoint Picker deployment. These can be set under `inferenceExtension.env` in your `values.yaml` file.

| Environment Variable | Description | Default |
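As an illustrative sketch only (the variable name `PREDICTION_SERVER_URL` appears in this PR's Helm template; the value shown here is a hypothetical localhost address, not a documented default), an override under `inferenceExtension.env` in `values.yaml` would look like:

```yaml
inferenceExtension:
  env:
    # Hypothetical override; the predictor sidecar is reached over localhost.
    - name: PREDICTION_SERVER_URL
      value: "http://localhost:8001"
```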
Contributor

why are those set as env vars instead of a plugin specific configuration parameters?

{{- if .Values.inferenceExtension.latencyPredictor.enabled }}
- type: slo-aware-routing
- type: slo-aware-profile-handler
- type: max-score-picker
Contributor

this plugin is added by default, please remove it

- name: default
plugins:
- pluginRef: slo-aware-routing
weight: 0
Contributor

why are we adding this plugin if the weight is 0?

Contributor Author

We need this to track requests and gather latency data, but we don't want to use it for routing

Contributor

This is a design smell

fieldRef:
fieldPath: metadata.name
{{- if .Values.inferenceExtension.latencyPredictor.enabled }}
- name: PREDICTION_SERVER_URL
Contributor

I don't think we need any of those new env vars, those should be exposed as a plugin configuration with a default that just works.

mountPath: "/config"
{{- if .Values.inferenceExtension.latencyPredictor.enabled }}
# Training Server Sidecar Container
- name: training-server
Contributor

can we define this sidecar somewhere else and we embed it here, inlining this whole thing directly here makes it hard to read

- name: http-metrics
protocol: TCP
port: {{ .Values.inferenceExtension.metricsPort | default 9090 }}
{{- if .Values.inferenceExtension.latencyPredictor.enabled }}
Contributor

The predictor is running as a sidecar and will only be contacted by the epp via localhost, we should not need to expose it here. in other words, I don't expect any changes to the service yaml.

Contributor Author

There are some metrics that the sidecar emits to EPP and in order to read those from the metrics port on the service we need to expose the port (@kaushikmitr is using this for a dashboard)

Contributor

EPP can read the metrics via localhost, why does it need to go through service for that?

}
}

if p.MetricMapping.TotalRunningRequests != nil {
Contributor

Ideally, adding scraping for this metric should have been done in a separate PR.

"sigs.k8s.io/gateway-api-inference-extension/pkg/epp/handlers"
"sigs.k8s.io/gateway-api-inference-extension/pkg/epp/requestcontrol"
"sigs.k8s.io/gateway-api-inference-extension/pkg/epp/saturationdetector"
latencypredictor "sigs.k8s.io/gateway-api-inference-extension/sidecars/latencypredictorasync"
Contributor

ditto, all predictor related logic should be confined in its plugin; if this is not possible now, then this tells me we have a missing extension point in our framework.

Collaborator

I don't think it's being used at all. See: #1839 (comment)


ahg-g commented Nov 22, 2025

@kaushikmitr @BenjaminBraunDev this is predicted latency, not slo, right? if so, please use predicted-latency-scorer as the name of the plugin.
