SLO Aware Routing Sidecar + Plugin EPP Integration and Helm Deployment #1839
Conversation
[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: BenjaminBraunDev. The full list of commands accepted by this bot can be found here.
Needs approval from an approver in each of these files.
Approvers can indicate their approval by writing `/approve` in a comment.
✅ Deploy Preview for gateway-api-inference-extension ready!
Hi @BenjaminBraunDev. Thanks for your PR. I'm waiting for a github.com member to verify that this patch is reasonable to test. If it is, they should reply with `/ok-to-test`. Once the patch is verified, the new status will be reflected by the `ok-to-test` label. I understand the commands that are listed here. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.
/ok-to-test
Force-pushed 2c56616 to f63bc01
…te, add predictor check to the beginning of each requestcontrol hook
…, add predictor to new 2 phase configuration parser
Force-pushed d177545 to ddee4c7
```dockerfile
COPY apix ./apix
COPY api ./api
COPY version ./version
COPY sidecars ./sidecars
```
I wonder if this should be called sidecars if it's just a Go interface for the predictor API
```go
// Latency Predictor Flag
enableLatencyPredictor = flag.Bool("enable-latency-predictor", false, "Enable the regression-based latency predictor and scheduler scorer.")
tracing                = flag.Bool("tracing", true, "Enables emitting traces")
```
nit: move tracing above, since it's not strictly just for the latency predictor
done
```go
Director:                   director,
SaturationDetector:         saturationDetector,
UseExperimentalDatalayerV2: r.featureGates[datalayer.FeatureGate], // pluggable data layer feature flag
LatencyPredictor:           predictor,
```
I don't think this is used anywhere; could it be an artifact from before everything was transitioned to the plugin format?
yeah it's not, will remove
```markdown
### Install with SLO-Aware Routing

For full details see the dedicated [SLO-Aware Routing Guide](../../../site-src/guides/slo-aware-routing.md)
```
this is going to be a bad link; site-src is the source of truth for our static website: https://gateway-api-inference-extension.sigs.k8s.io/
Your PR comes with a preview: https://deploy-preview-1839--gateway-api-inference-extension.netlify.app/ so you should be able to validate the correct URL pathing that way
```yaml
volumeMounts:
- name: plugins-config-volume
  mountPath: "/config"
{{- if .Values.inferenceExtension.latencyPredictor.enabled }}
```
Trusting that this has been manually validated to work both on & off.
Helm config is difficult to review as is, and this is a rather large block of it
```yaml
#   port: 8081
#   protocol: TCP
#   targetPort: 8081
# extraServicePorts:
```
nit: is this formatting needed?
No, just an artifact. Will fix.
```
github.com/elastic/crd-ref-docs v0.2.0
github.com/envoyproxy/go-control-plane/envoy v1.36.0
github.com/go-logr/logr v1.4.3
github.com/go-logr/zapr v1.3.0
```
is there a reason we are using/importing a new logger?
```diff
 PodList(predicate func(backendmetrics.PodMetrics) bool) []backendmetrics.PodMetrics
 PodUpdateOrAddIfNotExist(pod *corev1.Pod) bool
-PodDelete(podNAme string)
+PodDelete(podName string)
```
ty
```go
Headers:     reqCtx.Response.Headers,
RequestId:   reqCtx.Request.Headers[requtil.RequestIdHeaderKey],
Headers:     reqCtx.Response.Headers,
EndOfStream: reqCtx.ResponseComplete,
```
Is this needed? I would like to avoid the director being aware of the ext-proc details if possible; if that's not possible, I would think the EoS should signal that the response-complete plugin should run.
```markdown
### Next Steps: Advanced Features

You have now deployed a basic Inference Gateway with a simple routing strategy. To explore more advanced features, such as SLO-aware routing, please refer to the following guide:
```
Let's call this latency-based predictor or some other name.
Also, please add it to the guide sidebar like the other guides; otherwise it's hidden in the getting-started guide, and that is the only discoverable link (other than just knowing the URL).
ahg-g left a comment:
it would be great if you could send a separate PR for adding the running requests metric
```go
tracing = flag.Bool("tracing", true, "Enables emitting traces")

// Latency Predictor Flag
enableLatencyPredictor = flag.Bool("enable-latency-predictor", false, "Enable the regression-based latency predictor and scheduler scorer.")
```
Why do we need an explicit flag? Isn't the predictor a plugin, so the plugins configuration should take care of enablement?
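To sketch what the reviewer is suggesting: enablement could come from the plugins configuration alone, with no CLI flag. A minimal sketch, assuming the `EndpointPickerConfig` schema and the plugin names from this PR; the profile name and exact wiring are illustrative:

```yaml
# Hypothetical: merely referencing the SLO plugins in the EPP config would
# enable the predictor, making a separate --enable-latency-predictor flag unnecessary.
apiVersion: inference.networking.x-k8s.io/v1alpha1
kind: EndpointPickerConfig
plugins:
- type: slo-aware-routing         # its own initialization would start the predictor client
- type: slo-aware-profile-handler
schedulingProfiles:
- name: slo                       # illustrative profile name
  plugins:
  - pluginRef: slo-aware-routing
```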
```go
}

rawConfig, err := r.parseConfigurationPhaseOne(ctx)
// ===================================================================
```
All of this initialization should be part of the predictor plugin initialization itself. We should not have any predictor-specific logic here.
agreed, will move this
```markdown
The behavior of the SLO-aware router can be fine-tuned using the following environment variables in the Endpoint Picker deployment. These can be set under `inferenceExtension.env` in your `values.yaml` file.

| Environment Variable | Description | Default |
```
why are these set as env vars instead of plugin-specific configuration parameters?
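For illustration, the suggested alternative might look like the following, with the env vars folded into plugin parameters. The parameter names and defaults here are made up; only the plugin type and the `PREDICTION_SERVER_URL` counterpart appear in this PR:

```yaml
# Hypothetical plugin-specific parameters in the EPP plugins config,
# replacing the EPP deployment env vars (names and values are illustrative):
plugins:
- type: slo-aware-routing
  parameters:
    predictionServerURL: "http://localhost:8100"  # a default that "just works" for a localhost sidecar
    sloBufferFactor: 1.0                          # illustrative tuning knob
```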
```yaml
{{- if .Values.inferenceExtension.latencyPredictor.enabled }}
- type: slo-aware-routing
- type: slo-aware-profile-handler
- type: max-score-picker
```
this plugin is added by default, please remove it
```yaml
- name: default
  plugins:
  - pluginRef: slo-aware-routing
    weight: 0
```
why are we adding this plugin if the weight is 0?
We need this to track requests and gather latency data, but we don't want to use it for routing.
This is a design smell
```yaml
fieldRef:
  fieldPath: metadata.name
{{- if .Values.inferenceExtension.latencyPredictor.enabled }}
- name: PREDICTION_SERVER_URL
```
I don't think we need any of these new env vars; they should be exposed as plugin configuration parameters with a default that just works.
```yaml
mountPath: "/config"
{{- if .Values.inferenceExtension.latencyPredictor.enabled }}
# Training Server Sidecar Container
- name: training-server
```
can we define this sidecar somewhere else and embed it here? Inlining this whole thing directly makes it hard to read.
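One way to do that, sketched with a hypothetical Helm named template. The helper file, template name, image value, and port are illustrative; only `latencyPredictor.enabled` comes from this chart:

```yaml
# templates/_sidecars.tpl (hypothetical): define the sidecar once...
{{- define "epp.trainingServerSidecar" }}
- name: training-server
  image: {{ .Values.inferenceExtension.latencyPredictor.trainingImage }}  # illustrative value key
  ports:
  - containerPort: 8000  # illustrative port
{{- end }}

# deployment template: ...then embed it, keeping the containers list readable.
{{- if .Values.inferenceExtension.latencyPredictor.enabled }}
{{- include "epp.trainingServerSidecar" . | nindent 6 }}
{{- end }}
```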
```yaml
- name: http-metrics
  protocol: TCP
  port: {{ .Values.inferenceExtension.metricsPort | default 9090 }}
{{- if .Values.inferenceExtension.latencyPredictor.enabled }}
```
The predictor is running as a sidecar and will only be contacted by the EPP via localhost, so we should not need to expose it here. In other words, I don't expect any changes to the service YAML.
There are some metrics that the sidecar emits to the EPP, and in order to read those from the metrics port on the service we need to expose the port (@kaushikmitr is using this for a dashboard).
The EPP can read the metrics via localhost; why does it need to go through the service for that?
```go
	}
}

if p.MetricMapping.TotalRunningRequests != nil {
```
Ideally, adding scraping for this metric should have been done in a separate PR.
| "sigs.k8s.io/gateway-api-inference-extension/pkg/epp/handlers" | ||
| "sigs.k8s.io/gateway-api-inference-extension/pkg/epp/requestcontrol" | ||
| "sigs.k8s.io/gateway-api-inference-extension/pkg/epp/saturationdetector" | ||
| latencypredictor "sigs.k8s.io/gateway-api-inference-extension/sidecars/latencypredictorasync" |
ditto, all predictor-related logic should be confined to its plugin; if this is not possible now, then this tells me we have a missing extension point in our framework.
I don't think it's being used at all. See: #1839 (comment)
@kaushikmitr @BenjaminBraunDev this is predicted latency, not SLO, right? If so, please use …
This PR is stage 3/3 for adding the latency prediction and SLO-Aware Routing functionality to EPP.
New Features:
- New `-enable-latency-predictor` flag in the EPP args to inform it that the sidecars are present and to register the SLO routing plugins.
- New request headers for per-request SLO targets (`x-slo-ttft-ms` and `x-slo-tpot-ms`) and a boolean for whether to use the SLO routing scheduling profile with SLO scoring (`x-prediction-based-scheduling`). If false, the default profile is used, and the request is still tracked and trained on for future requests. An illustrative set of these headers is shown below.
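For illustration, a request opting into SLO-aware scheduling would carry headers along these lines. The header names are from this PR; the values are examples only:

```yaml
# Illustrative request headers; values are examples, not defaults.
x-slo-ttft-ms: "500"                   # target time-to-first-token, in milliseconds
x-slo-tpot-ms: "50"                    # target time-per-output-token, in milliseconds
x-prediction-based-scheduling: "true"  # use the SLO scheduling profile for this request
```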
Plugins

Registers and deploys the plugins added in #1849 via scheduling profiles:
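Pieced together from the Helm template hunks reviewed above, the wiring looks roughly like this. The `slo` profile name is assumed for illustration; the exact schema should be checked against the chart:

```yaml
# Sketch: registering the plugins and wiring them into scheduling profiles.
plugins:
- type: slo-aware-routing
- type: slo-aware-profile-handler
- type: max-score-picker
schedulingProfiles:
- name: default
  plugins:
  - pluginRef: slo-aware-routing
    weight: 0                    # weight 0: gathers latency data without influencing routing
- name: slo                      # assumed name; used when x-prediction-based-scheduling is true
  plugins:
  - pluginRef: slo-aware-routing # scores endpoints by predicted latency against the SLO headers
```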
PodMetrics
Adds (back) the `totalRunningRequestsMetric` Prometheus metric from vLLM, which was removed in the past for being unused, but is now used as a feature of our latency prediction model.

Guide
Added a guide for how to deploy IGW with SLO-Aware Routing in `site-src/guides/slo-aware-routing.md`
Fixes #1323
Does this PR introduce a user-facing change?: