-
Notifications
You must be signed in to change notification settings - Fork 520
EP-10625: Invalid Route Replacement #11227
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
EP-10625: Invalid Route Replacement #11227
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Pull Request Overview
This pull request introduces a design proposal for implementing invalid route replacement to prevent misrouted requests and accidental exposure of secure routes.
- Introduces the invalid route replacement feature with clear scenarios and motivations.
- Details configuration options and implementation details for the control plane, controllers, monitoring, and telemetry.
- Provides a comprehensive test plan and discusses alternatives and open questions.
design/10625.md
Outdated
- Invalid response code - The HTTP code returned to the end user hitting an invalid route | ||
- Default: 404 | ||
- TODO: Making this code configurable may break compliance with the current k8s gateway API Spec (See "K8s Gateway API Spec" section above) | ||
- Enable/Disable feature |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
i think the alternative is don't update envoy at all vs disable route replacement and send the bad config?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
maybe remove disable, pending CS feedback
design/10625.md
Outdated
|
||
### Monitoring & Telemetry | ||
|
||
#### Metrics |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
100% need metrics and we should make sure these are well documented in our production recommendations
design/10625.md
Outdated
|
||
#### Metrics | ||
When an invalid route is replaced, we should increment a metric (Gauge) so that this can be piped into a monitoring dashboard. | ||
TODO: Should we have CR-level labels on this metric, or does that result in problems with high cardinality? Would help identify problematic resources easily, if it's feasible. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
good question re-label and cardinality; if we don't do it as labels, maybe we can do it in the admin page? and expose it via check
command in our future CLI?
design/10625.md
Outdated
|
||
#### Logs | ||
When an invalid route is replaced, it should be logged. | ||
TODO: Confirm we're OK with this, could end up being verbose in larger environments. General problem with translation errors, particularly if we consider this a 'warning'. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
if this is disabled, and we instead delay sending route updates to envoy, this should be Error
design/10625.md
Outdated
When handling the translation logic for HTTPRoutes, any configuration which would make the route invalid - previously resulting in the route being ellided from the resulting configuration sent to Envoy - will instead result in a direct response route being configured, with the settings above. | ||
|
||
Examples of things that could result in the route being invalid: | ||
- Missing destination for route target |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think the details here will be extremely important.
A missing route target, meaning an invalid BackendRef I assume, has well-defined behavior in the upstream Gateway API spec.
Do we want the kgw route replacement feature to supersede behavior in the spec?
design/10625.md
Outdated
|
||
## Background | ||
|
||
Currently, when a user adds an invalid route - for example, a route to an upstream which does not currently exist - we ellide that route from the configuration sent to Envoy. In many scenarios, this results in users making requests that would otherwise match that route, to instead receive a "No Healthy Upstream" message along with a HTTP 404. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we need to define and document the current behavior for the following questions:
- what is an invalid route from a Gateway API perspective?
- what is an invalid route from a kgw policy perspective?
- are there any scenarios other than these 2?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
An Invalid route is:
- A route we know for sure Envoy would Nack.
- A route, which may otherwise be valid, but which has an invalid policy attached
- Which subset of total policies is still TBD. Any policy that's security-based or request-impacting.
For routes which are missing a backend:
- We are not using the route replacement as outlined in this doc
- We are leaning on Envoy's built-in handling of missing cluster(s).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
a policy can return a warning or an error. if a route only has errors we update status but let it go through to envoy.
if a route has errors we handle as described below
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
a policy can return a warning or an error. if a route only has errors we update status but let it go through to envoy.
Expanded on the error handling for policies in the last couple of commits. Right now, any errors encountered during IR construction, e.g. referential errors from attachment issues or structural issues highlighted by xDS validation, are treated the same. We can explore adding sentinel errors that delineate between a warning and terminal errors, but that doesn't seem immediately necessary.
We are leaning on Envoy's built-in handling of missing cluster(s).
I'm less familiar with this mechanism. Any documentation?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We can explore adding sentinel errors that delineate between a warning and terminal errors, but that doesn't seem immediately necessary.
Agreed that it is not immediately necessary, but for any extensions built around the route replacement feature, this may become important.
We are leaning on Envoy's built-in handling of missing cluster(s).
I'm less familiar with this mechanism. Any documentation?
That is referring to the clusterNotFound functionality to satisfy upstream Gateway API standards on response codes etc:
kgateway/internal/kgateway/translator/irtranslator/route.go
Lines 428 to 433 in e4cc62a
action := outRoute.GetRoute() | |
if action == nil { | |
action = &envoy_config_route_v3.RouteAction{ | |
ClusterNotFoundResponseCode: envoy_config_route_v3.RouteAction_INTERNAL_SERVER_ERROR, | |
} | |
} |
Lastly, given that a backend not found results in this 500 and NOT route replacement, I believe the opening sentence is not correct --
Currently, when a user adds an invalid route - for example, a route to an upstream which does not currently exist - we ellide that route from the configuration sent to Envoy.
We do not elide the route, but rely on the aforementioned behavior to return a 500
design/10625.md
Outdated
|
||
### Configuration | ||
|
||
The following settings need to be user-configurable: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
- have each policy call envoy-validate during translation to IR
- if envoy validate fails, we set a EnvoyValidateFailed condition on the status
- we only call envoy mode validate if the condition is missing for the current generation
- during route translation we call envoy validate only to the route parts (not the per filter config) and LRU cache it.
- policies can return a special error of type Warning. if route errors are all warnings, it will not trigger route replacement, just status message.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
if envoy validate fails, we set a EnvoyValidateFailed condition on the status
I'm not sure this is something we can do with our approach to policy status as we're using v2alpha1 GW API policy types internally. Introducing another reason would be inconsistent with how policy errors are handled in the route translator today.
Signed-off-by: timflannagan <[email protected]>
AI typo fixes Co-authored-by: Copilot <[email protected]> Signed-off-by: Shane O'Donnell <[email protected]>
Signed-off-by: timflannagan <[email protected]>
Signed-off-by: timflannagan <[email protected]>
Signed-off-by: timflannagan <[email protected]>
1026eec
to
ff86f89
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looking mostly solid to me.
My only real concern is with treating all errors as the same within the system.
design/10625.md
Outdated
|
||
## Background | ||
|
||
Currently, when a user adds an invalid route - for example, a route to an upstream which does not currently exist - we ellide that route from the configuration sent to Envoy. In many scenarios, this results in users making requests that would otherwise match that route, to instead receive a "No Healthy Upstream" message along with a HTTP 404. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We can explore adding sentinel errors that delineate between a warning and terminal errors, but that doesn't seem immediately necessary.
Agreed that it is not immediately necessary, but for any extensions built around the route replacement feature, this may become important.
We are leaning on Envoy's built-in handling of missing cluster(s).
I'm less familiar with this mechanism. Any documentation?
That is referring to the clusterNotFound functionality to satisfy upstream Gateway API standards on response codes etc:
kgateway/internal/kgateway/translator/irtranslator/route.go
Lines 428 to 433 in e4cc62a
action := outRoute.GetRoute() | |
if action == nil { | |
action = &envoy_config_route_v3.RouteAction{ | |
ClusterNotFoundResponseCode: envoy_config_route_v3.RouteAction_INTERNAL_SERVER_ERROR, | |
} | |
} |
Lastly, given that a backend not found results in this 500 and NOT route replacement, I believe the opening sentence is not correct --
Currently, when a user adds an invalid route - for example, a route to an upstream which does not currently exist - we ellide that route from the configuration sent to Envoy.
We do not elide the route, but rely on the aforementioned behavior to return a 500
design/10625.md
Outdated
### Goals | ||
|
||
- Implement invalid route replacement, for the following scenarios: | ||
- Configured target destination does not exist |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Regarding my earlier comment, is this actually a scenario we want route replacement to handle?
IIUC this scenario is explicitly handled already?
design/10625.md
Outdated
|
||
- Replace any `nil` route with a direct-response route when the feature gate is enabled | ||
- Consolidate all error handling at the IR level; plugins no longer bubble errors via `ApplyForX` methods | ||
- Referential errors for policy, e.g. non-existent backing GatewayExtension provider, should be treated the same as structural errors that are highlighted by xDS validation |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Mentioned above, but this means that "errors" that can be fixed via "eventual consistency" can not be delineated from other errors. This will become a challenge for extensions that rely on this behavior.
Signed-off-by: timflannagan <[email protected]>
Signed-off-by: timflannagan <[email protected]>
Signed-off-by: timflannagan <[email protected]>
Signed-off-by: timflannagan <[email protected]>
Signed-off-by: timflannagan <[email protected]>
Signed-off-by: timflannagan <[email protected]>
Description
Enhancement proposal to bring invalid route replacement to kgateway. This is a feature the 1.x project previously had. Although not natively supported by the k8s Gateway API spec currently, it has proven to be an invaluable feature for many gateway users over the years, and has prevented production outages and security incidents. This feature results in improved UX/QoL for gateway admins and route creators, as well as improved guarantees for route atomicity - important for multi-tenant or shared-gateway environments.
Design to address route replacement feature identified in #10625.
Change Type
Changelog
Additional Notes