Skip to content

*: Respect route replacement mode in TrafficPolicy plugin #11553

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged

Conversation

timflannagan
Copy link
Member

@timflannagan timflannagan commented Jul 1, 2025

Description

This is the initial PR that implements the route replacement EP. These changes focus on proactively catching structurally valid but semantically invalid configurations that Envoy would NACK at runtime.

From the 1.x APIs, the transformation policy type is an excellent example where this type of guardrail is critical. While a policy might be structurally valid (correct YAML, valid fields), it could contain semantically invalid configurations like:

  • Malformed template expressions
  • Template references to non-existent headers or variables
  • Invalid body transformations that would produce malformed JSON/XML

The TP plugin has been updated to perform additional validation checks depending on the route replacement mode that's been configured for the KG installation:

  • PGV validation: Runs in standard mode, performs quick validation of Envoy configuration types performs basic structural and semantic validation. This short circuits expensive xDS validation when basic issues are found, such as invalid header values or malformed template syntax.
  • xDS validation (via Envoy validate mode): Runs in strict mode and controls whether the plugin will convert Envoy types stored in the IR into a partial bootstrap config and passes this to Envoy in validate mode. This detects potential NACKs before they happen at runtime and catches complex semantic issues like invalid transformations.

Whenever either of these validation checks fail, the route translator replaces the invalid route with a 500 direct response action. Previously, errors encountered during attached policy IR construction only influenced the policy status, which led to valid routes with invalid policy attached to them being pushed to the xDS snapshot and risk undefined behavior, fail open scenarios w.r.t security policy, or NACKs from the proxy. Now, the route translator has been updated to respect these policy IR errors, and trigger route replacement, in addition to the existing policy status.

Additionally, several new libraries have been introduced with these changes:

  • Dedicated Envoy bootstrap library. Responsible for providing a generic partial bootstrap builder
  • Abstracting envoy validate mode via an xDS validation library

These abstractions will be used in follow-up work that builds on top of this foundation and performs additional validation checks (e.g. CDS, RDS, etc.) that help prevent potential NACKs from reaching the fleet of managed proxies.

Change Type

/kind feature

Changelog

The TrafficPolicy plugin now respects the route replacement mode setting (`KGW_ROUTE_REPLACEMENT_MODE`). When in strict mode, the plugin performs additional validation to catch invalid configurations before they reach Envoy. Invalid policies that would cause Envoy to NACK at runtime (e.g. malformed templates) will now be replaced with a direct response (HTTP 500) and report clear status conditions. This prevents fail-open scenarios where invalid policies could allow unintended traffic.

Additional Notes

Fixes #11535

- Previously, errors during policy IR construction only affected policy status
- Now these errors also trigger route replacement behavior
- Prevents valid routes with invalid policy from failing open
- Improves security by ensuring invalid policies don't allow unintended traffic

Signed-off-by: timflannagan <[email protected]>
Remove Gloo branding from the codebase

Signed-off-by: timflannagan <[email protected]>
@Copilot Copilot AI review requested due to automatic review settings July 1, 2025 21:05
@github-actions github-actions bot added kind/feature Categorizes issue or PR as related to a new feature. release-note labels Jul 1, 2025
Copy link
Contributor

@Copilot Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull Request Overview

This PR adds strict route replacement validation by integrating a new Envoy bootstrap builder and validator, updating the TrafficPolicy plugin to respect strict mode, and adjusting route translation to replace invalid routes with a 500 direct response.

  • Introduces pkg/xds/bootstrap for constructing partial bootstrap configs.
  • Implements pkg/validator to validate Envoy configs via binary or Docker.
  • Updates TrafficPolicy translation to perform XDS validation in strict mode and replace invalid routes.

Reviewed Changes

Copilot reviewed 22 out of 22 changed files in this pull request and generated 3 comments.

Show a summary per file
File Description
pkg/xds/bootstrap/builder.go Implements ConfigBuilder for partial Envoy bootstrap configs
pkg/validator/validator.go Adds Validator interfaces for Envoy validation
pkg/pluginsdk/ir/iface.go Extends TypedFilterConfigMap with ToAnyMap()
internal/kgateway/translator/irtranslator/route.go Adds direct response body and error accumulation in routes
internal/kgateway/extensions2/plugins/trafficpolicy/validate.go Invokes XDS validation in TrafficPolicy Validate
internal/kgateway/extensions2/plugins/trafficpolicy/traffic_policy_plugin.go Applies strict mode validation based on settings
Comments suppressed due to low confidence (3)

pkg/xds/bootstrap/builder.go:1

  • The file header comment references 'config.go' but this file is 'builder.go'. Please update the comment to match the file name.
// pkg/xds/bootstrap/config.go

pkg/xds/bootstrap/builder.go:2

  • Consider adding unit tests for ConfigBuilder (e.g., Verify that Build() produces a valid partial bootstrap with expected listeners, routes, and clusters).
package bootstrap

pkg/validator/validator.go:1

  • Consider adding unit tests or integration tests for both binaryValidator and dockerValidator to ensure they correctly detect valid and invalid configurations.
package validator

"github.com/kgateway-dev/kgateway/v2/pkg/xds/bootstrap"
)

func (p *TrafficPolicy) Validate(ctx context.Context, v validator.Validator, policy *v1alpha1.TrafficPolicy) error {
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Note: I kept the status optimization outside of this PR. There's some weirdness that I still need to figure out locally, and was leading to policy status being wiped / incorrectly reported on controller restarts.

…e test cases, fix missing GE references

Signed-off-by: timflannagan <[email protected]>
Comment on lines +28 to +30
// TODO(tim): this logic will be refactored in the future to be less brittle,
// easier to read/maintain/etc. but requires additional traffic policy plugin
// refactoring to do properly.
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added a TODO for this code. In it's current form, it's incredibly hard to read and has a high risk of maintenance burden without a proper refactor that will be done in a follow-up.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Note: this test case and the listener-level attachment are no-ops. Follow-up work needs to be done to the route translator to handle higher-level attachment w.r.t. triggering route replacement.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Existing test case moved to the route replacement suite. Related to route replacement where the validateEnvoyRoute function will return an error, and that error triggers the output Envoy route to use a direct response action.

Comment on lines +310 to +311
// skip plugin application if we encountered any errors while constructing
// the policy IR.
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just to highlight this behavior - this mainly has impact on the listener-level config that's disabled by default:

diff --git a/internal/kgateway/translator/gateway/testutils/outputs/route-replacement/strict/invalid-csrf-regex-config-out.yaml b/internal/kgateway/translator/gateway/testutils/outputs/route-replacement/strict/invalid-csrf-regex-config-out.yaml
index 8526acb691..f042e46727 100644
--- a/internal/kgateway/translator/gateway/testutils/outputs/route-replacement/strict/invalid-csrf-regex-config-out.yaml
+++ b/internal/kgateway/translator/gateway/testutils/outputs/route-replacement/strict/invalid-csrf-regex-config-out.yaml
@@ -23,11 +23,6 @@ Listeners:
       typedConfig:
         '@type': type.googleapis.com/envoy.extensions.filters.network.http_connection_manager.v3.HttpConnectionManager
         httpFilters:
-        - name: envoy.filters.http.csrf
-          typedConfig:
-            '@type': type.googleapis.com/envoy.extensions.filters.http.csrf.v3.CsrfPolicy
-            filterEnabled:
-              defaultValue: {}

}
config, err := utils.MessageToAny(v)
if err != nil {
logger.Error("unexpected marshalling error", "error", err)
Copy link
Member Author

@timflannagan timflannagan Jul 8, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Existing error handling that I moved to this package. Open question on whether it's correct.

Comment on lines +158 to +159
out *envoy_config_route_v3.Route,
) error
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Unrelated to these changes, but updated this formatting to be consistent with the other ApplyForX signatures.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Follow-up work will expand this library to be able to perform RDS/CDS validation. Good enough for now though.

Comment on lines -41 to -43
# ATTENTION: when updating to a new major version of Envoy, check if
# universal header validation has been enabled and if so, we expect
# failures in `test/e2e/header_validation_test.go`.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

is this just irrelevant at this point?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think so, but tagged ashish to double check whether it's still relevant.

func(s *settings.Settings) {
s.RouteReplacementMode = settings.RouteReplacementStrict
}),
Entry("Strict Mode - Valid Structure Invalid Template (Runtime Error)",
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

so the strict mode even in translator tests will still call out to envoy binary for mode validate right?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If the envoy binary exists on the file system, yes, otherwise falls back to the docker image for validation.

@timflannagan timflannagan added this pull request to the merge queue Jul 9, 2025
Merged via the queue into kgateway-dev:main with commit 9efc345 Jul 9, 2025
21 checks passed
@timflannagan timflannagan deleted the feat/traffic-pol-validation branch July 9, 2025 18:32
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
kind/feature Categorizes issue or PR as related to a new feature. release-note
Projects
None yet
Development

Successfully merging this pull request may close these issues.

TrafficPolicy respects the route replacement mode
3 participants