Enhancement Proposal - introduce comprehensive load testing framework for KGateway #11597
Conversation
…KGateway Signed-off-by: MayorFaj <[email protected]>
Pull Request Overview
A design proposal introducing a comprehensive load testing framework for KGateway, detailing core test scenarios, CI/CD integration, and test planning.
- Defines Attached Routes, Route Probe, and Route Change test scenarios based on gateway-api-bench patterns
- Describes integration into existing Ginkgo-based e2e tests and Makefile targets for local and CI usage
- Presents alternatives, open questions for thresholds, and technical implementation considerations
Comments suppressed due to low confidence (3)
design/11210.md:116
- [nitpick] Consider adding an explicit hyperlink to the gateway-api-bench repository (e.g., https://github.com/kubernetes-sigs/gateway-api-bench) so readers can quickly locate the benchmark source.
* **Integrated Benchmark Tests**: Run gateway-api-bench benchmarks as part of KGateway's test suite
design/11210.md:134
- [nitpick] It may help maintainers if the Go snippet includes the import path for `load_testing` (e.g., `import "test/kubernetes/e2e/features/load_testing"`) so they know where `NewTestingSuite` is defined.
func KubeGatewaySuiteRunner() e2e.SuiteRunner {
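For illustration, a minimal sketch of the completed snippet the nitpick asks for; the module path, the `NewSuiteRunner`/`Register` API, and the `load_testing` package layout are assumptions based on the proposal's snippet, not final code:

```go
package tests

import (
	"github.com/kgateway-dev/kgateway/v2/test/kubernetes/e2e"
	"github.com/kgateway-dev/kgateway/v2/test/kubernetes/e2e/features/load_testing"
)

// KubeGatewaySuiteRunner registers the load testing suite alongside the
// existing feature suites. The load_testing import path is the one the
// nitpick asks to spell out (hypothetical until the package lands).
func KubeGatewaySuiteRunner() e2e.SuiteRunner {
	runner := e2e.NewSuiteRunner(false)
	runner.Register("LoadTesting", load_testing.NewTestingSuite)
	return runner
}
```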
design/11210.md:188
- [nitpick] The decision label for Alternative 2 is unbolded. For consistency, consider bolding both statuses (e.g., `**Decision**: **REJECTED**`) or matching the formatting.
* **Decision**: **PROPOSED** - Leverages proven patterns within existing infrastructure
@howardjohn will have good inputs for perf testing
### CI/CD Integration Strategy

1. **Pipeline Integration**:
   * Should load tests run in CI for every PR or only on specific triggers?
We should only run the tests on 1) releases, 2) nightlies.
1. **Framework Architecture**:
   * Should we implement a simulation framework similar to gateway-api-bench patterns?
   * What's the preferred approach: bulk operations vs. batched operations?
Let's define bulk vs. batched operations here (I'm guessing it's referring to applying all HTTPRoutes at once vs. applying groups of routes in batches?).
I would use batching (i.e., apply 100 routes at a time) for more fine-grained measurements and observability, as sketched below.
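A rough sketch of what that batched apply could look like; `applyRoute` is a hypothetical placeholder for whatever client wrapper the framework ends up using:

```go
package load_testing

import (
	"context"
	"time"

	gwv1 "sigs.k8s.io/gateway-api/apis/v1"
)

const batchSize = 100

// applyInBatches applies routes in fixed-size batches and returns one
// duration per batch, giving the per-100-routes measurements suggested above.
func applyInBatches(
	ctx context.Context,
	routes []gwv1.HTTPRoute,
	applyRoute func(context.Context, *gwv1.HTTPRoute) error,
) ([]time.Duration, error) {
	var perBatch []time.Duration
	for start := 0; start < len(routes); start += batchSize {
		end := min(start+batchSize, len(routes))
		began := time.Now()
		for i := start; i < end; i++ {
			if err := applyRoute(ctx, &routes[i]); err != nil {
				return perBatch, err
			}
		}
		perBatch = append(perBatch, time.Since(began)) // one sample per batch
	}
	return perBatch, nil
}
```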
## Alternatives

### Alternative 1: Ginkgo E2E Test Integration (PROPOSED)
So all of our e2e tests are currently using testify instead of Ginkgo: https://github.com/kgateway-dev/kgateway/blob/main/test/kubernetes/e2e/suite.go#L7
What are the benefits of using Ginkgo here instead of testify?
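For comparison, the existing suites follow a testify shape roughly like the sketch below; the suite and test names here are illustrative only:

```go
package load_testing

import (
	"testing"

	"github.com/stretchr/testify/suite"
)

// loadTestingSuite mirrors the testify pattern used by the existing e2e
// suites; the scenario body is a placeholder.
type loadTestingSuite struct {
	suite.Suite
}

func TestLoadTesting(t *testing.T) {
	suite.Run(t, new(loadTestingSuite))
}

func (s *loadTestingSuite) TestAttachedRoutes() {
	// create routes, wait for attachedRoutes to converge, assert thresholds
	s.Require().True(true)
}
```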
#### Unit Tests

* Test simulation framework components (runner, metrics, watcher)
* Mock Gateway and HTTPRoute resource handling
* Metrics collection and reporting logic
* Performance baseline calculations

#### Integration Tests

* End-to-end load test scenarios within KGateway's existing e2e framework
* Integration with KGateway's test installation and cleanup
* Performance regression detection
* CI/CD pipeline integration testing
I think we just want to focus on the performance baselines. We can add unit tests as needed but there already should be tests for the metrics collection, and resource handling.
For the load testing e2e tests, how are they different from the performance baselines? We want to focus on testing the control plane, not the data plane here.
### Test Coverage and Scenarios

1. **Monitoring and Metrics Collection**:
   * What metrics should be collected during load tests?
There's some WIP documentation on the control plane metrics here: kgateway-dev/kgateway.dev#296
We probably want to look at the following (collection sketch below):
- `kgateway_collection_transforms_total`
- `kgateway_collection_transform_duration_seconds`
- `kgateway_collection_resources`
- `kgateway_status_syncer_resources`
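A minimal sketch of collecting those metrics by scraping the controller's Prometheus endpoint; the port-forwarded address is an assumption:

```go
package main

import (
	"fmt"
	"net/http"

	"github.com/prometheus/common/expfmt"
)

// wanted holds the control-plane metric names listed above.
var wanted = map[string]bool{
	"kgateway_collection_transforms_total":           true,
	"kgateway_collection_transform_duration_seconds": true,
	"kgateway_collection_resources":                  true,
	"kgateway_status_syncer_resources":               true,
}

func main() {
	// Assumes the kgateway controller's metrics port has been forwarded locally.
	resp, err := http.Get("http://localhost:9092/metrics")
	if err != nil {
		panic(err)
	}
	defer resp.Body.Close()

	var parser expfmt.TextParser
	families, err := parser.TextToMetricFamilies(resp.Body)
	if err != nil {
		panic(err)
	}
	for name, family := range families {
		if wanted[name] {
			fmt.Println(name, family.GetMetric()) // raw samples; aggregate as needed
		}
	}
}
```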
### Performance Thresholds and Baselines

1. **Setup and Teardown Time Targets**:
   * What should be the target setup time for 1000 routes?
Based on https://github.com/howardjohn/gateway-api-bench, it looks like kgateway had a setup time of 12s. We should measure the setup time, though, and also report if it's taking longer than expected.
1. **Setup and Teardown Time Targets**:
   * What should be the target setup time for 1000 routes?
   * Should teardown time have the same threshold as setup time?
The teardown in the blog for kgateway was about 16s, so I think it's reasonable to aim for under 30s for both the setup and the teardown. If it takes longer, we should trigger a regression alert and fail the test (see the sketch below).
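One way the setup-time check could look, as a sketch: poll the Gateway's `attachedRoutes` status until it matches the expected total, failing past the threshold. The helper name and polling interval are assumptions:

```go
package load_testing

import (
	"context"
	"fmt"
	"time"

	"sigs.k8s.io/controller-runtime/pkg/client"
	gwv1 "sigs.k8s.io/gateway-api/apis/v1"
)

// waitForAttachedRoutes polls the Gateway until the summed attachedRoutes
// across listeners reaches want, failing once the threshold (e.g. 30s) is
// exceeded. The polling interval is an arbitrary choice.
func waitForAttachedRoutes(ctx context.Context, c client.Client, key client.ObjectKey, want int32, threshold time.Duration) (time.Duration, error) {
	start := time.Now()
	for {
		if time.Since(start) > threshold {
			return time.Since(start), fmt.Errorf("setup exceeded %s threshold", threshold)
		}
		var gw gwv1.Gateway
		if err := c.Get(ctx, key, &gw); err != nil {
			return time.Since(start), err
		}
		var total int32
		for _, l := range gw.Status.Listeners {
			total += l.AttachedRoutes
		}
		if total == want {
			return time.Since(start), nil
		}
		time.Sleep(250 * time.Millisecond)
	}
}
```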
   * What should be the target setup time for 1000 routes?
   * Should teardown time have the same threshold as setup time?
   * Should thresholds scale based on route count?
   * What's the acceptable performance degradation for multi-gateway scenarios?
We should also test the multiple gateway scenario, but the setup/teardown threshold may be higher. Let's initially build a single-gateway performance test in the first iteration, then add the multi-gateway setup in the follow-up.
1. **Setup and Teardown Time Targets**:
   * What should be the target setup time for 1000 routes?
   * Should teardown time have the same threshold as setup time?
   * Should thresholds scale based on route count?
Let's follow the blog and use the same scenario:
> To simulate a large cluster, we run with 50 namespaces with 100 routes each (5,000 routes total)
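A sketch of generating that scenario; the namespace/route names and the parent Gateway reference are illustrative:

```go
package load_testing

import (
	"fmt"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	gwv1 "sigs.k8s.io/gateway-api/apis/v1"
)

// generateRoutes builds the blog's "large cluster" scenario: 50 namespaces
// with 100 HTTPRoutes each (5,000 routes total).
func generateRoutes() []gwv1.HTTPRoute {
	routes := make([]gwv1.HTTPRoute, 0, 50*100)
	for ns := 0; ns < 50; ns++ {
		for r := 0; r < 100; r++ {
			routes = append(routes, gwv1.HTTPRoute{
				ObjectMeta: metav1.ObjectMeta{
					Name:      fmt.Sprintf("route-%d", r),
					Namespace: fmt.Sprintf("bench-ns-%d", ns),
				},
				Spec: gwv1.HTTPRouteSpec{
					CommonRouteSpec: gwv1.CommonRouteSpec{
						// "bench-gateway" is a hypothetical shared Gateway.
						ParentRefs: []gwv1.ParentReference{{Name: "bench-gateway"}},
					},
					Hostnames: []gwv1.Hostname{gwv1.Hostname(fmt.Sprintf("r%d.ns%d.example.com", r, ns))},
				},
			})
		}
	}
	return routes
}
```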
If we go with the batch-apply approach, thresholds can be defined per batch of applied routes (per 100 routes, e.g. ≤ 6s per 100 routes; the actual ranges will have to be figured out with testing).
   * What's the acceptable performance degradation for multi-gateway scenarios?

2. **Production Scale Requirements**:
   * What are the target production scales we're optimizing for?
Let's follow the "large cluster" scenario and run with 50 namespaces with 100 routes each (5,000 routes total)
   * What are the target production scales we're optimizing for?
   * Maximum expected routes per gateway in production environments?
   * Should we test different route counts (100, 500, 1000, 5000)?
   * What's the expected number of gateways per cluster in production?
Let's do a single-gateway test and a multi-gateway test with separate thresholds.
1. **Monitoring and Metrics Collection**:
   * What metrics should be collected during load tests?
   * Should we monitor memory/CPU usage of kgateway components?
Yep! In addition to the control plane metrics, we should look at the control plane CPU and memory (a sampling sketch follows). Looking at the data plane CPU/memory usage can be a follow-up.
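A sketch of sampling the control plane's CPU/memory through the Kubernetes metrics API (requires metrics-server in the kind cluster); the label selector is an assumption about how the deployment is labeled:

```go
package load_testing

import (
	"context"
	"fmt"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	metricsv "k8s.io/metrics/pkg/client/clientset/versioned"
)

// samplePodUsage prints CPU/memory usage for the kgateway control plane pods.
func samplePodUsage(ctx context.Context, mc metricsv.Interface, namespace string) error {
	podMetrics, err := mc.MetricsV1beta1().PodMetricses(namespace).List(ctx, metav1.ListOptions{
		LabelSelector: "app.kubernetes.io/name=kgateway", // assumed label
	})
	if err != nil {
		return err
	}
	for _, pm := range podMetrics.Items {
		for _, c := range pm.Containers {
			fmt.Printf("%s/%s cpu=%s mem=%s\n", pm.Name, c.Name, c.Usage.Cpu(), c.Usage.Memory())
		}
	}
	return nil
}
```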
1. **Monitoring and Metrics Collection**:
   * What metrics should be collected during load tests?
   * Should we monitor memory/CPU usage of kgateway components?
   * Do we need to track API server response times?
What do you mean by API server response time here? Response time for a request to go through the data plane?
   * What metrics should be collected during load tests?
   * Should we monitor memory/CPU usage of kgateway components?
   * Do we need to track API server response times?
   * Should we measure end-to-end latency (creation to traffic-ready)?
This is a good question: do we want to only look at control plane metrics, or track the full latency from HTTPRoute creation to the first successful probe (sketched below)? @jmunozro has some experience here and might know.
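If we do track it, the measurement could look roughly like this sketch; the gateway URL and host header are assumptions about the test environment:

```go
package load_testing

import (
	"context"
	"net/http"
	"time"

	"sigs.k8s.io/controller-runtime/pkg/client"
	gwv1 "sigs.k8s.io/gateway-api/apis/v1"
)

// timeToFirstSuccess creates the route, then probes through the gateway until
// the first 200 response, returning the elapsed creation-to-traffic-ready
// latency.
func timeToFirstSuccess(ctx context.Context, c client.Client, route *gwv1.HTTPRoute, url, host string) (time.Duration, error) {
	start := time.Now()
	if err := c.Create(ctx, route); err != nil {
		return 0, err
	}
	for ctx.Err() == nil {
		req, err := http.NewRequestWithContext(ctx, http.MethodGet, url, nil)
		if err != nil {
			return 0, err
		}
		req.Host = host // route by Host header through the shared gateway address
		if resp, err := http.DefaultClient.Do(req); err == nil {
			resp.Body.Close()
			if resp.StatusCode == http.StatusOK {
				return time.Since(start), nil
			}
		}
		time.Sleep(100 * time.Millisecond)
	}
	return 0, ctx.Err()
}
```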
* Automated regression detection comparing current vs. baseline results
* Performance trend analysis over time
* Alert thresholds for CI/CD failures
Let's add a note that the performance tests will be run on a single-node kind cluster (exact specs TBD; John used a 16-core AMD 9950x CPU with 96GB RAM).
* Ginkgo test that generates continuous HTTP traffic using test helpers
* Apply route configuration changes during traffic
* Monitor request success rates and response times
* **Metrics**: Request success rate, latency distribution, error count during changes
I'm not sure if we want to measure request success rates/latency since we're focusing the performance testing on the control plane
Performance testing is essential for Gateway API implementations as they handle critical traffic routing decisions. Current testing focuses primarily on functional correctness, leaving performance characteristics largely unvalidated. A comprehensive load testing framework will help identify performance bottlenecks, ensure scalability, and prevent regressions in production deployments.

The proposed framework will provide standardized performance testing patterns that can be consistently applied across KGateway development cycles, enabling reliable performance validation and regression detection both locally and in CI/CD pipelines.
Let's add a reference to the benchmarking repo to cite the source and give some background: https://github.com/howardjohn/gateway-api-bench
It would also be good to call out some notes in this background section:
- A single HTTPRoute may be attached to many Gateways simultaneously.
- The Gateway object has a status message with `attachedRoutes`, which stores a count of the total successfully attached route objects.
- Setup time: the time after the last route is created until `attachedRoutes` is updated to the total route count.
- Teardown time: the time after the last route is deleted until the status is reset back to zero.
   * What's the optimal concurrency level to avoid API server overload?
   * Should concurrency scale with cluster size/resources?
   * What are the Kubernetes API server rate limits we need to respect?
   * Minimum cluster requirements for running load tests?
Let's use a single-node kind cluster to help minimize noise. John's blog used a 16-core AMD 9950x CPU with 96GB RAM. We can determine what makes sense in CI with some testing, but we can start with the standard GitHub runner we're using for the nightlies (ubuntu-22.04).
### Technical Implementation

1. **Framework Architecture**:
   * Should we implement a simulation framework similar to gateway-api-bench patterns?
Do you mean using https://github.com/howardjohn/pilot-load to simulate clusters, where no pods are scheduled onto any physical machine?
1. **Framework Architecture**:
   * Should we implement a simulation framework similar to gateway-api-bench patterns?
   * What's the preferred approach: bulk operations vs. batched operations?
   * How many concurrent workers should be used for route creation/deletion?
Let's start with a single worker for route creation, then add concurrent workers as needed.
   * Where should historical performance data be stored?

3. **Result Persistence and Reporting**:
   * How will performance metrics integrate with existing e2e test reporting?
These are great questions, but they might make sense as a set of follow-up tasks. When running the performance tests locally, users should be able to use a Grafana dashboard (something like https://github.com/howardjohn/pilot-load/blob/master/install/dashboard.json) to view the results.
   * Should we use server-side apply or standard create operations?

2. **Baseline Establishment and Regression Detection**:
   * How should initial performance baselines be established?
Let's base the initial baselines on the performance results from John's blog. For the patch applies, we'll need to experiment to find reasonable thresholds.
For the control plane status updates:

| Metric        | Cilium | Envoy Gateway | Istio | Kgateway | Nginx |
| ------------- | ------ | ------------- | ----- | -------- | ----- |
| Setup time    | 1s     | 10s           | 2s    | 17s      | 17s   |
| Teardown time | 27s    | 45s           | 27s   | 28s      | 28s   |
| Total Writes  | 243    | 1484          | 73    | 15       | 128   |

CPU/Memory threshold max:

| Gateway  | Relative CPU Consumption | Relative Memory Usage |
| -------- | ------------------------ | --------------------- |
| Kgateway | 6.6x                     | 6.2x                  |

For the load testing:

| DEST     | QPS    | CONS | DUR | PAYLOAD | THROUGHPUT   | P50     | P90     | P99     |
| -------- | ------ | ---- | --- | ------- | ------------ | ------- | ------- | ------- |
| kgateway | 0      | 1    | 30  | 0       | 36793.86qps  | 0.028ms | 0.030ms | 0.043ms |
| kgateway | 15000  | 1    | 10  | 0       | 14998.28qps  | 0.028ms | 0.036ms | 0.089ms |
| kgateway | 0      | 16   | 10  | 0       | 170469.20qps | 0.085ms | 0.144ms | 0.227ms |
| kgateway | 15000  | 16   | 10  | 0       | 14998.34qps  | 0.091ms | 0.148ms | 0.230ms |
| kgateway | 100000 | 16   | 10  | 0       | 99997.92qps  | 0.046ms | 0.119ms | 0.225ms |
| kgateway | 50000  | 16   | 10  | 4096    | 49998.15qps  | 0.043ms | 0.064ms | 0.148ms |
| kgateway | 0      | 512  | 10  | 0       | 400687.22qps | 1.215ms | 2.254ms | 3.575ms |