From 15a3db2301114b9720c9966439a550c26618dbc7 Mon Sep 17 00:00:00 2001
From: Miciah Masters
Date: Wed, 25 Sep 2024 23:44:52 -0400
Subject: [PATCH] ingress: Add dynamic-config-manager enhancement

* enhancements/ingress/dynamic-config-manager.md: New file.
---
 .../ingress/dynamic-config-manager.md | 469 ++++++++++++++++++
 1 file changed, 469 insertions(+)
 create mode 100644 enhancements/ingress/dynamic-config-manager.md

diff --git a/enhancements/ingress/dynamic-config-manager.md b/enhancements/ingress/dynamic-config-manager.md
new file mode 100644
index 0000000000..0f89f9be76
--- /dev/null
+++ b/enhancements/ingress/dynamic-config-manager.md
@@ -0,0 +1,469 @@
---
title: dynamic-config-manager
authors:
  - "@Miciah"
reviewers:
  - TBD
approvers:
  - TBD
api-approvers:
  - TBD
creation-date: 2024-09-25
last-updated: 2024-09-25
tracking-link:
  - https://issues.redhat.com/browse/RFE-1439
  - https://issues.redhat.com/browse/OCPSTRAT-525
  - https://issues.redhat.com/browse/NE-879
  - https://issues.redhat.com/browse/OCPSTRAT-422
  - https://issues.redhat.com/browse/NE-870
see-also:
replaces:
superseded-by:
---

# OpenShift Router Dynamic Config Manager

## Summary

OpenShift 4.18 enables Dynamic Config Manager (DCM) with 1 pre-allocated
server per backend, without blueprint routes, and without any configuration
options. The goal is to deliver a minimum viable product on which to iterate.
This MVP provides marginal value by avoiding reloading HAProxy for a single
scale-out event, or subsequent scale-in event, for a route, at minimal
development and operational cost. More importantly, the MVP gives us CI
signal, enables us to work out defects in DCM, and gives us a starting point
from which to enhance DCM in subsequent OpenShift releases. In the future, we
intend to extend DCM with capabilities such as adding servers dynamically
rather than pre-allocating them, as well as configuring backends and
certificates dynamically, thereby avoiding reloading HAProxy for most or all
updates to routes or their associated endpoints.

## Motivation

OpenShift router has long suffered from issues related to long-lived
connections and frequent reloads. The model for updating the router's
configuration in response to route and endpoints updates is to write out a new
`haproxy.config` file and reload HAProxy, which forks a new process and keeps
the old process around until it has closed all the connections that it had
open at the time of the fork. This fork-and-reload approach has negative
implications for performance, metrics, and balancing. Foremost, when HAProxy
is handling long-lived connections during repeated configuration updates, old
processes accumulate and use exorbitant amounts of memory. Additionally, the
fork-and-reload approach reduces the accuracy of metrics, as metrics are
updated only for the new process. For the same reason, the fork-and-reload
approach reduces the accuracy of HAProxy's load-balancing algorithms.

The solution to these issues is to configure HAProxy dynamically. This is
exactly what the Dynamic Config Manager (DCM) does: DCM configures a running
HAProxy process through a Unix domain socket, so no forking is necessary to
update the configuration and apply the changes.
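For illustration, the kind of interaction DCM performs can be reproduced by
hand with HAProxy's runtime API. This is a minimal sketch, assuming a router
pod with an admin socket at `/var/lib/haproxy/run/haproxy.sock`; the backend
and server names are illustrative, not the router's actual naming scheme:

```shell
# Minimal sketch of HAProxy's runtime API (socket path and backend/server
# names are illustrative and may not match the router's actual layout).
SOCK=/var/lib/haproxy/run/haproxy.sock

# List the backends and servers known to the running process.
echo "show servers state" | socat stdio "$SOCK"

# Take a server out of rotation and put it back, with no fork and no reload.
echo "set server be_http:demo:hello/pod:hello-1 state maint" | socat stdio "$SOCK"
echo "set server be_http:demo:hello/pod:hello-1 state ready" | socat stdio "$SOCK"
```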
However, DCM requires some work before it can be enabled. DCM was first
implemented in OpenShift 3.11 and was never enabled by default in OpenShift 3
or even allowed as a supported option in OpenShift 4. This lack of exposure
means that DCM now needs extensive testing before we can be confident that it
is safe for production environments. In addition, DCM in its present form is
difficult to configure, and there are many cases in which it cannot handle
configuration updates. When dynamic configuration fails, DCM falls back to the
old fork-and-reload procedure. Finally, DCM was implemented for HAProxy 1.8
and does not take full advantage of the capabilities of newer HAProxy
versions. In sum, substantial work is required to develop and verify DCM in
order to make it viable.

### User Stories

_As a cluster administrator, I want OpenShift router not to use excessive
memory when the HAProxy configuration changes and the HAProxy process has many
long-lived connections._

Without the Dynamic Config Manager, OpenShift router suffers from a well-known
performance issue when the following two conditions are met:

* A router pod reloads its configuration frequently because of route or endpoints updates.
* The same router pod handles long-lived connections.

Reloading the HAProxy configuration involves forking a new process with the
new configuration. The old process keeps running until all connections that
were open at the time of the configuration reload have terminated. In the
case of long-lived connections, this means that the old process can remain
for a long period of time. If the router reloads its configuration frequently
due to route changes, old processes can accumulate and use a large amount of
memory, on the order of hundreds of megabytes or multiple gigabytes per
process, ultimately causing out-of-memory errors on the node host.

DCM addresses this issue by reducing the need to reload the configuration.
Instead, DCM configures HAProxy using its Unix domain control socket, which
does not require forking a new HAProxy process.

The degree to which DCM mitigates this issue depends on the nature of the
configuration changes: Some changes still require a configuration reload, but
the majority of changes can be performed through HAProxy's control socket.

Initially, DCM will allow scale-out and scale-in of 1 server (pod endpoint)
per backend (route); scaling out by more than 1 server will still require a
fork and reload. In future iterations of the feature, DCM can be enhanced to
enable scale-out and scale-in of arbitrarily many servers as pods are created
and deleted, scale-out and scale-in of backends as routes are created and
deleted, and changes to certificates and other route options, all without
requiring a fork and reload.
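Concretely, with a single idle pre-allocated slot per backend, the first
scale-out and the subsequent scale-in can be satisfied entirely over the
control socket. A sketch of the socket traffic involved, with backend name,
slot name, and address all illustrative:

```shell
SOCK=/var/lib/haproxy/run/haproxy.sock

# Scale-out: point the idle pre-allocated slot at the new endpoint and
# enable it (backend, slot, and address are illustrative).
echo "set server be_http:demo:hello/_dynamic-pod-1 addr 10.128.2.15 port 8080" | socat stdio "$SOCK"
echo "set server be_http:demo:hello/_dynamic-pod-1 state ready" | socat stdio "$SOCK"

# Scale-in: take the slot back out of rotation.
echo "set server be_http:demo:hello/_dynamic-pod-1 state maint" | socat stdio "$SOCK"
```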
_As a cluster administrator, I want metrics to be accurate when routes and
endpoints are updated._

OpenShift router does take care to preserve metrics values across reloads in
order to avoid resetting counters to zero when the new process starts.
However, while values from the old process are preserved, the old process
cannot update metrics after the reload. For example, the metrics reflect the
total number of bytes that the old process had transferred at the point of
the configuration reload, but if the old process continues transferring data
after the new process starts, the metrics do not reflect the count of any
additional bytes transferred.

Dynamic Config Manager addresses this issue, again by reducing the need to
reload the configuration and fork a new process. Again, the degree to which
DCM mitigates this issue depends on the nature of the configuration changes.

_As a project administrator, I want HAProxy to balance traffic evenly over old
and new pods when I scale my application up._

HAProxy tracks the number of connections for each of HAProxy's backend
servers. HAProxy uses this information to balance traffic evenly using the
"roundrobin", "random", and "leastconn" balancing algorithms. However,
following a reload, the new HAProxy process does not have data on how many
connections the old processes have to each backend server. This lack of data
can cause uneven traffic load over backend servers from the aggregate set of
HAProxy processes because the new process balances over the set of backend
servers without any coordination with the old processes.

Dynamic Config Manager addresses this issue, again by eliminating the need to
reload the configuration for endpoints changes. Because servers can be added
and removed through HAProxy's control socket, DCM is able to eliminate the
problem of uneven load resulting from adding or removing endpoints.

Note that DCM is not able to prevent imbalance in the event of an endpoints
update combined with an update that requires a configuration reload. DCM also
cannot prevent imbalance in the case of many router pod replicas with
relatively few connections. For example, multiple router pod replicas each
receiving a single request for the same route can all choose the same backend
server; router pod replicas do not coordinate with each other.

### Goals

- DCM is enabled by default on OpenShift 4.18 clusters.
- Updating a route's endpoints does not trigger an HAProxy configuration reload.
  - Initially, this might be true only for a single scale-out event, or subsequent scale-in event, per route.
- OpenShift router uses at most marginally more memory and CPU with DCM than without.
- If DCM cannot handle an update, the router forks and reloads, the same as before.
- DCM does not cause regressions in throughput, latency, balancing, or metrics.

### Non-Goals

- DCM cannot handle *all* route or endpoints configuration changes.
  - Initially, adding and removing backends will still require a reload.
  - Initially, changing certificates or annotations will still require a reload.
  - Initially, multiple successive scale-out events will still require a reload.
- DCM cannot handle *router* configuration changes (that is, *global* options).
  - DCM does not handle the `timeout`, `maxconn`, `nbthread`, or `log` options.
  - Updates to the router configuration still generally require a pod restart.
- DCM does not coordinate among router pods.
  - Traffic can still be unbalanced if multiple router pods use the same server.

## Proposal

OpenShift router's Dynamic Config Manager (DCM) was implemented in OpenShift
Enterprise 3.11 with HAProxy 1.8. Although DCM was not previously enabled in
OpenShift Container Platform 4, the implementation was never removed from the
source code. However, it has not been actively developed by engineering or
tested by QA or in CI in OpenShift 4. Therefore, this enhancement proposes
the following steps:

1. Manually verify that the router functions and passes E2E tests with DCM enabled.
2. Add a tech-preview featuregate in openshift/api for DCM.
3. Update cluster-ingress-operator to enable DCM if the featuregate is enabled.
4. Re-enable old E2E tests in openshift/origin for DCM.
5. Possibly remove outdated logic from DCM.
6. Run E2E, payload, and performance tests with DCM enabled.
7. Allow DCM to soak as tech preview for at least 1 OCP release.
8. Possibly add new logic to DCM to exploit new HAProxy 2.y features.
9. Remove the featuregate and mark DCM as GA.

Steps 5 and 8 are stretch goals and may be done in later OpenShift releases.
The other steps are hard requirements, and we intend to complete them in
OpenShift 4.18.
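During the tech-preview phase (steps 2, 3, and 7), a cluster administrator
would opt in through the standard feature-set mechanism. A sketch follows;
note that `TechPreviewNoUpgrade` cannot be unset once applied, and the exact
name of the DCM featuregate has not been decided, so the grep pattern below is
illustrative:

```shell
# Opt in to tech-preview features cluster-wide. Caution: the
# TechPreviewNoUpgrade feature set cannot be unset once applied.
oc patch featuregate/cluster --type=merge \
  --patch='{"spec":{"featureSet":"TechPreviewNoUpgrade"}}'

# Verify that the DCM gate is among the enabled gates (the gate's exact
# name is still to be decided, so this pattern is illustrative).
oc get featuregate/cluster -o yaml | grep -i dynamicconfig
```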
### Workflow Description

Initially, enabling Dynamic Config Manager (DCM) will require enabling a
featuregate. Ultimately, DCM should be enabled by default. This configuration
should be completely transparent to the end user. The only effect should be
that OpenShift router uses less CPU and memory, balances traffic more evenly
after endpoints updates, and tracks metrics more accurately after route and
endpoints updates.

#### Variation and form factor considerations [optional]

DCM should function the same on standalone OpenShift, MicroShift, and
HyperShift.

### API Extensions

DCM does not require any API extensions beyond the featuregate. However, it
does have an unsupported config override in the IngressController API in case
a critical issue is discovered after DCM has been enabled by default.

### Implementation Details/Notes/Constraints [optional]

DCM was implemented in OpenShift 3. This initial implementation requires
pre-allocating both backends and servers in order to accommodate routes and
endpoints that are created after HAProxy starts.

Backends are pre-allocated based on *blueprint routes*, which are route
objects that the cluster-admin must create in a designated namespace. A
blueprint route must specify the TLS termination type and the set of
annotations. Backends and servers are pre-allocated when HAProxy starts.

If DCM cannot configure something dynamically, it falls back to the old
fork-and-reload procedure. In particular, the following conditions require
falling back to fork and reload:

- More routes are created than the number of pre-allocated backends.
- More endpoints are created for a route than the number of pre-allocated server slots for that specific route's backend.
- A route is created that does not match the TLS termination type and annotations of any blueprint route.

In OpenShift 3, configuring blueprint routes was left up to the cluster-admin.
As such, blueprint routes constitute a user-facing API. Additionally, the
cluster-admin could configure the number of server slots to pre-allocate for
each backend.

Note that pre-allocated server slots and backends take up memory. In a
cluster with many routes, pre-allocating multiple server slots for each
backend can consume a considerable amount of memory. See
https://gist.github.com/frobware/2b527ce3f040797909eff482a4776e0b for an
analysis of the potential memory impact for different choices of balancing
algorithm, number of threads, number of backends, and number of pre-allocated
server slots per backend.

To avoid operational overhead, avoid adding new APIs, and minimize the cost
of enabling DCM, we will initially omit blueprint routes and pre-allocate
only 1 server slot per backend. This makes DCM an implementation detail;
ideally it is completely invisible to the cluster-admin, except that the
router will fork fewer processes and have more accurate metrics and
balancing.
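To make the pre-allocation concrete, the following sketch shows how one might
inspect the rendered configuration inside a router pod. The config path is
the router image's conventional location; the backend name, slot name, and
addresses shown are illustrative, not a commitment to a particular template:

```shell
# Look for a pre-allocated dynamic server slot in a router pod's rendered
# configuration (backend name, slot naming, and addresses are illustrative).
oc -n openshift-ingress rsh deployment/router-default \
  grep -A4 'backend be_http:demo:hello' /var/lib/haproxy/conf/haproxy.config

# Expected shape: one extra, initially disabled server slot per backend
# that DCM can retarget and enable through the control socket, e.g.:
#   backend be_http:demo:hello
#     server pod:hello-1:hello:8080:10.128.2.14:8080 10.128.2.14:8080 check
#     server _dynamic-pod-1 172.4.0.4:8765 check disabled
```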
Later, we can enhance DCM by using newer capabilities in HAProxy 2.y to add
arbitrarily many servers dynamically, add backends dynamically, and update
certificates dynamically.

### Risks and Mitigations

Dynamic Config Manager has some known defects:

- [OCPBUGS-7466 No prometheus haproxy metrics present for route created by dynamicConfigManager](https://issues.redhat.com/browse/OCPBUGS-7466)
- [OCPBUGSM-20868 Sticky cookies take long to start working if config manager is enabled](https://issues.redhat.com/browse/OCPBUGSM-20868)
- [NE-1815 Fix implementation gaps discovered during the smoke tests](https://issues.redhat.com/browse/NE-1815)

We intend to fix all of these defects before enabling DCM by default as GA.
However, DCM could have additional, unknown defects. We will mitigate this
risk through E2E tests, payload tests, and working with partners to test DCM.

### Drawbacks

#### Memory overhead

DCM has some up-front memory overhead. For this reason, DCM will initially be
enabled with a minimum configuration of 1 pre-allocated server slot per
backend and without blueprint routes. Eventually, DCM will be enhanced to use
newer HAProxy capabilities to perform more configuration without
pre-allocating servers or backends, thus preventing reloads in more cases
without adding any additional up-front memory overhead.

#### Redundancy with respect to Istio/Envoy

DCM is similar in concept to [Envoy
xDS](https://www.envoyproxy.io/docs/envoy/latest/intro/arch_overview/operations/dynamic_configuration).
Both enable a control plane (openshift-router or Istio) to configure a data
plane (HAProxy or Envoy, respectively) without restarting processes.
Investing in DCM could be considered a redundant effort when Istio/Envoy
already exists and avoids the problem. However, many customers depend on
OpenShift router for its performance and reliability and will continue using
it indefinitely. For these customers, DCM could improve OpenShift router's
performance and reliability without forcing a change in APIs (from Route API
to Gateway API) or proxies (from HAProxy to Envoy).

## Design Details

### Open Questions [optional]

#### Can we use HAProxy's control socket to add backends and certificates dynamically?

Answer: **?**.

We need to determine which kinds of configuration changes cannot be
implemented in DCM and either document these limitations or work with HAProxy
upstream to enhance HAProxy's management capabilities and remove these
limitations.
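As a starting point for this investigation: recent HAProxy 2.y releases do
expose runtime commands for adding and deleting servers and for swapping
certificates, though not, to our knowledge, for adding backends. A sketch of
the relevant commands, with names, paths, and addresses all illustrative
(`add server` was experimental in early 2.y releases):

```shell
SOCK=/var/lib/haproxy/run/haproxy.sock

# Add and delete a server at run time without a pre-allocated slot
# (experimental in early HAProxy 2.y releases; names are illustrative).
echo "add server be_http:demo:hello/pod:hello-2 10.128.2.16:8080" | socat stdio "$SOCK"
echo "del server be_http:demo:hello/pod:hello-2" | socat stdio "$SOCK"

# Replace a loaded certificate in place (HAProxy 2.2 and later).
echo -e "set ssl cert /etc/pki/tls/certs/demo.pem <<\n$(cat new-demo.pem)\n" | socat stdio "$SOCK"
echo "commit ssl cert /etc/pki/tls/certs/demo.pem" | socat stdio "$SOCK"
```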
#### Can we configure Dynamic Config Manager with 0 pre-allocated servers?

Answer: **Probably no, but needs verification**.

One option could be to enable DCM initially with 0 pre-allocated server slots
per backend. If this worked, it would be useful to avoid reloads for
scale-in, or for scale-out following scale-in. However, we found that DCM
still requires a reload for scale-in of a server that doesn't use a
pre-allocated server slot. (TODO: This needs to be double-checked.)

#### Is the cost of 1 pre-allocated server too high?

Answer: **Probably no, but needs a release note**.

Per https://gist.github.com/frobware/2b527ce3f040797909eff482a4776e0b, the
cost of pre-allocating 1 server slot per backend is negligible for 100
routes, 4 to 9 MB for 1000 routes, and around 40 to 80 MB for 10000 routes.
The cost is highest if the router has many threads and many backends. Most
likely, if the router is configured with many threads and backends, it is
already running on large machines. However, it is important to call out the
potential impact of DCM on memory consumption in a release note.

#### Are any of the known defects blockers?

Answer: **Probably yes**.

Enabling Dynamic Config Manager should not cause any regressions. We need to
fix known issues regarding inaccurate metrics and session stickiness and any
other regressions that we find
(see [Risks and Mitigations](#risks-and-mitigations)).

#### Do we need an override to turn off Dynamic Config Manager?

Answer: **Probably yes**.

Ingress is a critical cluster function, and it is an extremely
performance-sensitive one for some customers. In case we ship DCM enabled and
a customer finds some critical issue, we need an unsupported config override
that support staff can use to turn off DCM temporarily until we fix the
issue.

### Test Plan

Dynamic Config Manager will be enabled initially using a TechPreviewNoUpgrade
featuregate to provide CI signal. We will additionally revisit existing E2E
tests, enable any DCM-related E2E tests that are currently not enabled, and
verify that all enabled router-related E2E tests pass with DCM. As known
defects are fixed and any unknown ones are discovered, we will add additional
tests. Finally, we will work with partners to verify DCM for their use cases.

Expanding test coverage is important both for verifying that DCM is ready for
GA and for enabling us to improve DCM in future releases with confidence that
we are not introducing regressions.

### Graduation Criteria

We will introduce the feature as TechPreviewNoUpgrade in OpenShift 4.18, with
the goal of graduating it to GA in the same release. Further improvements to
DCM will follow in subsequent OpenShift releases.

#### Dev Preview -> Tech Preview

N/A.

#### Tech Preview -> GA

- All known regressions are fixed.
- Payload tests pass with DCM enabled.
- At least 2 partners or customers provide test results on DCM's function.
- Memory overhead is within acceptable limits.
- Limitations (such as which changes still require a reload) are documented.

#### Removing a deprecated feature

N/A.

### Upgrade / Downgrade Strategy

The feature requires no specific considerations for upgrade or downgrade;
cluster-ingress-operator will handle upgrading the router image and
configuring Dynamic Config Manager, using the standard process for a rolling
update of the router deployment.

### Version Skew Strategy

N/A.

### Operational Aspects of API Extensions

N/A. The operation of this feature should be transparent to the end user.

#### Failure Modes

If Dynamic Config Manager cannot handle a configuration change, it falls back
to the old fork-and-reload procedure. Thus, in the worst case, OpenShift
router should behave no worse with DCM than without it.

If an unforeseen defect arises, DCM can be disabled using an unsupported
config override:

```shell
oc -n openshift-ingress-operator patch ingresscontrollers/default --type=merge --patch='{"spec":{"unsupportedConfigOverrides":{"dynamicConfigManager":"false"}}}'
```

#### Support Procedures

Metrics and logs are the same as without Dynamic Config Manager:

- The `reload_seconds`, `reload_failure`, and `write_config_seconds` metrics remain.
- The router pod logs will report reloads as well as any errors from DCM.
- The `dynamicConfigManager` unsupported config override remains available.
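A support engineer can confirm from the router's logs whether DCM is actually
reducing reloads. A sketch, in which the grep patterns (and hence the assumed
log message wording) are illustrative:

```shell
# Count reload events in the router logs; with DCM enabled, this number
# should grow more slowly (the grep patterns are illustrative).
oc -n openshift-ingress logs deployment/router-default | grep -ci 'reload'

# Look for messages from the config manager itself, including fallbacks
# to the fork-and-reload path.
oc -n openshift-ingress logs deployment/router-default | grep -i 'config manager'
```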
## Implementation History

- OpenShift Enterprise 3.11 added DCM and the `ROUTER_HAPROXY_CONFIG_MANAGER` option ([release notes](https://docs.openshift.com/container-platform/3.11/release_notes/ocp_3_11_release_notes.html#ocp-311-haproxy-enhancements), [documentation](https://docs.openshift.com/container-platform/3.11/install_config/router/default_haproxy_router.html#using-the-dynamic-configuration-manager)).
- OpenShift Container Platform 4.9 added the `dynamicConfigManager` config override, default off ([openshift/cluster-ingress-operator@6a8516a](https://github.com/openshift/cluster-ingress-operator/pull/628/commits/6a8516ab247b00b87a5d7b32e20d4cffcefe1c0f)).
- OpenShift Container Platform 4.18 enables DCM by default.

## Alternatives

### Hard-Stop-After Option

OpenShift Container Platform 4.7 introduced the `hard-stop-after` option
([openshift/cluster-ingress-operator@7b7327f](https://github.com/openshift/cluster-ingress-operator/pull/522/commits/7b7327fa5e8a48733549ebe1563afc65a871c527),
[documentation](https://docs.openshift.com/container-platform/4.7/networking/routes/route-configuration.html#nw-route-specific-annotations_route-configuration)).
This option causes OpenShift router to terminate old HAProxy processes after
the specified duration following a reload as a workaround to prevent old
processes from accumulating. This has the critical drawback that terminating
an old process also terminates any connections that it had open.
Additionally, the duration needs to be tuned to find an acceptable balance
between how long connections are allowed to live and how many processes are
permitted to accumulate.

### Reload Interval

The [reload-interval](./haproxy-reload-interval.md) enhancement added an
option to configure the minimum interval between reloads. This can be used
alone or in conjunction with hard-stop-after to limit the accumulation of
HAProxy processes. However, it makes the router slower to respond to route or
endpoints updates, adds operational overhead, and only reduces the
accumulation of processes to a limited degree.
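For reference, both workarounds are configured on the IngressController; the
durations below are illustrative:

```shell
# Terminate old HAProxy processes (and their remaining connections)
# after a fixed grace period following each reload.
oc -n openshift-ingress-operator annotate ingresscontrollers/default \
  ingress.operator.openshift.io/hard-stop-after=30m

# Allow at most one reload per interval.
oc -n openshift-ingress-operator patch ingresscontrollers/default --type=merge \
  --patch='{"spec":{"tuningOptions":{"reloadInterval":"15s"}}}'
```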
### Sharding

The [product documentation advises customers to use sharding](https://docs.openshift.com/container-platform/4.16/scalability_and_performance/optimization/routing-optimization.html)
to avoid having a single router deployment handle too many routes. We also
advise customers to use sharding to separate mission-critical routes from
other routes, or to separate routes with frequent updates from routes that
tend to be involved with long-lived connections. However, configuring
sharding requires advance planning, has high operational overhead, and cannot
solve the problem when the same route has frequent updates and long-lived
connections.

As an example, https://gist.github.com/Miciah/cc308b717a9e8c9b74d3f97393a5827b
demonstrates how to put platform routes and all other routes in separate
shards.

### Alternative proxies, such as Envoy

Envoy implements a gRPC-based protocol called [xDS](https://www.envoyproxy.io/docs/envoy/latest/intro/arch_overview/operations/dynamic_configuration),
which control planes such as Contour and Istio use to configure Envoy
dynamically, similar to the way DCM uses HAProxy's control socket. However,
there is no Route API implementation for Contour or Envoy, and migrating
customers from HAProxy is not practical.

## Infrastructure Needed [optional]

This feature requires little in the way of dedicated infrastructure. Some
performance testing is required, but it should not require any special
hardware.