From 15a3db2301114b9720c9966439a550c26618dbc7 Mon Sep 17 00:00:00 2001
From: Miciah Masters
Date: Wed, 25 Sep 2024 23:44:52 -0400
Subject: [PATCH] ingress: Add dynamic-config-manager enhancement

* enhancements/ingress/dynamic-config-manager.md: New file.
---
 .../ingress/dynamic-config-manager.md | 469 ++++++++++++++++++
 1 file changed, 469 insertions(+)
 create mode 100644 enhancements/ingress/dynamic-config-manager.md

diff --git a/enhancements/ingress/dynamic-config-manager.md b/enhancements/ingress/dynamic-config-manager.md
new file mode 100644
index 0000000000..0f89f9be76
--- /dev/null
+++ b/enhancements/ingress/dynamic-config-manager.md
@@ -0,0 +1,469 @@
---
title: dynamic-config-manager
authors:
  - "@Miciah"
reviewers:
  - TBD
approvers:
  - TBD
api-approvers:
  - TBD
creation-date: 2024-09-25
last-updated: 2024-09-25
tracking-link:
  - https://issues.redhat.com/browse/RFE-1439
  - https://issues.redhat.com/browse/OCPSTRAT-525
  - https://issues.redhat.com/browse/NE-879
  - https://issues.redhat.com/browse/OCPSTRAT-422
  - https://issues.redhat.com/browse/NE-870
see-also:
replaces:
superseded-by:
---

# OpenShift Router Dynamic Config Manager

## Summary

OpenShift 4.18 enables Dynamic Config Manager (DCM) with 1 pre-allocated
server per backend, without blueprint routes, and without any configuration
options. The goal is to deliver a minimum viable product on which to iterate.
This MVP provides marginal value by avoiding reloading HAProxy for a single
scale-out event, or subsequent scale-in event, for a route, at minimal
development and operational cost. More importantly, the MVP gives us CI
signal, enables us to work out defects in DCM, and gives us a starting point
from which to enhance DCM in subsequent OpenShift releases. In the future, we
intend to extend DCM with capabilities such as adding servers dynamically
rather than pre-allocating them, as well as configuring backends and
certificates dynamically, thereby avoiding reloading HAProxy for most or all
updates to routes or their associated endpoints.

## Motivation

OpenShift router has long suffered from issues related to long-lived
connections and frequent reloads. The model for updating the router's
configuration in response to route and endpoints updates is to write out a new
`haproxy.config` file and reload HAProxy, which forks a new process and keeps
the old process around until it has closed all the connections that it had
open at the time of the fork. This fork-and-reload approach has negative
implications for performance, metrics, and balancing. Foremost, when HAProxy
is handling long-lived connections during repeated configuration updates, old
processes accumulate and use exorbitant amounts of memory. Additionally, the
fork-and-reload approach reduces the accuracy of metrics, as metrics are
updated only for the new process. For the same reason, the fork-and-reload
approach reduces the accuracy of HAProxy's load-balancing algorithms.

The solution to these issues is to configure HAProxy dynamically. This is
exactly what the Dynamic Config Manager (DCM) does: DCM configures a running
HAProxy process through a Unix domain socket, so no forking is necessary to
update the configuration and apply the changes.
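For illustration, the kind of interaction DCM performs can be reproduced by
hand with HAProxy's runtime API. This is a minimal sketch, assuming a router
pod with an admin socket at `/var/lib/haproxy/run/haproxy.sock`; the backend
and server names are illustrative, not the router's actual naming scheme:

```shell
# Minimal sketch of HAProxy's runtime API (socket path and backend/server
# names are illustrative and may not match the router's actual layout).
SOCK=/var/lib/haproxy/run/haproxy.sock

# List the backends and servers known to the running process.
echo "show servers state" | socat stdio "$SOCK"

# Take a server out of rotation and put it back, with no fork and no reload.
echo "set server be_http:demo:hello/pod:hello-1 state maint" | socat stdio "$SOCK"
echo "set server be_http:demo:hello/pod:hello-1 state ready" | socat stdio "$SOCK"
```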
However, DCM requires some work before it can be enabled. DCM was first
implemented in OpenShift 3.11 and was never enabled by default in OpenShift 3
or even allowed as a supported option in OpenShift 4. This lack of exposure
means that DCM now needs extensive testing before we can be confident that it
is safe for production environments. In addition, DCM in its present form is
difficult to configure, and there are many cases in which it cannot handle
configuration updates. When dynamic configuration fails, DCM falls back to the
old fork-and-reload procedure. Finally, DCM was implemented for HAProxy 1.8
and does not take full advantage of the capabilities of newer HAProxy
versions. In sum, substantial work is required to develop and verify DCM in
order to make it viable.

### User Stories

_As a cluster administrator, I want OpenShift router not to use excessive
memory when the HAProxy configuration changes and the HAProxy process has many
long-lived connections._

Without the Dynamic Config Manager, OpenShift router suffers from a well-known
performance issue when the following two conditions are met:

* A router pod reloads its configuration frequently because of route or endpoints updates.
* The same router pod handles long-lived connections.

Reloading the HAProxy configuration involves forking a new process with the
new configuration. The old process keeps running until all connections that
were open at the time of the configuration reload have terminated. In the
case of long-lived connections, this means that the old process can remain
for a long period of time. If the router reloads its configuration frequently
due to route changes, old processes can accumulate and use a large amount of
memory, on the order of hundreds of megabytes or multiple gigabytes per
process, ultimately causing out-of-memory errors on the node host.

DCM addresses this issue by reducing the need to reload the configuration.
Instead, DCM configures HAProxy using its Unix domain control socket, which
does not require forking a new HAProxy process.

The degree to which DCM mitigates this issue depends on the nature of the
configuration changes: Some changes still require a configuration reload, but
the majority of changes can be performed through HAProxy's control socket.

Initially, DCM will allow scale-out and scale-in of 1 server (pod endpoint)
per backend (route); scaling out by more than 1 server will still require a
fork and reload. In future iterations of the feature, DCM can be enhanced to
enable scale-out and scale-in of arbitrarily many servers as pods are created
and deleted, scale-out and scale-in of backends as routes are created and
deleted, and changes to certificates and other route options, all without
requiring a fork and reload.
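Concretely, with a single idle pre-allocated slot per backend, the first
scale-out and the subsequent scale-in can be satisfied entirely over the
control socket. A sketch of the socket traffic involved, with backend name,
slot name, and address all illustrative:

```shell
SOCK=/var/lib/haproxy/run/haproxy.sock

# Scale-out: point the idle pre-allocated slot at the new endpoint and
# enable it (backend, slot, and address are illustrative).
echo "set server be_http:demo:hello/_dynamic-pod-1 addr 10.128.2.15 port 8080" | socat stdio "$SOCK"
echo "set server be_http:demo:hello/_dynamic-pod-1 state ready" | socat stdio "$SOCK"

# Scale-in: take the slot back out of rotation.
echo "set server be_http:demo:hello/_dynamic-pod-1 state maint" | socat stdio "$SOCK"
```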
_As a cluster administrator, I want metrics to be accurate when routes and
endpoints are updated._

OpenShift router does take care to preserve metrics values across reloads in
order to avoid resetting counters to zero when the new process starts.
However, while values from the old process are preserved, the old process
cannot update metrics after the reload. For example, the metrics reflect the
total number of bytes that the old process had transferred at the point of
the configuration reload, but if the old process continues transferring data
after the new process starts, the metrics do not reflect the count of any
additional bytes transferred.

Dynamic Config Manager addresses this issue, again by reducing the need to
reload the configuration and fork a new process. Again, the degree to which
DCM mitigates this issue depends on the nature of the configuration changes.

_As a project administrator, I want HAProxy to balance traffic evenly over old
and new pods when I scale my application up._

HAProxy tracks the number of connections for each of HAProxy's backend
servers. HAProxy uses this information to balance traffic evenly using the
"roundrobin", "random", and "leastconn" balancing algorithms. However,
following a reload, the new HAProxy process does not have data on how many
connections the old processes have to each backend server. This lack of data
can cause uneven traffic load over backend servers from the aggregate set of
HAProxy processes because the new process balances over the set of backend
servers without any coordination with the old processes.

Dynamic Config Manager addresses this issue, again by eliminating the need to
reload the configuration for endpoints changes. Because servers can be added
and removed through HAProxy's control socket, DCM is able to eliminate the
problem of uneven load resulting from adding or removing endpoints.

Note that DCM is not able to prevent imbalance in the event of an endpoints
update combined with an update that requires a configuration reload. DCM also
cannot prevent imbalance in the case of many router pod replicas with
relatively few connections. For example, multiple router pod replicas each
receiving a single request for the same route can all choose the same backend
server; router pod replicas do not coordinate with each other.

### Goals

- DCM is enabled by default on OpenShift 4.18 clusters.
- Updating a route's endpoints does not trigger an HAProxy configuration reload.
  - Initially, this might be true only for a single scale-out event, or subsequent scale-in event, per route.
- OpenShift router uses at most marginally more memory and CPU with DCM than without.
- If DCM cannot handle an update, the router forks and reloads, the same as before.
- DCM does not cause regressions in throughput, latency, balancing, or metrics.

### Non-Goals

- DCM cannot handle *all* route or endpoints configuration changes.
  - Initially, adding and removing backends will still require a reload.
  - Initially, changing certificates or annotations will still require a reload.
  - Initially, multiple successive scale-out events will still require a reload.
- DCM cannot handle *router* configuration changes (that is, *global* options).
  - DCM does not handle the `timeout`, `maxconn`, `nbthread`, or `log` options.
  - Updates to the router configuration still generally require a pod restart.
- DCM does not coordinate among router pods.
  - Traffic can still be unbalanced if multiple router pods use the same server.

## Proposal

OpenShift router's Dynamic Config Manager (DCM) was implemented in OpenShift
Enterprise 3.11 with HAProxy 1.8. Although DCM was not previously enabled in
OpenShift Container Platform 4, the implementation was never removed from the
source code. However, it has not been actively developed by engineering or
tested by QA or in CI in OpenShift 4. Therefore, this enhancement proposes
the following steps:

1. Manually verify that the router functions and passes E2E tests with DCM enabled.
2. Add a tech-preview featuregate in openshift/api for DCM.
3. Update cluster-ingress-operator to enable DCM if the featuregate is enabled.
4. Re-enable old E2E tests in openshift/origin for DCM.
5. Possibly remove outdated logic from DCM.
6. Run E2E, payload, and performance tests with DCM enabled.
7. Allow DCM to soak as tech preview for at least 1 OCP release.
8. Possibly add new logic to DCM to exploit new HAProxy 2.y features.
9. Remove the featuregate and mark DCM as GA.

Steps 5 and 8 are stretch goals and may be done in later OpenShift releases.
The other steps are hard requirements, and we intend to complete them in
OpenShift 4.18.
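During the tech-preview phase (steps 2, 3, and 7), a cluster administrator
would opt in through the standard feature-set mechanism. A sketch follows;
note that `TechPreviewNoUpgrade` cannot be unset once applied, and the exact
name of the DCM featuregate has not been decided, so the grep pattern below is
illustrative:

```shell
# Opt in to tech-preview features cluster-wide. Caution: the
# TechPreviewNoUpgrade feature set cannot be unset once applied.
oc patch featuregate/cluster --type=merge \
  --patch='{"spec":{"featureSet":"TechPreviewNoUpgrade"}}'

# Verify that the DCM gate is among the enabled gates (the gate's exact
# name is still to be decided, so this pattern is illustrative).
oc get featuregate/cluster -o yaml | grep -i dynamicconfig
```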
### Workflow Description

Initially, enabling Dynamic Config Manager (DCM) will require enabling a
featuregate. Ultimately, DCM should be enabled by default. This configuration
should be completely transparent to the end user. The only effect should be
that OpenShift router uses less CPU and memory, balances traffic more evenly
after endpoints updates, and tracks metrics more accurately after route and
endpoints updates.

#### Variation and form factor considerations [optional]

DCM should function the same on standalone OpenShift, MicroShift, and
HyperShift.

### API Extensions

DCM does not require any API extensions beyond the featuregate. However, it
does have an unsupported config override in the IngressController API in case
a critical issue is discovered after DCM has been enabled by default.

### Implementation Details/Notes/Constraints [optional]

DCM was implemented in OpenShift 3. This initial implementation requires
pre-allocating both backends and servers in order to accommodate routes and
endpoints that are created after HAProxy starts.

Backends are pre-allocated based on *blueprint routes*, which are route
objects that the cluster-admin must create in a designated namespace. A
blueprint route must specify the TLS termination type and the set of
annotations. Backends and servers are pre-allocated when HAProxy starts.

If DCM cannot configure something dynamically, it falls back to the old
fork-and-reload procedure. In particular, the following conditions require
falling back to fork and reload:

- More routes are created than the number of pre-allocated backends.
- More endpoints are created for a route than the number of pre-allocated server slots for that specific route's backend.
- A route is created that does not match the TLS termination type and annotations of any blueprint route.

In OpenShift 3, configuring blueprint routes was left up to the cluster-admin.
As such, blueprint routes constitute a user-facing API. Additionally, the
cluster-admin could configure the number of server slots to pre-allocate for
each backend.

Note that pre-allocated server slots and backends take up memory. In a
cluster with many routes, pre-allocating multiple server slots for each
backend can consume a considerable amount of memory. See
https://gist.github.com/frobware/2b527ce3f040797909eff482a4776e0b for an
analysis of the potential memory impact for different choices of balancing
algorithm, number of threads, number of backends, and number of pre-allocated
server slots per backend.

To avoid operational overhead, avoid adding new APIs, and minimize the cost
of enabling DCM, we will initially omit blueprint routes and pre-allocate
only 1 server slot per backend. This makes DCM an implementation detail;
ideally it is completely invisible to the cluster-admin, except that the
router will fork fewer processes and have more accurate metrics and
balancing.
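To make the pre-allocation concrete, the following sketch shows how one might
inspect the rendered configuration inside a router pod. The config path is
the router image's conventional location; the backend name, slot name, and
addresses shown are illustrative, not a commitment to a particular template:

```shell
# Look for a pre-allocated dynamic server slot in a router pod's rendered
# configuration (backend name, slot naming, and addresses are illustrative).
oc -n openshift-ingress rsh deployment/router-default \
  grep -A4 'backend be_http:demo:hello' /var/lib/haproxy/conf/haproxy.config

# Expected shape: one extra, initially disabled server slot per backend
# that DCM can retarget and enable through the control socket, e.g.:
#   backend be_http:demo:hello
#     server pod:hello-1:hello:8080:10.128.2.14:8080 10.128.2.14:8080 check
#     server _dynamic-pod-1 172.4.0.4:8765 check disabled
```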
Later, we can enhance DCM by using newer capabilities in HAProxy 2.y to add
arbitrarily many servers dynamically, add backends dynamically, and update
certificates dynamically.

### Risks and Mitigations

Dynamic Config Manager has some known defects:

- [OCPBUGS-7466 No prometheus haproxy metrics present for route created by dynamicConfigManager](https://issues.redhat.com/browse/OCPBUGS-7466)
- [OCPBUGSM-20868 Sticky cookies take long to start working if config manager is enabled](https://issues.redhat.com/browse/OCPBUGSM-20868)
- [NE-1815 Fix implementation gaps discovered during the smoke tests](https://issues.redhat.com/browse/NE-1815)

We intend to fix all of these defects before enabling DCM by default as GA.
However, DCM could have additional, unknown defects. We will mitigate this
risk through E2E tests, payload tests, and working with partners to test DCM.

### Drawbacks

#### Memory overhead

DCM has some up-front memory overhead. For this reason, DCM will initially be
enabled with a minimum configuration of 1 pre-allocated server slot per
backend and without blueprint routes. Eventually, DCM will be enhanced to use
newer HAProxy capabilities to perform more configuration without
pre-allocating servers or backends, thus preventing reloads in more cases
without adding any additional up-front memory overhead.

#### Redundancy with respect to Istio/Envoy

DCM is similar in concept to [Envoy
xDS](https://www.envoyproxy.io/docs/envoy/latest/intro/arch_overview/operations/dynamic_configuration).
Both enable a control plane (openshift-router or Istio) to configure a data
plane (HAProxy or Envoy, respectively) without restarting processes.
Investing in DCM could be considered a redundant effort when Istio/Envoy
already exists and avoids the problem. However, many customers depend on
OpenShift router for its performance and reliability and will continue using
it indefinitely. For these customers, DCM could improve OpenShift router's
performance and reliability without forcing a change in APIs (from Route API
to Gateway API) or proxies (from HAProxy to Envoy).

## Design Details

### Open Questions [optional]

#### Can we use HAProxy's control socket to add backends and certificates dynamically?

Answer: **?**.

We need to determine which kinds of configuration changes cannot be
implemented in DCM and either document these limitations or work with HAProxy
upstream to enhance HAProxy's management capabilities and remove these
limitations.
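As a starting point for this investigation: recent HAProxy 2.y releases do
expose runtime commands for adding and deleting servers and for swapping
certificates, though not, to our knowledge, for adding backends. A sketch of
the relevant commands, with names, paths, and addresses all illustrative
(`add server` was experimental in early 2.y releases):

```shell
SOCK=/var/lib/haproxy/run/haproxy.sock

# Add and delete a server at run time without a pre-allocated slot
# (experimental in early HAProxy 2.y releases; names are illustrative).
echo "add server be_http:demo:hello/pod:hello-2 10.128.2.16:8080" | socat stdio "$SOCK"
echo "del server be_http:demo:hello/pod:hello-2" | socat stdio "$SOCK"

# Replace a loaded certificate in place (HAProxy 2.2 and later).
echo -e "set ssl cert /etc/pki/tls/certs/demo.pem <<\n$(cat new-demo.pem)\n" | socat stdio "$SOCK"
echo "commit ssl cert /etc/pki/tls/certs/demo.pem" | socat stdio "$SOCK"
```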
#### Can we configure Dynamic Config Manager with 0 pre-allocated servers?

Answer: **Probably no, but needs verification**.

One option could be to enable DCM initially with 0 pre-allocated server slots
per backend. If this worked, it would be useful to avoid reloads for
scale-in, or for scale-out following scale-in. However, we found that DCM
still requires a reload for scale-in of a server that doesn't use a
pre-allocated server slot. (TODO: This needs to be double-checked.)

#### Is the cost of 1 pre-allocated server too high?

Answer: **Probably no, but needs a release note**.

Per https://gist.github.com/frobware/2b527ce3f040797909eff482a4776e0b, the
cost of pre-allocating 1 server slot per backend is negligible for 100
routes, 4 to 9 MB for 1000 routes, and around 40 to 80 MB for 10000 routes.
The cost is highest if the router has many threads and many backends. Most
likely, if the router is configured with many threads and backends, it is
already running on large machines. However, it is important to call out the
potential impact of DCM on memory consumption in a release note.

#### Are any of the known defects blockers?

Answer: **Probably yes**.

Enabling Dynamic Config Manager should not cause any regressions. We need to
fix known issues regarding inaccurate metrics and session stickiness and any
other regressions that we find
(see [Risks and Mitigations](#risks-and-mitigations)).

#### Do we need an override to turn off Dynamic Config Manager?

Answer: **Probably yes**.

Ingress is a critical cluster function, and it is an extremely
performance-sensitive one for some customers. In case we ship DCM enabled and
a customer finds some critical issue, we need an unsupported config override
that support staff can use to turn off DCM temporarily until we fix the
issue.

### Test Plan

Dynamic Config Manager will be enabled initially using a TechPreviewNoUpgrade
featuregate to provide CI signal. We will additionally revisit existing E2E
tests, enable any DCM-related E2E tests that are currently not enabled, and
verify that all enabled router-related E2E tests pass with DCM. As known
defects are fixed and any unknown ones are discovered, we will add additional
tests. Finally, we will work with partners to verify DCM for their use cases.

Expanding test coverage is important both for verifying that DCM is ready for
GA and for enabling us to improve DCM in future releases with confidence that
we are not introducing regressions.

### Graduation Criteria

We will introduce the feature as TechPreviewNoUpgrade in OpenShift 4.18, with
the goal of graduating it to GA in the same release. Further improvements to
DCM will follow in subsequent OpenShift releases.

#### Dev Preview -> Tech Preview

N/A.

#### Tech Preview -> GA

- All known regressions are fixed.
- Payload tests pass with DCM enabled.
- At least 2 partners or customers provide test results on DCM's function.
- Memory overhead is within acceptable limits.
- Limitations (such as which changes still require a reload) are documented.

#### Removing a deprecated feature

N/A.

### Upgrade / Downgrade Strategy

The feature requires no specific considerations for upgrade or downgrade;
cluster-ingress-operator will handle upgrading the router image and
configuring Dynamic Config Manager, using the standard process for a rolling
update of the router deployment.

### Version Skew Strategy

N/A.

### Operational Aspects of API Extensions

N/A. The operation of this feature should be transparent to the end user.

#### Failure Modes

If Dynamic Config Manager cannot handle a configuration change, it falls back
to the old fork-and-reload procedure. Thus, in the worst case, OpenShift
router should behave no worse with DCM than without it.

If an unforeseen defect arises, DCM can be disabled using an unsupported
config override:

```shell
oc -n openshift-ingress-operator patch ingresscontrollers/default --type=merge --patch='{"spec":{"unsupportedConfigOverrides":{"dynamicConfigManager":"false"}}}'
```

#### Support Procedures

Metrics and logs are the same as without Dynamic Config Manager:

- The `reload_seconds`, `reload_failure`, and `write_config_seconds` metrics remain.
- The router pod logs will report reloads as well as any errors from DCM.
- The `dynamicConfigManager` unsupported config override remains available.
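A support engineer can confirm from the router's logs whether DCM is actually
reducing reloads. A sketch, in which the grep patterns (and hence the assumed
log message wording) are illustrative:

```shell
# Count reload events in the router logs; with DCM enabled, this number
# should grow more slowly (the grep patterns are illustrative).
oc -n openshift-ingress logs deployment/router-default | grep -ci 'reload'

# Look for messages from the config manager itself, including fallbacks
# to the fork-and-reload path.
oc -n openshift-ingress logs deployment/router-default | grep -i 'config manager'
```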
## Implementation History

- OpenShift Enterprise 3.11 added DCM and the `ROUTER_HAPROXY_CONFIG_MANAGER` option ([release notes](https://docs.openshift.com/container-platform/3.11/release_notes/ocp_3_11_release_notes.html#ocp-311-haproxy-enhancements), [documentation](https://docs.openshift.com/container-platform/3.11/install_config/router/default_haproxy_router.html#using-the-dynamic-configuration-manager)).
- OpenShift Container Platform 4.9 added the `dynamicConfigManager` config override, default off ([openshift/cluster-ingress-operator@6a8516a](https://github.com/openshift/cluster-ingress-operator/pull/628/commits/6a8516ab247b00b87a5d7b32e20d4cffcefe1c0f)).
- OpenShift Container Platform 4.18 enables DCM by default.

## Alternatives

### Hard-Stop-After Option

OpenShift Container Platform 4.7 introduced the `hard-stop-after` option
([openshift/cluster-ingress-operator@7b7327f](https://github.com/openshift/cluster-ingress-operator/pull/522/commits/7b7327fa5e8a48733549ebe1563afc65a871c527),
[documentation](https://docs.openshift.com/container-platform/4.7/networking/routes/route-configuration.html#nw-route-specific-annotations_route-configuration)).
This option causes OpenShift router to terminate old HAProxy processes after
the specified duration following a reload as a workaround to prevent old
processes from accumulating. This has the critical drawback that terminating
an old process also terminates any connections that it had open.
Additionally, the duration needs to be tuned to find an acceptable balance
between how long connections are allowed to live and how many processes are
permitted to accumulate.

### Reload Interval

The [reload-interval](./haproxy-reload-interval.md) enhancement added an
option to configure the minimum interval between reloads. This can be used
alone or in conjunction with hard-stop-after to limit the accumulation of
HAProxy processes. However, it makes the router slower to respond to route or
endpoints updates, adds operational overhead, and only reduces the
accumulation of processes to a limited degree.
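For reference, both workarounds are configured on the IngressController; the
durations below are illustrative:

```shell
# Terminate old HAProxy processes (and their remaining connections)
# after a fixed grace period following each reload.
oc -n openshift-ingress-operator annotate ingresscontrollers/default \
  ingress.operator.openshift.io/hard-stop-after=30m

# Allow at most one reload per interval.
oc -n openshift-ingress-operator patch ingresscontrollers/default --type=merge \
  --patch='{"spec":{"tuningOptions":{"reloadInterval":"15s"}}}'
```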
### Sharding

The [product documentation advises customers to use sharding](https://docs.openshift.com/container-platform/4.16/scalability_and_performance/optimization/routing-optimization.html)
to avoid having a single router deployment handle too many routes. We also
advise customers to use sharding to separate mission-critical routes from
other routes, or to separate routes with frequent updates from routes that
tend to be involved with long-lived connections. However, configuring
sharding requires advance planning, has high operational overhead, and cannot
solve the problem when the same route has frequent updates and long-lived
connections.

As an example, https://gist.github.com/Miciah/cc308b717a9e8c9b74d3f97393a5827b
demonstrates how to put platform routes and all other routes in separate
shards.

### Alternative proxies, such as Envoy

Envoy implements a gRPC-based protocol called [xDS](https://www.envoyproxy.io/docs/envoy/latest/intro/arch_overview/operations/dynamic_configuration),
which control planes such as Contour and Istio use to configure Envoy
dynamically, similar to the way DCM uses HAProxy's control socket. However,
there is no Route API implementation for Contour or Envoy, and migrating
customers from HAProxy is not practical.

## Infrastructure Needed [optional]

This feature requires little in the way of dedicated infrastructure. Some
performance testing is required, but it should not require any special
hardware.