docs: further details (GWAPI CRD lifecycle mgmt)
Signed-off-by: Shane Utt <[email protected]>
shaneutt committed Feb 17, 2025
1 parent 311f127 commit a4db666
Showing 1 changed file with 29 additions and 72 deletions.
101 changes: 29 additions & 72 deletions enhancements/ingress/gateway-api-crd-life-cycle-management.md
@@ -17,7 +17,6 @@ creation-date: 2025-01-22
last-updated: 2025-01-27
tracking-link:
- https://issues.redhat.com/browse/NE-1946
status: provisional
see-also:
- "/enhancements/ingress/gateway-api-with-cluster-ingress-operator.md"
---
@@ -71,7 +70,7 @@ As a cluster-admin, I want to install a third-party Gateway API implementation
on my OpenShift 4.19 cluster, and use the third-party implementation without
any interference from the first-party implementation. Relatedly I want to be
able to utilize both the first-party and any third-party solution alongside
eachother simultaneously and independently without any interference between the
each other simultaneously and independently without any interference between the
two.

#### Future OpenShift upgrades
@@ -180,7 +179,7 @@ goes wrong.
> starting state and then list the steps that the user would need to go through to
> trigger the feature described in the enhancement. Optionally add a
> [mermaid](https://github.com/mermaid-js/mermaid#readme) sequence diagram.
>
>
> Use sub-sections to explain variations, such as for error handling,
> failure recovery, or alternative outcomes.
@@ -341,6 +340,10 @@ N/A.

N/A.

## Version Skew Strategy

> **Note**: See "Operational Aspects of API Extensions" below.
## Upgrade / Downgrade Strategy

> If applicable, how will the component be upgraded and downgraded? Make sure this
@@ -382,85 +385,39 @@ N/A.
> CVO does not currently delete resources that no longer exist in
> the target version.
## Version Skew Strategy
## Operational Aspects of API Extensions

> How will the component handle version skew with other components?
> What are the guarantees? Make sure this is in the test plan.
>
> Consider the following in developing a version skew strategy for this
> enhancement:
> - During an upgrade, we will always have skew among components, how will this impact your work?
> - Does this enhancement involve coordinating behavior in the control plane and
> in the kubelet? How does an n-2 kubelet without this feature available behave
> when this feature is used?
> - Will any other components on the node change? For example, changes to CSI, CRI
> or CNI may require updating that component before the kubelet.
Other products and components that have Gateway API support will now be able to
consistently know that Gateway API will already be present on the cluster, and
which version will be present given the version of OpenShift. There will no
longer be a need for them to document having their users deploy the CRDs
manually or do any management themselves that could conflict.

_TBD: Do we describe version skew with layered products here?_
We are already aware of several projects which utilize Gateway API including
(but not limited to):

## Operational Aspects of API Extensions
* OpenShift Service Mesh
* Kuadrant
* OpenShift AI Serving

> Describe the impact of API extensions (mentioned in the proposal section, i.e. CRDs,
> admission and conversion webhooks, aggregated API servers, finalizers) here in detail,
> especially how they impact the OCP system architecture and operational aspects.
>
> - For conversion/admission webhooks and aggregated apiservers: what are the SLIs (Service Level
> Indicators) an administrator or support can use to determine the health of the API extensions
>
> Examples (metrics, alerts, operator conditions)
> - authentication-operator condition `APIServerDegraded=False`
> - authentication-operator condition `APIServerAvailable=True`
> - openshift-authentication/oauth-apiserver deployment and pods health
>
> - What impact do these API extensions have on existing SLIs (e.g. scalability, API throughput,
> API availability)
>
> Examples:
> - Adds 1s to every pod update in the system, slowing down pod scheduling by 5s on average.
> - Fails creation of ConfigMap in the system when the webhook is not available.
> - Adds a dependency on the SDN service network for all resources, risking API availability in case
> of SDN issues.
> - Expected use-cases require less than 1000 instances of the CRD, not impacting
> general API throughput.
>
> - How is the impact on existing SLIs to be measured and when (e.g. every release by QE, or
> automatically in CI) and by whom (e.g. perf team; name the responsible person and let them review
> this enhancement)
>
> - Describe the possible failure modes of the API extensions.
> - Describe how a failure or behaviour of the extension will impact the overall cluster health
> (e.g. which kube-controller-manager functionality will stop working), especially regarding
> stability, availability, performance and security.
> - Describe which OCP teams are likely to be called upon in case of escalation with one of the failure modes
> and add them as reviewers to this enhancement.
_TBD: Do we need to describe anything here?_
We will coordinate with these projects and others from release to release on
their needs related to Gateway API version support. We expect that more
flexibility in the supported version will eventually be needed, and we
anticipate adding ranges of support instead of specific versions to
accommodate this.

## Support Procedures

### Conflicting CRDs

If the Ingress Operator detects the presence of a conflicting version of the
Gateway API CRDs, it updates the ingress clusteroperator to report a `Degraded`
status condition with status `True` and a message explaining the situation:

_TBD: Insert example output from `oc get clusteroperators/ingress -o yaml`._

In this situation, the cluster-admin is expected to verify that workload would
not be broken by handing life-cycle management of the CRDs over to the Ingress
Operator:

_TBD: Insert `oc` command to make the CRD ownership transition._

Then the Ingress Operator takes ownership and updates the CRDs:

_TBD: Insert example `oc get clusteroperators` and `oc get crds` commands._

### Overriding the Ingress Operator
The pre-upgrade checks should eliminate any problems with CRD conflicts.
However, it is always _technically possible_ for the admin to force through
both the pre-upgrade check AND the admin gate. If they do this, the Cluster
Ingress Operator (CIO) will detect the mismatching schema and report a
`Degraded` status condition with status `True` and a message explaining the
problem.

_TBD: Should we describe how to turn off the Ingress Operator so that the
cluster-admin can override the CRDs, or describe how Server-Side Apply enables
the cluster-admin to take over the CRDs?_
In this situation, the cluster-admin must go back, follow the upgrade
instructions regarding Gateway API CRDs correctly, and fix the state on the
cluster before the operator can move out of the `Degraded` state.

## Alternatives
