-
Notifications
You must be signed in to change notification settings - Fork 489
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
NE-1946: ingress: Add gateway-api-crd-life-cycle-management #1740
NE-1946: ingress: Add gateway-api-crd-life-cycle-management #1740
Conversation
/assign @shaneutt |
[APPROVALNOTIFIER] This PR is NOT APPROVED This pull-request has been approved by: The full list of commands accepted by this bot can be found here.
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Great foundation, thank you for putting this together Miciah.
@JoelSpeed @knobunc we would appreciate if you would review what's here so far, but note that our intention is to iterate on the TODOs and TBDs in follow up PRs as doing a few separate, smaller and purpose-focused PRs should help reviewing.
OpenShift's Ingress Operator monitors for the presence of the Gateway API CRDs. | ||
|
||
- If they are absent, it installs the appropriate version. | ||
- If they are present at the expected version, the operator does nothing. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If "do nothing" also implies we will use these CRDs I think we'll need to think on that more because we need to manage their lifecycle, and we would need to know who the owner is and have that consent. This however I can add in one of the follow up iterations, so a TODO here should be good:
- If they are present at the expected version, the operator does nothing. | |
- If they are present at the expected version: TODO - expand upon what we'll do for preexisting CRDs. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Are you alluding to transferring ownership with Server-Side Apply? Now that you mention it, I think the proposal should be rewritten in terms of who owns it rather than whether it is at the expected version. Should I go ahead and s/at the expected version/with the Ingress Operator as the owner/ now, with a to-do to fill in the SSA-related details?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes that sounds great 👍
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I rewrote in terms of SSA and added a to-do in https://github.com/openshift/enhancements/compare/99aae59bb1ac50f1ef02e07ff3dcc3385216c451..e34441f0d0685989e6417db70dcf049c68f9fc55.
OpenShift release. Specifically, the CRDs SHOULD be the version corresponding | ||
to the version of Istio in that OSSM version; they MUST be of some version that | ||
is compatible and that we have tested with that version of Istio. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
OpenShift release. Specifically, the CRDs SHOULD be the version corresponding | |
to the version of Istio in that OSSM version; they MUST be of some version that | |
is compatible and that we have tested with that version of Istio. | |
OpenShift release. Specifically, the CRDs SHOULD be the version corresponding | |
to the version of Istio in that OSSM version; they MUST be of some version that | |
is compatible and that we have tested with that version of Istio (TODO: expand upon this) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done in https://github.com/openshift/enhancements/compare/99aae59bb1ac50f1ef02e07ff3dcc3385216c451..e34441f0d0685989e6417db70dcf049c68f9fc55. Is that sufficient? If we need more detail, it might warrant a separate section for how we validate a particular CRD version.
|
||
Potential mitigation strategies: | ||
|
||
- Handle with an admission webhook. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
(just a note that we put handling updates to dead fields being handled by a webhook as a non-goal above)
We have a dilemma when it comes to trying to handle dead fields with a webhook, because our validation would require us to ultimately reference other objects. Something we need to think through, but we can follow up on that as part of the TBD.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I added a cross-reference to the "Admission webhook" section under "Alternatives", and I added a paragraph about consistency issues to that section. I left the TBD to fill in more details. These changes are in https://github.com/openshift/enhancements/compare/99aae59bb1ac50f1ef02e07ff3dcc3385216c451..e34441f0d0685989e6417db70dcf049c68f9fc55.
Note that Gateway API CRD life-cycle management will be part of the | ||
[gateway-api-with-cluster-ingress-operator](gateway-api-with-cluster-ingress-operator.md) enhancement, which defines its | ||
own graduation criteria. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We had a conversation before about potentially packaging the CRDs in at the CVO level, I believe the main problem we had with that was limitations to how we can handle conflict resolution. We should capture somewhere in this doc (alternatives considered, etc) the reasons why we moved away from that for posterity.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I added a heading in https://github.com/openshift/enhancements/compare/99aae59bb1ac50f1ef02e07ff3dcc3385216c451..e34441f0d0685989e6417db70dcf049c68f9fc55. I don't actually understand how using CVO would help, so I haven't written anything out yet. I can go back through Jira comments and meeting notes to try to figure that out, or else feel free to write something up about that alternative.
99aae59
to
e34441f
Compare
As a cluster-admin, I want to upgrade my cluster from OpenShift 4.18 (which | ||
*doesn't* manage the Gateway API CRDs' life-cycle) to OpenShift 4.19 (which | ||
*does*). |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Probably want a case (perhaps this one?) where the story includes having a third party implementation of gateway API already installed in 4.18. This story at the moment sounds perhaps incomplete as I don't see the impact unless there is a gateway API installed on the 4.18 side?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, I figured the user story could describe two cases: CRDs were not installed on the 4.18 cluster, and they were. I can split it into two user stories if that makes more sense.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah I think it might make sense to have one explicitly call out that there were Gateway API CRDs installed in 4.18
The Ingress Operator's warning will take the form of, at a minimum, setting the | ||
`Degraded` status condition's status to `True` with a descriptive message on the | ||
ingress clusteroperator. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
At some point we will need to explain how an admin gets out of this, TODO delete this comment if I find that later
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I figure it can be documented under "Workflow Description" or "Support Procedures". I can add a cross-reference here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That would be helpful, thank you
In this situation, the cluster-admin is expected to verify that workload would | ||
not be broken by handing life-cycle management of the CRDs over to the Ingress | ||
Operator: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If they cannot verify this, or I guess, they verify it and determine their workload would be broken, we need to explain that path too
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Right, we will probably need to describe how workload could be broken (e.g. third-party controller fails to start because it needs newer CRDs), how to predict or detect it, and how to handle it (even if it comes down to, "Work with Red Hat and the third-party vendor to align on a common CRD version.").
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I suspect in reality the issue is that either you end up with dead fields, or the controller starts stripping fields it doesn't understand, or it starts trying to default/set a field that doesn't exist within the current CRD, and then breaks/continuously reconciles because the value isn't ever set, anything else? Explaining these potential issues is probably useful for the tech writer who will have to write the release notes 😅
_TBD: Should we describe how to turn off the Ingress Operator so that the | ||
cluster-admin can override the CRDs, or describe how Server-Side Apply enables | ||
the cluster-admin to take over the CRDs?_ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sounds like you might need what we are doing for CAPI in this space?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm hoping we don't need to implement an explicit override [for GA]. Even if the alternative is something a little complicated (involving SSA) or heavy-handed (such as using a CVO override and scaling down the whole operator), we just need something good enough for now and can defer doing something better.
e34441f
to
42d946a
Compare
42d946a
to
1ec9f49
Compare
* enhancements/ingress/gateway-api-crd-life-cycle-management.md: New file. * enhancements/ingress/gateway-api-with-cluster-ingress-operator.md: Add cross-reference to the new file.
1ec9f49
to
a5f5545
Compare
@Miciah: all tests passed! Full PR test history. Your PR dashboard. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here. |
@Miciah: This pull request references NE-1946 which is a valid jira issue. Warning: The referenced jira issue has an invalid target version for the target branch this PR targets: expected the task to target the "4.19.0" version, but no target version was set. In response to this: Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository. |
/close |
@Miciah: Closed this PR. In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. |
Continued in |
No description provided.