
OCPNODE-2842: Set Upgradeable=False when cluster is on cgroup v1 #4822

Open
wants to merge 1 commit into base: main

Conversation

sairameshv
Member

@sairameshv sairameshv commented Jan 30, 2025

- What I did
Added code to set the cluster operator's status to Upgradeable=False when a cluster is found to be on cgroup v1

- How to verify it
Update the CgroupMode field of the nodes.config object to v1 and verify that the MCO cluster operator's status has the Upgradeable=False condition

- Description for the changelog

A deprecation condition/message for cgroup v1 support was added in the previous release; this PR helps ensure all clusters are updated to cgroup v2 before upgrading to OCP 4.19

References:
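The check described above can be sketched roughly as follows. This is a minimal, self-contained illustration using simplified stand-in types; the real implementation uses the openshift/api `configv1` types and writes the condition to the machine-config ClusterOperator status, so the names here are assumptions, not the MCO's actual definitions.

```go
package main

import "fmt"

// CgroupMode is a stand-in for configv1.CgroupMode from openshift/api.
type CgroupMode string

const (
	CgroupModeV1 CgroupMode = "v1"
	CgroupModeV2 CgroupMode = "v2"
)

// Condition is a simplified stand-in for a ClusterOperator status condition.
type Condition struct {
	Type    string
	Status  string
	Reason  string
	Message string
}

// upgradeableCondition mirrors the behavior this PR adds: when the
// nodes.config object requests cgroup v1, report Upgradeable=False with a
// reason and a remediation message; otherwise leave the cluster upgradeable.
func upgradeableCondition(mode CgroupMode) Condition {
	cond := Condition{Type: "Upgradeable", Status: "True"}
	if mode == CgroupModeV1 {
		cond.Status = "False"
		cond.Reason = "ClusterOnCgroupV1"
		cond.Message = "Cluster is using deprecated cgroup v1 and is not upgradeable. " +
			"Update CgroupMode in the nodes.config.openshift.io object to 'v2'."
	}
	return cond
}

func main() {
	fmt.Println(upgradeableCondition(CgroupModeV1).Reason)
	fmt.Println(upgradeableCondition(CgroupModeV2).Status)
}
```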

@openshift-ci-robot
Contributor

openshift-ci-robot commented Jan 30, 2025

@sairameshv: This pull request references OCPNODE-2843 which is a valid jira issue.

Warning: The referenced jira issue has an invalid target version for the target branch this PR targets: expected the story to target the "4.19.0" version, but no target version was set.

In response to this:

- What I did
Removed support to configure cgroupsv1 from nodes.config object

- How to verify it
Update the CgroupMode field of the nodes.config object and verify that the system errors out and does not update the nodes with the cgroup v1 based kernelArgs

- Description for the changelog

A deprecation condition/message for cgroup v1 support was added in the previous release; this PR removes the ability to configure it.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

@openshift-ci-robot openshift-ci-robot added the jira/valid-reference Indicates that this PR references a valid Jira ticket of any type. label Jan 30, 2025
@openshift-ci openshift-ci bot added the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Jan 30, 2025
Contributor

openshift-ci bot commented Jan 30, 2025

Skipping CI for Draft Pull Request.
If you want CI signal for your change, please convert it to an actual PR.
You can still manually trigger a test run with /test all

@sairameshv
Member Author

/jira refresh
/test all

@openshift-ci-robot
Contributor

openshift-ci-robot commented Jan 30, 2025

@sairameshv: This pull request references OCPNODE-2843 which is a valid jira issue.

Warning: The referenced jira issue has an invalid target version for the target branch this PR targets: expected the story to target the "4.19.0" version, but no target version was set.

In response to this:

/jira refresh
/test all


@sairameshv
Member Author

/jira refresh

@openshift-ci-robot
Contributor

openshift-ci-robot commented Jan 30, 2025

@sairameshv: This pull request references OCPNODE-2843 which is a valid jira issue.

In response to this:

/jira refresh


@sairameshv
Member Author

Related openshift/api PR: openshift/api#2181

@sairameshv sairameshv force-pushed the remove_cgrpv1 branch 2 times, most recently from 3e7f5b7 to d34b287 Compare February 20, 2025 09:46
@sairameshv sairameshv force-pushed the remove_cgrpv1 branch 2 times, most recently from 25db47f to a2b8f7c Compare March 4, 2025 16:33
@sairameshv sairameshv changed the title from OCPNODE-2843: Remove cgroupv1 configuration support from OCP to OCPNODE-2276: Set Upgradeable=False when cluster is on cgroup v1 Mar 4, 2025
@openshift-ci-robot
Contributor

openshift-ci-robot commented Mar 4, 2025

@sairameshv: This pull request references OCPNODE-2276 which is a valid jira issue.

Warning: The referenced jira issue has an invalid target version for the target branch this PR targets: expected the story to target the "4.19.0" version, but no target version was set.

In response to this:

- What I did
Removed support to configure cgroupsv1 from nodes.config object

- How to verify it
Update the CgroupMode field of the nodes.config object and verify that the system errors out and does not update the nodes with the cgroup v1 based kernelArgs

- Description for the changelog

A deprecation condition/message for cgroup v1 support was added in the previous release; this PR removes the ability to configure it.


@openshift-ci-robot
Contributor

openshift-ci-robot commented Mar 4, 2025

@sairameshv: This pull request references OCPNODE-2276 which is a valid jira issue.

Warning: The referenced jira issue has an invalid target version for the target branch this PR targets: expected the story to target the "4.19.0" version, but no target version was set.

In response to this:

- What I did
Added code to set the cluster operator's status to Upgradeable=False when a cluster is found to be on cgroup v1

- How to verify it
Update the CgroupMode field of the nodes.config object to v1 and verify that the MCO cluster operator's status has the Upgradeable=False condition

- Description for the changelog

A deprecation condition/message for cgroup v1 support was added in the previous release; this PR helps ensure all clusters are updated to cgroup v2 before upgrading to OCP 4.19


@sairameshv
Member Author

/jira refresh
/test all

@openshift-ci-robot
Contributor

openshift-ci-robot commented Mar 4, 2025

@sairameshv: This pull request references OCPNODE-2276 which is a valid jira issue.

Warning: The referenced jira issue has an invalid target version for the target branch this PR targets: expected the story to target the "4.19.0" version, but no target version was set.

In response to this:

/jira refresh
/test all


@sairameshv
Member Author

/jira refresh

@openshift-ci-robot
Contributor

openshift-ci-robot commented Mar 4, 2025

@sairameshv: This pull request references OCPNODE-2276 which is a valid jira issue.

In response to this:

/jira refresh


@sairameshv
Member Author

@openshift-ci-robot
Contributor

openshift-ci-robot commented Mar 4, 2025

@sairameshv: This pull request references OCPNODE-2276 which is a valid jira issue.

In response to this:

- What I did
Added code to set the cluster operator's status to Upgradeable=False when a cluster is found to be on cgroup v1

- How to verify it
Update the CgroupMode field of the nodes.config object to v1 and verify that the MCO cluster operator's status has the Upgradeable=False condition

- Description for the changelog

A deprecation condition/message for cgroup v1 support was added in the previous release; this PR helps ensure all clusters are updated to cgroup v2 before upgrading to OCP 4.19

References:


@sairameshv sairameshv force-pushed the remove_cgrpv1 branch 2 times, most recently from aea470d to 0fee7e0 Compare March 5, 2025 11:49
@sairameshv sairameshv marked this pull request as ready for review March 5, 2025 11:49
@openshift-ci openshift-ci bot removed the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Mar 5, 2025
@openshift-ci openshift-ci bot requested review from LorbusChris and yuqi-zhang March 5, 2025 11:50
@sairameshv sairameshv changed the title from OCPNODE-2276: Set Upgradeable=False when cluster is on cgroup v1 to OCPNODE-2842: Set Upgradeable=False when cluster is on cgroup v1 Mar 5, 2025
@openshift-ci-robot
Contributor

openshift-ci-robot commented Mar 5, 2025

@sairameshv: This pull request references OCPNODE-2842 which is a valid jira issue.

Warning: The referenced jira issue has an invalid target version for the target branch this PR targets: expected the story to target the "4.19.0" version, but no target version was set.

In response to this:

- What I did
Added code to set the cluster operator's status to Upgradeable=False when a cluster is found to be on cgroup v1

- How to verify it
Update the CgroupMode field of the nodes.config object to v1 and verify that the MCO cluster operator's status has the Upgradeable=False condition

- Description for the changelog

A deprecation condition/message for cgroup v1 support was added in the previous release; this PR helps ensure all clusters are updated to cgroup v2 before upgrading to OCP 4.19

References:


Contributor

@yuqi-zhang yuqi-zhang left a comment


some minor comments inline

if configNode.Spec.CgroupMode == configv1.CgroupModeV1 {
coStatusCondition.Status = configv1.ConditionFalse
coStatusCondition.Reason = "ClusterOnCgroupV1"
coStatusCondition.Message = "Cluster is using cgroup v1 and is not upgradable. Please update the `CgroupMode` in the `nodes.config.openshift.io` object to 'v2'. Once upgraded, the cluster cannot be changed back to cgroup v1"
Contributor


For the message, perhaps it would be good to mention the deprecation of v1 as well, just to clarify why it's not upgradeable (does it make sense to link to anything here?)

Member Author


I have added "deprecated cgroup v1" to the message.
We have already had the cgroup v1 deprecation message in the config node object's status condition for the last few releases.
We only have an OCPSTRAT reference for this feature at the moment. There is no concrete link for when exactly RHCOS will remove cgroup v1 support; that is planned for a future release (maybe 4.20?). So I don't think we have a link to convey in the message right now.

CgroupMode `v1` is not supported in OCP 4.19.
This change prevents clusters from upgrading to 4.19 before they are
updated to cgroupMode `v2`.

Signed-off-by: Sai Ramesh Vanka <[email protected]>
@sairameshv
Member Author

/retest

@sairameshv sairameshv requested a review from yuqi-zhang March 6, 2025 17:01
Contributor

@yuqi-zhang yuqi-zhang left a comment


I think the patch looks good from the MCO side. There are quite a few test failures, although they seem unrelated?

/retest-required

Contributor

openshift-ci bot commented Mar 6, 2025

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: sairameshv, yuqi-zhang

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@openshift-ci openshift-ci bot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Mar 6, 2025
@sairameshv
Member Author

/retest

@sairameshv
Member Author

I think the patch looks good from the MCO side. There are quite a few test failures, although they seem unrelated?

/retest-required

Yes, the errors are definitely not related to these changes.

if configNode.Spec.CgroupMode == configv1.CgroupModeV1 {
coStatusCondition.Status = configv1.ConditionFalse
coStatusCondition.Reason = "ClusterOnCgroupV1"
coStatusCondition.Message = "Cluster is using deprecated cgroup v1 and is not upgradable. Please update the `CgroupMode` in the `nodes.config.openshift.io` object to 'v2'. Once upgraded, the cluster cannot be changed back to cgroup v1"
Contributor


This condition can be overridden by the pools. We may want to update the status and return. @yuqi-zhang ?

Contributor


Thinking we should append coStatusCondition for cgroupv1 to any pools that are degraded and are preventing the upgrade as well.

Contributor


Hmm, the status here should only be overridden by the pools if a pool is degraded, which I thought was the original intent. I think it's fine if degraded pools take priority; once that is fixed, the user then sees the cgroup v1 failure. If we wanted it that way, we should do it explicitly by ordering this after the pool degrade reporting and having both conditions exit directly.

Based on your second comment, you're looking for something like: if pool degraded && cgroups on v1, then error with Reason = "ClusterOnCgroupV1AndPoolDegraded"? As far as I know, the upgradeable condition is not a list of failure reasons but a singular one, but I'm happy to see that change if there's a more elegant way to do this.

Contributor


There can be multiple blocking CFE conditions separated with a :: delimiter, similar to what can be found here. IMO it's probably worth the modification to return a list of degraded conditions.
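The "::"-delimited aggregation suggested here could look roughly like the sketch below. The helper name and the default reason string are hypothetical, not the MCO's actual code; the point is only that every blocking reason survives into the final condition instead of the last writer winning.

```go
package main

import (
	"fmt"
	"strings"
)

// joinBlockingReasons collects every condition currently blocking the
// upgrade into a single Reason string, using "::" as the delimiter as
// discussed in the review. With no blockers it falls back to a default
// (the name "AsExpected" is an illustrative assumption).
func joinBlockingReasons(reasons []string) string {
	if len(reasons) == 0 {
		return "AsExpected"
	}
	return strings.Join(reasons, "::")
}

func main() {
	// A degraded pool and the cgroup v1 check both block the upgrade:
	fmt.Println(joinBlockingReasons([]string{"PoolDegraded", "ClusterOnCgroupV1"}))
	// No blockers:
	fmt.Println(joinBlockingReasons(nil))
}
```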

Contributor


Hmm. This PR is modifying the status conditions... Shouldn't it be modifying the CFE conditions?

Member Author


I see a CFE evaluation based on cgroup v1 that is setting the condition as well here.
Is there anything I can add to these pieces of code?

@sairameshv
Member Author

/retest-required

@haircommander
Member

/retest

Contributor

openshift-ci bot commented Mar 10, 2025

@sairameshv: The following tests failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

| Test name | Commit | Details | Required | Rerun command |
| --- | --- | --- | --- | --- |
| ci/prow/e2e-aws-ovn-upgrade-out-of-change | 3f3df1b | link | false | /test e2e-aws-ovn-upgrade-out-of-change |
| ci/prow/e2e-vsphere-ovn-zones | 3f3df1b | link | false | /test e2e-vsphere-ovn-zones |
| ci/prow/e2e-aws-ovn-upgrade | 3f3df1b | link | true | /test e2e-aws-ovn-upgrade |
| ci/prow/e2e-gcp-op | 3f3df1b | link | true | /test e2e-gcp-op |
| ci/prow/4.12-upgrade-from-stable-4.11-images | 3f3df1b | link | true | /test 4.12-upgrade-from-stable-4.11-images |
| ci/prow/e2e-aws-workers-rhel8 | 3f3df1b | link | false | /test e2e-aws-workers-rhel8 |
| ci/prow/4.12-upgrade-from-stable-4.11-e2e-aws-ovn-upgrade | 3f3df1b | link | false | /test 4.12-upgrade-from-stable-4.11-e2e-aws-ovn-upgrade |
| ci/prow/e2e-azure-ovn-upgrade | 3f3df1b | link | false | /test e2e-azure-ovn-upgrade |
| ci/prow/okd-e2e-upgrade | 3f3df1b | link | false | /test okd-e2e-upgrade |
| ci/prow/okd-e2e-vsphere | 3f3df1b | link | false | /test okd-e2e-vsphere |
| ci/prow/e2e-aws-disruptive | 3f3df1b | link | false | /test e2e-aws-disruptive |
| ci/prow/e2e-aws-ovn-workers-rhel8 | 3f3df1b | link | false | /test e2e-aws-ovn-workers-rhel8 |
| ci/prow/okd-e2e-aws | 3f3df1b | link | false | /test okd-e2e-aws |
| ci/prow/e2e-ovirt | 3f3df1b | link | false | /test e2e-ovirt |
| ci/prow/okd-images | 3f3df1b | link | true | /test okd-images |
| ci/prow/okd-e2e-gcp-op | 3f3df1b | link | false | /test okd-e2e-gcp-op |
| ci/prow/e2e-azure-ovn-upgrade-out-of-change | 3f3df1b | link | false | /test e2e-azure-ovn-upgrade-out-of-change |
| ci/prow/e2e-metal-ipi-ovn-dualstack | 3f3df1b | link | false | /test e2e-metal-ipi-ovn-dualstack |
| ci/prow/cluster-bootimages | 3f3df1b | link | true | /test cluster-bootimages |
| ci/prow/e2e-vsphere-ovn-upi | 3f3df1b | link | false | /test e2e-vsphere-ovn-upi |
| ci/prow/e2e-ovirt-upgrade | 3f3df1b | link | false | /test e2e-ovirt-upgrade |
| ci/prow/e2e-gcp-op-ocl | 3f3df1b | link | false | /test e2e-gcp-op-ocl |
| ci/prow/e2e-aws-single-node | 3f3df1b | link | false | /test e2e-aws-single-node |

Full PR test history. Your PR dashboard.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.

@yuqi-zhang
Contributor

Comment aside, prow changing master to main inadvertently turned on some old tests; feel free to ignore those.
