Skip to content

MHC does not watch remediation #364

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 4 commits into from
Apr 20, 2025

Conversation

mshitrit
Copy link
Member

Why we need this PR

MHC does not watch the remediation CR, which in turn will cause it to miss the remediator (for example) removing the finalizer and will not trigger the lease removal and the deletion of the CR.
This issue also causes the CI e2e tests to fail (or at least be extremely flaky)

Changes made

Extracted NHC watch remediation/template to a new component
Use this component both by NHC and MHC in order to prevent code duplication

Which issue(s) this PR fixes

ECOPROJECT-2187

Test plan

Copy link
Contributor

openshift-ci bot commented Mar 17, 2025

Skipping CI for Draft Pull Request.
If you want CI signal for your change, please convert it to an actual PR.
You can still manually trigger a test run with /test all

@mshitrit mshitrit changed the title [WIP] Mhc watch remediation [WIP] MHC does not watch remediation Mar 17, 2025
@mshitrit mshitrit force-pushed the mhc-watch-remediation branch from a8d9b74 to b18cffa Compare March 17, 2025 19:23
@mshitrit
Copy link
Member Author

/test ?

Copy link
Contributor

openshift-ci bot commented Mar 17, 2025

@mshitrit: No presubmit jobs available for medik8s/node-healthcheck-operator@main

In response to this:

/test ?

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@mshitrit
Copy link
Member Author

/test 4.17-openshift-e2e

Export Watcher to interface to support unit tests

Signed-off-by: Michael Shitrit <[email protected]>
@mshitrit mshitrit force-pushed the mhc-watch-remediation branch from b18cffa to 94f2c21 Compare March 26, 2025 09:33
@mshitrit
Copy link
Member Author

/test 4.17-openshift-e2e

1 similar comment
@mshitrit
Copy link
Member Author

/test 4.17-openshift-e2e

@mshitrit
Copy link
Member Author

/test 4.16-openshift-e2e

@mshitrit
Copy link
Member Author

/retest

@mshitrit
Copy link
Member Author

/test 4.16-openshift-e2e

4 similar comments
@slintes
Copy link
Member

slintes commented Mar 28, 2025

/test 4.16-openshift-e2e

@Shai1-Levi
Copy link

/test 4.16-openshift-e2e

@mshitrit
Copy link
Member Author

mshitrit commented Apr 1, 2025

/test 4.16-openshift-e2e

@mshitrit
Copy link
Member Author

mshitrit commented Apr 3, 2025

/test 4.16-openshift-e2e

 - Added wait period for CR to be deleted: MHC wasn't waiting for the CR deletion which failed the test because lease wasn't removed
- Increased waiting for lease removal for 2 minutes: MHC will remove the lease after the CR is removed, it will requeue Reconciles until then. That requeing is causing an exponential backoff - making current 1 minute wait too short (the remediation that is triggered by the CR deletion will not trigger the lease removal because it'll short circuit since MHC will not be found as it's using the CR's name).

Signed-off-by: Michael Shitrit <[email protected]>
@mshitrit mshitrit force-pushed the mhc-watch-remediation branch from aec2c01 to 2ac34fd Compare April 6, 2025 10:04
@mshitrit
Copy link
Member Author

mshitrit commented Apr 6, 2025

/test 4.16-openshift-e2e

@mshitrit mshitrit changed the title [WIP] MHC does not watch remediation MHC does not watch remediation Apr 8, 2025
Copy link
Member

@slintes slintes left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

almost lgtm, 2 comments inline

- add comments that were omitted in refactoring
- add MHC Template Mapper function

Signed-off-by: Michael Shitrit <[email protected]>
@mshitrit
Copy link
Member Author

mshitrit commented Apr 9, 2025

/test 4.16-openshift-e2e

@openshift-ci openshift-ci bot added the lgtm label Apr 15, 2025
Copy link
Contributor

openshift-ci bot commented Apr 15, 2025

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: mshitrit, slintes

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@mshitrit mshitrit marked this pull request as ready for review April 15, 2025 14:10
@openshift-ci openshift-ci bot requested review from clobrano and slintes April 15, 2025 14:10
@mshitrit
Copy link
Member Author

/test 4.16-openshift-e2e

@mshitrit
Copy link
Member Author

/retest

@mshitrit
Copy link
Member Author

/test 4.16-openshift-e2e

@mshitrit
Copy link
Member Author

/retest

2 similar comments
@mshitrit
Copy link
Member Author

/retest

@mshitrit
Copy link
Member Author

/retest

@openshift-merge-bot openshift-merge-bot bot merged commit 71f4e82 into medik8s:main Apr 20, 2025
23 checks passed
@mshitrit
Copy link
Member Author

@mshitrit it looks like the MHC test still is pretty flaky, did you investigate?

Yeah, from what I can tell the following happens:

  1. Once MHC is updated the remediation keeps rescheduling until the CR is removed
  2. The Leases would be removed upon the Reconcile rescheduling that would happen right after the CR is removed.
  3. Since this rescheduling happens multiple time the duration is increased due to exponential back-off there is some flakiness regarding the time it'll happen
  4. Waiting for CR deletion (one of the updates in this PR) mitigates the problem but still does not guarantee that the remediation will be scheduled in the correct time window.

IMO the simplest way stabilize this (assuming we just want to modify the test) is increasing the time tolerance for leases to be deleted.
Let me know WDYT.

@mshitrit mshitrit mentioned this pull request May 7, 2025
if watchType == NHC {
return owner.Kind == "NodeHealthCheck" && owner.APIVersion == remediationv1alpha1.GroupVersion.String()
} else {
return owner.Kind == "Machine" && owner.APIVersion == machinev1beta1.GroupVersion.String()
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This should be "MachineHealthCheck" and not "Machine", we are currently trying to reconcile MHCs with Machine names... that's why tests are still flaky...

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants