Fix flaky e2e #373

slintes · 2025-07-08T17:45:58Z

Why we need this PR

Commands for preparing MHC tests fail regularly, which causes flaky e2e tests

Changes made

wait for healthy cluster
add retry to relevant commands in e2e script
remove OCP version check (we run on OCP 4.14+ only now)
improve retry when creating resources in e2e suite
update gitignore

Which issue(s) this PR fixes

RHWA-100

Summary by CodeRabbit

Chores
- Updated file ignore settings to exclude the .history directory from version control.
Refactor
- Improved reliability of end-to-end test scripts by adding automatic retries to critical commands.
- Simplified test execution by removing version checks and relying on environment variables.
- Extended wait times and enhanced retry logic in test readiness checks to improve stability.

Signed-off-by: Marc Sluiter <[email protected]>

openshift-ci · 2025-07-08T17:46:04Z

Skipping CI for Draft Pull Request.
If you want CI signal for your change, please convert it to an actual PR.
You can still manually trigger a test run with /test all

coderabbitai · 2025-07-08T17:46:05Z

Walkthrough

The .gitignore file was updated to ignore the .history directory. In the hack/test-e2e.sh script, OpenShift version checks were removed and replaced with simpler environment gating. A new retry function was added to wrap critical oc commands, adding retry logic for improved robustness. The e2e test suite increased timeouts and polling intervals for resource readiness checks and moved resource get-or-create logic inside an Eventually block.

Changes

| File(s) | Change Summary |
|---|---|
| .gitignore | Added `.history` directory to the ignore list. |
| hack/test-e2e.sh | Removed OpenShift version checks; added `retry` function; wrapped critical `oc` commands with retry logic and simplified environment gating. |
| e2e/e2e_suite_test.go | Refactored resource existence checks to run inside an `Eventually` block with increased timeout (2s → 5s) and polling interval (100ms → 500ms); changed error assertions to use local `Gomega` instance. |

Sequence Diagram(s)

sequenceDiagram
    participant Tester
    participant Script
    participant oc

    Tester->>Script: Run test-e2e.sh
    Script->>Script: Define retry()
    loop For each critical oc command
        Script->>Script: retry(<retries>, <wait>, "oc ...")
        alt Command succeeds
            Script->>oc: Execute oc command
        else Command fails
            Script->>Script: Wait and retry
        end
    end
    Script->>Tester: Continue with test execution

Possibly related PRs

Fix flaky tests #370: Improves e2e test synchronization by fixing flaky resource existence checks, related to the increased timeouts and polling intervals in e2e tests here.

Poem

In the warren of code, a history fades,
As .history hides in the shade.
With retries in place, the tests hop anew,
No version checks block what rabbits pursue.
Three leaps for each command, just in case—
Robustness and order, in this burrowed space! 🐇

📜 Recent review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between a8a70f2 and 0afd69c.

📒 Files selected for processing (1)

e2e/e2e_suite_test.go (1 hunks)

🚧 Files skipped from review as they are similar to previous changes (1)

e2e/e2e_suite_test.go

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (2)

GitHub Check: e2e-k8s
GitHub Check: build-and-unit-test

openshift-ci · 2025-07-08T17:46:06Z

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: slintes

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

~~OWNERS~~ [slintes]

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

slintes · 2025-07-08T17:46:42Z

@coderabbitai review

coderabbitai · 2025-07-08T17:46:48Z

✅ Actions performed

Review triggered.

Note: CodeRabbit is an incremental review system and does not re-review already reviewed commits. This command is applicable only when automatic reviews are paused.

slintes · 2025-07-08T17:47:29Z

/test 4.18-openshift-e2e

coderabbitai

Actionable comments posted: 0

🔭 Outside diff range comments (1)

hack/test-e2e.sh (1)
5-7: Add pipefail to catch hidden errors in pipelines

With set -e alone, a failing command in a pipeline is masked if the last command succeeds (e.g., grep … | oc create …).
Enabling pipefail makes the pipeline return a non-zero status when any of its commands fails, so set -e then aborts the script. This improves reliability, which is especially important now that this script manipulates cluster-level resources.
-set -e
+set -e -o pipefail
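
The effect of pipefail can be seen in isolation with a small sketch. It is not taken from hack/test-e2e.sh; `check_pipeline` is a hypothetical helper, and `false | cat` stands in for a pipeline whose first command fails.

```shell
#!/usr/bin/env bash

# Hypothetical helper: report whether a pipeline whose first command
# fails is treated as a success or a failure by the shell.
check_pipeline() {
    if false | cat; then
        echo "success"
    else
        echo "failure"
    fi
}

# Without pipefail the pipeline status comes from the last command (cat).
set +o pipefail
without_pipefail=$(check_pipeline)

# With pipefail the failing `false` determines the pipeline status.
set -o pipefail
with_pipefail=$(check_pipeline)

echo "without pipefail: ${without_pipefail}"
echo "with pipefail: ${with_pipefail}"
```

Combined with set -e, the second behavior is what lets a failing first stage of a pipeline stop the script.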

🧹 Nitpick comments (2)

hack/test-e2e.sh (2)
53-70: retry helper is solid but can be hardened

set -e remains in effect inside the function; that’s fine for the until condition but any failure in the body of the loop (e.g., sleep) would still abort the script. Add set +e locally or document the expectation.

Consider jitter (sleep $((wait + RANDOM % 3))) to avoid thundering-herd effects when many jobs retry simultaneously in CI.

Echoing $* unquoted can mangle arguments that contain spaces; use "$*" inside quotes for diagnostics.

Not a blocker, but worth polishing.
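
A minimal sketch of the jitter idea, assuming bash (for `RANDOM`). The helper name `jittered_wait` is hypothetical and not part of the script under review.

```shell
#!/usr/bin/env bash

# Hypothetical helper: add 0 to 2 seconds of random jitter to a base wait.
# RANDOM is a bash built-in that yields an integer between 0 and 32767.
jittered_wait() {
    local base=$1
    echo $(( base + RANDOM % 3 ))
}

delay=$(jittered_wait 5)
echo "would sleep for ${delay} seconds"   # somewhere between 5 and 7
```

A retry loop could then call `sleep "$(jittered_wait "$wait")"` instead of sleeping for a fixed interval.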

72-80: Flag inconsistency in --type syntax

You mix --type=merge (lines 73-74) with --type merge (line 79). Both work, yet consistency aids greppability and reduces cognitive load. Pick one form—prefer the equal-sign variant used elsewhere.
-retry 3 5 oc patch featuregate cluster --type merge --patch \
+retry 3 5 oc patch featuregate cluster --type=merge --patch \

📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between c3ce42b and a7a77a6.

📒 Files selected for processing (2)

.gitignore (1 hunks)
hack/test-e2e.sh (1 hunks)

🧰 Additional context used

🧠 Learnings (1)

hack/test-e2e.sh (2)

Learnt from: mshitrit
PR: medik8s/node-healthcheck-operator#365
File: controllers/resources/manager.go:319-364
Timestamp: 2025-05-28T07:55:11.390Z
Learning: In the node-healthcheck-operator HandleHealthyNode method, when calcCrDeletionDelay fails with an error, the intended behavior is to log the error and proceed with CR deletion (treating it as "no delay configured") rather than aborting reconciliation. This prevents the system from getting stuck when delay calculations fail due to issues like malformed annotations.

Learnt from: mshitrit
PR: medik8s/node-healthcheck-operator#365
File: controllers/resources/manager.go:319-335
Timestamp: 2025-05-28T08:18:35.543Z
Learning: In the node-healthcheck-operator HandleHealthyNode method, the UpdateStatusNodeDelayedHealthy call with unsafe type cast to *NodeHealthCheck is actually safe because HealthyDelayContextKey is only set for NodeHealthCheck controllers, not MachineHealthCheck controllers. This means shortestDelay will always be 0 for MachineHealthCheck, preventing the unsafe cast line from being reached.

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (2)

GitHub Check: e2e-k8s
GitHub Check: build-and-unit-test

🔇 Additional comments (2)

.gitignore (1)

27-28: Addition looks good – local IDE history stays out of Git

Ignoring the .history folder is a sensible default and doesn’t conflict with any existing patterns. 👍

hack/test-e2e.sh (1)

95-108: Long-living clusters will stay “forever un-upgradeable”

Cleanup that re-enables upgrades is commented out. This is fine for ephemeral CI clusters but dangerous if someone runs the script against a shared or persistent environment.

Recommendations:

Re-enable the cleanup by default; add --no-cleanup flag for local debugging.

Or, at minimum, print a loud warning before exit when the feature gate is still enabled.

Please verify this aligns with intended usage.

slintes · 2025-07-08T19:52:56Z

/test 4.18-openshift-e2e
/test 4.19-openshift-e2e
/test 4.20-openshift-e2e

This update introduces a retry function to the test-e2e.sh script, allowing patch commands for MachineConfigPools and feature gates to be retried up to three times with a five-second wait between attempts. This change aims to improve the reliability of the e2e tests by handling transient failures more gracefully. Additionally, the logic for copying the SelfNodeRemediation template has been updated to use the retry mechanism.

Signed-off-by: Marc Sluiter <[email protected]>
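
For readers without the diff at hand, a retry helper of this kind might look roughly like the sketch below. It is an illustration, not the code merged in this PR: the argument order follows the `retry 3 5 oc patch ...` calls quoted in the review, the first number is assumed to be the total number of attempts, and `flaky` is a hypothetical stand-in for an `oc` command.

```shell
#!/usr/bin/env bash

# Sketch of a retry helper: retry <attempts> <wait-seconds> <command...>
# Assumption: <attempts> counts all attempts, including the first one.
retry() {
    local retries=$1 wait=$2
    shift 2
    local attempt=1
    # A command used as the `until` condition does not trigger `set -e`.
    until "$@"; do
        if [ "$attempt" -ge "$retries" ]; then
            echo "Command failed after ${retries} attempts: $*" >&2
            return 1
        fi
        echo "Retry ${attempt}/${retries} failed. Retrying in ${wait} seconds..."
        sleep "$wait"
        attempt=$((attempt + 1))
    done
}

# Hypothetical flaky command: fails twice, then succeeds.
calls=0
flaky() {
    calls=$((calls + 1))
    [ "$calls" -ge 3 ]
}

# A wait of 0 keeps the demo fast; the script in this PR uses 5 seconds.
retry 3 0 flaky && echo "succeeded after ${calls} calls"
```

Because the helper runs `"$@"` directly, a pipeline has to be wrapped in a function or `bash -c` before it can be retried as a whole.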

slintes · 2025-07-08T20:24:37Z

/test 4.18-openshift-e2e
/test 4.19-openshift-e2e
/test 4.20-openshift-e2e

slintes · 2025-07-09T07:19:07Z

4.18 actual test failed, others show no occurrence of the retry...

/test 4.18-openshift-e2e
/test 4.19-openshift-e2e
/test 4.20-openshift-e2e

This update introduces a conditional check to the test-e2e.sh script that exits the script early if it is not running in an OpenShift CI environment. This change aims to prevent unnecessary execution of MachineHealthCheck tests outside of the OCP CI context, preventing failures on k8s environments.

Signed-off-by: Marc Sluiter <[email protected]>
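
The gating described here could be sketched as follows. The variable name `OPENSHIFT_CI` and the helper `should_run_mhc_tests` are assumptions for illustration; the commit message does not say which variable the script checks.

```shell
#!/usr/bin/env bash

# Hypothetical helper: run the MachineHealthCheck tests only in OpenShift CI.
# Assumption: the CI environment exports OPENSHIFT_CI=true.
should_run_mhc_tests() {
    [ "${OPENSHIFT_CI:-}" = "true" ]
}

if should_run_mhc_tests; then
    echo "Preparing MachineHealthCheck e2e tests"
else
    # A real script would `exit 0` here; the sketch only prints a message.
    echo "Not running in OpenShift CI, skipping MachineHealthCheck tests"
fi
```

Using `${OPENSHIFT_CI:-}` keeps the check safe under `set -u` when the variable is not set at all, as on a plain k8s cluster.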

slintes · 2025-07-09T07:28:42Z

/test 4.18-openshift-e2e
/test 4.19-openshift-e2e
/test 4.20-openshift-e2e

slintes · 2025-07-09T16:07:00Z

the fix worked in 4.19:

Copying SNR template to openshift-machine-api
Error from server: error when creating "STDIN": rpc error: code = Unavailable desc = error reading from server: read tcp 10.0.8.141:53128->10.0.6.248:2379: read: connection reset by peer
Retry 1/3 failed. Retrying in 5 seconds...
...
Copying SNR template to openshift-machine-api
selfnoderemediationtemplate.self-node-remediation.medik8s.io/self-node-remediation-automatic-strategy-template created

There is another issue obviously though....

…nce check

This change updates the timeout from 2 seconds to 5 seconds and the polling interval from 100 milliseconds to 500 milliseconds in the e2e test suite. The adjustment aims to improve the reliability of the test by allowing more time for resources to become available before asserting their existence.

Signed-off-by: Marc Sluiter <[email protected]>

slintes · 2025-07-09T16:38:57Z

/test 4.14-openshift-e2e
/test 4.15-openshift-e2e
/test 4.16-openshift-e2e
/test 4.17-openshift-e2e
/test 4.18-openshift-e2e
/test 4.19-openshift-e2e
/test 4.20-openshift-e2e

mshitrit · 2025-07-10T07:30:54Z

/lgtm

slintes · 2025-07-11T13:31:57Z

looks good now :)

one more try

/test 4.16-openshift-e2e
/test 4.17-openshift-e2e
/test 4.18-openshift-e2e
/test 4.19-openshift-e2e
/test 4.20-openshift-e2e

…statements

This update adds a health check for the cluster after running MachineHealthCheck tests, ensuring the cluster is healthy before proceeding. The echo statements have been reordered for better clarity, with the preparation message now appearing after the health check confirmation.

Signed-off-by: Marc Sluiter <[email protected]>

This update modifies the resource existence check in the e2e test suite by moving the retrieval logic inside the Eventually function. It ensures that missing resources are created and adds a brief sleep to allow for resource availability before asserting their existence. This change aims to enhance the reliability of the tests by handling resource creation more effectively.

Signed-off-by: Marc Sluiter <[email protected]>

mshitrit · 2025-07-13T11:17:25Z

/lgtm
/hold
giving others chance to review as well.
feel free to unhold.

razo7 · 2025-07-13T12:40:13Z

hack/test-e2e.sh


 echo "Preparing MachineHealthCheck e2e tests"

 echo "Pausing MachineConfigPools in order to prevent reboots after enabling feature gate"
-oc patch machineconfigpool worker --type=merge --patch='{"spec":{"paused":true}}'
-oc patch machineconfigpool master --type=merge --patch='{"spec":{"paused":true}}'
+retry 3 5 oc patch machineconfigpool worker --type=merge --patch='{"spec":{"paused":true}}'


NIT: Any conclusion why retrying 3 times for 5 seconds is sufficient here and in other places? I don't mind having more time, but I am just curious about the troubleshooting part of the flaky e2e :)

just some random numbers 🤷🏼

razo7 · 2025-07-13T12:40:33Z

/lgtm

> remove OCP version check (we run on OCP 4.14+ only now)

OCP 4.16+ since last week :)

slintes · 2025-07-14T07:16:59Z

/hold cancel

slintes · 2025-07-14T07:37:40Z

/cherry-pick release-0.9

openshift-cherrypick-robot · 2025-07-14T07:38:20Z

@slintes: #373 failed to apply on top of branch "release-0.9":

Applying: Clean up test-e2e.sh by removing OCP version checks for MHC tests
Applying: Update .gitignore to exclude .history directory
Applying: Add retry mechanism to test-e2e.sh for MHC test commands
Applying: Add check for OpenShift CI in test-e2e.sh to skip MHC tests
Applying: Increase timeout and polling interval in e2e test for resource existence check
Using index info to reconstruct a base tree...
M	e2e/e2e_suite_test.go
Falling back to patching base and 3-way merge...
Auto-merging e2e/e2e_suite_test.go
CONFLICT (content): Merge conflict in e2e/e2e_suite_test.go
error: Failed to merge in the changes.
hint: Use 'git am --show-current-patch=diff' to see the failed patch
hint: When you have resolved this problem, run "git am --continue".
hint: If you prefer to skip this patch, run "git am --skip" instead.
hint: To restore the original branch and stop patching, run "git am --abort".
hint: Disable this message with "git config advice.mergeConflict false"
Patch failed at 0005 Increase timeout and polling interval in e2e test for resource existence check

In response to this:

/cherry-pick release-0.9

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

Fix flaky e2e (cherry picked from commit 2545fc1)

slintes added 2 commits July 8, 2025 19:36

Clean up test-e2e.sh by removing OCP version checks for MHC tests

a206821

Signed-off-by: Marc Sluiter <[email protected]>

Update .gitignore to exclude .history directory

02c4cc3

Signed-off-by: Marc Sluiter <[email protected]>

openshift-ci bot added the do-not-merge/work-in-progress label Jul 8, 2025

openshift-ci bot added the approved label Jul 8, 2025

coderabbitai bot reviewed Jul 8, 2025

slintes force-pushed the fix-flaky-e2e branch from a7a77a6 to 7f15c56 Compare July 8, 2025 19:52

slintes force-pushed the fix-flaky-e2e branch from 7f15c56 to bc46b78 Compare July 8, 2025 20:24

openshift-ci bot assigned mshitrit Jul 10, 2025

openshift-ci bot added the lgtm label Jul 10, 2025

slintes marked this pull request as ready for review July 10, 2025 07:40

slintes changed the title ~~WIP Fix flaky e2e~~ Fix flaky e2e Jul 10, 2025

openshift-ci bot requested review from clobrano and mshitrit July 10, 2025 07:41

openshift-ci bot removed the do-not-merge/work-in-progress label Jul 10, 2025

openshift-ci bot removed the lgtm label Jul 11, 2025

slintes force-pushed the fix-flaky-e2e branch from 2a678b6 to a8a70f2 Compare July 11, 2025 15:56

openshift-ci bot added do-not-merge/hold lgtm labels Jul 13, 2025

razo7 reviewed Jul 13, 2025

openshift-ci bot assigned razo7 Jul 13, 2025

openshift-ci bot removed the do-not-merge/hold label Jul 14, 2025

openshift-merge-bot bot merged commit 2545fc1 into medik8s:main Jul 14, 2025
24 checks passed

slintes pushed a commit to slintes/node-healthcheck-operator that referenced this pull request Jul 14, 2025

Merge pull request medik8s#373 from slintes/fix-flaky-e2e

6ddbc51

Fix flaky e2e (cherry picked from commit 2545fc1)
