Fix flaky e2e #373

Merged
merged 7 commits into from
Jul 14, 2025

slintes
Member

@slintes slintes commented Jul 8, 2025

Why we need this PR

Commands for preparing MHC tests fail regularly, which causes flaky e2e tests.

Changes made

  • wait for healthy cluster
  • add retry to relevant commands in e2e script
  • remove OCP version check (we run on OCP 4.14+ only now)
  • improve retry when creating resources in e2e suite
  • update gitignore

Which issue(s) this PR fixes

RHWA-100

Summary by CodeRabbit

  • Chores

    • Updated file ignore settings to exclude the .history directory from version control.
  • Refactor

    • Improved reliability of end-to-end test scripts by adding automatic retries to critical commands.
    • Simplified test execution by removing version checks and relying on environment variables.
    • Extended wait times and enhanced retry logic in test readiness checks to improve stability.

Contributor

openshift-ci bot commented Jul 8, 2025

Skipping CI for Draft Pull Request.
If you want CI signal for your change, please convert it to an actual PR.
You can still manually trigger a test run with /test all

coderabbitai bot commented Jul 8, 2025

Walkthrough

The .gitignore file was updated to ignore the .history directory. In the hack/test-e2e.sh script, OpenShift version checks were removed and replaced with simpler environment gating. A new retry function was added to wrap critical oc commands, adding retry logic for improved robustness. The e2e test suite increased timeouts and polling intervals for resource readiness checks and moved resource get-or-create logic inside an Eventually block.

Changes

File(s) | Change Summary
.gitignore | Added .history directory to the ignore list.
hack/test-e2e.sh | Removed OpenShift version checks; added retry function; wrapped critical oc commands with retry logic and simplified environment gating.
e2e/e2e_suite_test.go | Refactored resource existence checks to run inside an Eventually block with increased timeout (2s → 5s) and polling interval (100ms → 500ms); changed error assertions to use local Gomega instance.

Sequence Diagram(s)

sequenceDiagram
    participant Tester
    participant Script
    participant oc

    Tester->>Script: Run test-e2e.sh
    Script->>Script: Define retry()
    loop For each critical oc command
        Script->>Script: retry(<retries>, <wait>, "oc ...")
        alt Command succeeds
            Script->>oc: Execute oc command
        else Command fails
            Script->>Script: Wait and retry
        end
    end
    Script->>Tester: Continue with test execution

Possibly related PRs

  • Fix flaky tests #370: Improves e2e test synchronization by fixing flaky resource existence checks, related to the increased timeouts and polling intervals in e2e tests here.

Poem

In the warren of code, a history fades,
As .history hides in the shade.
With retries in place, the tests hop anew,
No version checks block what rabbits pursue.
Three leaps for each command, just in case—
Robustness and order, in this burrowed space! 🐇


📜 Recent review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between a8a70f2 and 0afd69c.

📒 Files selected for processing (1)
  • e2e/e2e_suite_test.go (1 hunks)
🚧 Files skipped from review as they are similar to previous changes (1)
  • e2e/e2e_suite_test.go
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (2)
  • GitHub Check: e2e-k8s
  • GitHub Check: build-and-unit-test

Contributor

openshift-ci bot commented Jul 8, 2025

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: slintes

@openshift-ci openshift-ci bot added the approved label Jul 8, 2025
@slintes
Member Author

slintes commented Jul 8, 2025

@coderabbitai review

coderabbitai bot commented Jul 8, 2025

✅ Actions performed

Review triggered.

Note: CodeRabbit is an incremental review system and does not re-review already reviewed commits. This command is applicable only when automatic reviews are paused.

@slintes
Member Author

slintes commented Jul 8, 2025

/test 4.18-openshift-e2e

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 0

🔭 Outside diff range comments (1)
hack/test-e2e.sh (1)

5-7: Add pipefail to catch hidden errors in pipelines

With set -e alone, a failing command in a pipeline is masked if the last command in that pipeline succeeds (e.g., grep … | oc create …).
Enabling pipefail makes the pipeline return the status of the last command that failed, so set -e then aborts the script. This improves reliability, which is especially important now that this script manipulates cluster-level resources.

-set -e
+set -e -o pipefail
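
To make the masking concrete, here is a minimal sketch that compares a pipeline's exit status with and without pipefail. It is not taken from the PR: `false | cat` is only a stand-in for a failing producer piped into a succeeding consumer, and the variable names are illustrative.

```shell
#!/usr/bin/env bash
# Illustrative only: `false | cat` stands in for a pipeline such as `grep ... | oc create ...`.

# Without pipefail the pipeline reports the status of its last command (cat),
# so the failure of `false` is hidden.
set +o pipefail
status_without=0
false | cat || status_without=$?

# With pipefail the pipeline reports the last non-zero status (from `false`).
set -o pipefail
status_with=0
false | cat || status_with=$?
set +o pipefail

echo "without pipefail: ${status_without}, with pipefail: ${status_with}"
# prints "without pipefail: 0, with pipefail: 1"
```

The `|| status=$?` form is used so the sketch also runs under set -e without aborting.
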
🧹 Nitpick comments (2)
hack/test-e2e.sh (2)

53-70: retry helper is solid but can be hardened

  1. set -e remains in effect inside the function; that’s fine for the until condition but any failure in the body of the loop (e.g., sleep) would still abort the script. Add set +e locally or document the expectation.
  2. Consider jitter (sleep $((wait + RANDOM % 3))) to avoid thundering-herd effects when many jobs retry simultaneously in CI.
  3. Echoing $* unquoted can mangle arguments that contain spaces; use "$*" inside quotes for diagnostics.

Not a blocker, but worth polishing.
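
For reference, a retry helper along these lines could look like the sketch below. The PR's actual implementation is not shown in this thread, so everything here is an assumption: the `retry <retries> <wait> <command...>` calling convention is inferred from `retry 3 5 oc patch ...`, `<retries>` is treated as the total number of attempts, and the log wording mirrors the CI output quoted in this conversation.

```shell
#!/usr/bin/env bash
# Hypothetical sketch of a retry helper; not the implementation from the PR.
# Usage: retry <retries> <wait-seconds> <command> [args...]
# Assumption: <retries> is the total number of attempts.
retry() {
  local retries=$1 wait=$2
  shift 2
  local attempt=1
  # "$@" keeps arguments that contain spaces intact; as the `until`
  # condition, a failing command does not trip `set -e`.
  until "$@"; do
    if [ "$attempt" -ge "$retries" ]; then
      echo "Command failed after ${retries} attempts: $*" >&2
      return 1
    fi
    echo "Retry ${attempt}/${retries} failed. Retrying in ${wait} seconds..." >&2
    sleep "$wait"
    attempt=$((attempt + 1))
  done
}
```

Jitter is left out to keep the sketch deterministic; the suggested `sleep $((wait + RANDOM % 3))` would replace the plain `sleep "$wait"` line.
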


72-80: Flag inconsistency in --type syntax

You mix --type=merge (lines 73-74) with --type merge (line 79). Both work, yet consistency aids greppability and reduces cognitive load. Pick one form—prefer the equal-sign variant used elsewhere.

-retry 3 5 oc patch featuregate cluster --type merge --patch \
+retry 3 5 oc patch featuregate cluster --type=merge --patch \
📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between c3ce42b and a7a77a6.

📒 Files selected for processing (2)
  • .gitignore (1 hunks)
  • hack/test-e2e.sh (1 hunks)
🧰 Additional context used
🧠 Learnings (1)
hack/test-e2e.sh (2)
Learnt from: mshitrit
PR: medik8s/node-healthcheck-operator#365
File: controllers/resources/manager.go:319-364
Timestamp: 2025-05-28T07:55:11.390Z
Learning: In the node-healthcheck-operator HandleHealthyNode method, when calcCrDeletionDelay fails with an error, the intended behavior is to log the error and proceed with CR deletion (treating it as "no delay configured") rather than aborting reconciliation. This prevents the system from getting stuck when delay calculations fail due to issues like malformed annotations.
Learnt from: mshitrit
PR: medik8s/node-healthcheck-operator#365
File: controllers/resources/manager.go:319-335
Timestamp: 2025-05-28T08:18:35.543Z
Learning: In the node-healthcheck-operator HandleHealthyNode method, the UpdateStatusNodeDelayedHealthy call with unsafe type cast to *NodeHealthCheck is actually safe because HealthyDelayContextKey is only set for NodeHealthCheck controllers, not MachineHealthCheck controllers. This means shortestDelay will always be 0 for MachineHealthCheck, preventing the unsafe cast line from being reached.
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (2)
  • GitHub Check: e2e-k8s
  • GitHub Check: build-and-unit-test
🔇 Additional comments (2)
.gitignore (1)

27-28: Addition looks good – local IDE history stays out of Git

Ignoring the .history folder is a sensible default and doesn’t conflict with any existing patterns. 👍

hack/test-e2e.sh (1)

95-108: Long-living clusters will stay “forever un-upgradeable”

Cleanup that re-enables upgrades is commented out. This is fine for ephemeral CI clusters but dangerous if someone runs the script against a shared or persistent environment.

Recommendations:

  • Re-enable the cleanup by default; add --no-cleanup flag for local debugging.
  • Or, at minimum, print a loud warning before exit when the feature gate is still enabled.

Please verify this aligns with intended usage.
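
One way the first recommendation could be wired up is sketched below. This is hypothetical: the `--no-cleanup` flag and both function names do not exist in the PR, and the real cleanup commands are not shown in this thread, so the handler only prints messages.

```shell
#!/usr/bin/env bash
# Hypothetical sketch: neither --no-cleanup nor these function names exist in the PR.

# Prints "false" if --no-cleanup is among the arguments, "true" otherwise.
cleanup_enabled() {
  local arg
  for arg in "$@"; do
    if [[ "$arg" == "--no-cleanup" ]]; then
      echo "false"
      return 0
    fi
  done
  echo "true"
}

# Exit handler; placeholder messages stand in for the real cleanup commands.
on_exit() {
  if [[ "$1" == "true" ]]; then
    echo "Cleaning up: re-enable upgrades here"
  else
    echo "WARNING: cleanup skipped, the cluster may remain un-upgradeable" >&2
  fi
}

# Example wiring (commented out so sourcing this sketch has no side effects):
# trap "on_exit $(cleanup_enabled "$@")" EXIT
```

Cleaning up by default and opting out explicitly keeps shared or persistent clusters safe, while local debugging can still skip the cleanup.
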

@slintes
Member Author

slintes commented Jul 8, 2025

/test 4.18-openshift-e2e
/test 4.19-openshift-e2e
/test 4.20-openshift-e2e

This update introduces a retry function to the test-e2e.sh script, allowing patch commands for MachineConfigPools and feature gates to be retried up to three times with a five-second wait between attempts. This change aims to improve the reliability of the e2e tests by handling transient failures more gracefully. Additionally, the logic for copying the SelfNodeRemediation template has been updated to use the retry mechanism.

Signed-off-by: Marc Sluiter <[email protected]>
@slintes
Member Author

slintes commented Jul 8, 2025

/test 4.18-openshift-e2e
/test 4.19-openshift-e2e
/test 4.20-openshift-e2e

@slintes
Member Author

slintes commented Jul 9, 2025

4.18 actual test failed, others show no occurrence of the retry...

/test 4.18-openshift-e2e
/test 4.19-openshift-e2e
/test 4.20-openshift-e2e

This update introduces a conditional check to the test-e2e.sh script that exits the script early if it is not running in an OpenShift CI environment. This change aims to prevent unnecessary execution of MachineHealthCheck tests outside of the OCP CI context, preventing failures on k8s environments.

Signed-off-by: Marc Sluiter <[email protected]>
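
The early-exit check described in this commit message could be sketched roughly as follows. The exact condition used in the PR is not shown here; the sketch assumes that OpenShift CI jobs export `OPENSHIFT_CI=true`, and the function name `in_openshift_ci` is made up for illustration.

```shell
#!/usr/bin/env bash
# Assumption: OpenShift CI jobs export OPENSHIFT_CI=true.
# The PR's actual check is not shown in this thread.

# Succeeds only when running inside OpenShift CI.
in_openshift_ci() {
  [[ "${OPENSHIFT_CI:-}" == "true" ]]
}

# Example gating (commented out so the sketch never exits the calling shell):
# if ! in_openshift_ci; then
#   echo "Not running in OpenShift CI, skipping MachineHealthCheck e2e tests"
#   exit 0
# fi
```

The `${OPENSHIFT_CI:-}` default keeps the check safe under `set -u` when the variable is unset, as it would be on a plain k8s environment.
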
@slintes
Member Author

slintes commented Jul 9, 2025

/test 4.18-openshift-e2e
/test 4.19-openshift-e2e
/test 4.20-openshift-e2e

@slintes
Member Author

slintes commented Jul 9, 2025

the fix worked in 4.19:

Copying SNR template to openshift-machine-api
Error from server: error when creating "STDIN": rpc error: code = Unavailable desc = error reading from server: read tcp 10.0.8.141:53128->10.0.6.248:2379: read: connection reset by peer
Retry 1/3 failed. Retrying in 5 seconds...
...
Copying SNR template to openshift-machine-api
selfnoderemediationtemplate.self-node-remediation.medik8s.io/self-node-remediation-automatic-strategy-template created

There is another issue obviously though....

…nce check

This change updates the timeout from 2 seconds to 5 seconds and the polling interval from 100 milliseconds to 500 milliseconds in the e2e test suite. The adjustment aims to improve the reliability of the test by allowing more time for resources to become available before asserting their existence.

Signed-off-by: Marc Sluiter <[email protected]>
@slintes
Member Author

slintes commented Jul 9, 2025

/test 4.14-openshift-e2e
/test 4.15-openshift-e2e
/test 4.16-openshift-e2e
/test 4.17-openshift-e2e
/test 4.18-openshift-e2e
/test 4.19-openshift-e2e
/test 4.20-openshift-e2e

@mshitrit
Member

/lgtm

@openshift-ci openshift-ci bot added the lgtm label Jul 10, 2025
@slintes slintes marked this pull request as ready for review July 10, 2025 07:40
@slintes slintes changed the title WIP Fix flaky e2e Fix flaky e2e Jul 10, 2025
@openshift-ci openshift-ci bot requested review from clobrano and mshitrit July 10, 2025 07:41
@openshift-ci openshift-ci bot removed the lgtm label Jul 11, 2025
@slintes
Member Author

slintes commented Jul 11, 2025

looks good now :)

one more try

/test 4.16-openshift-e2e
/test 4.17-openshift-e2e
/test 4.18-openshift-e2e
/test 4.19-openshift-e2e
/test 4.20-openshift-e2e

…statements

This update adds a health check for the cluster after running MachineHealthCheck tests, ensuring the cluster is healthy before proceeding. The echo statements have been reordered for better clarity, with the preparation message now appearing after the health check confirmation.

Signed-off-by: Marc Sluiter <[email protected]>
This update modifies the resource existence check in the e2e test suite by moving the retrieval logic inside the Eventually function. It ensures that missing resources are created and adds a brief sleep to allow for resource availability before asserting their existence. This change aims to enhance the reliability of the tests by handling resource creation more effectively.

Signed-off-by: Marc Sluiter <[email protected]>
@mshitrit
Member

/lgtm
/hold
giving others a chance to review as well.
feel free to unhold.


echo "Preparing MachineHealthCheck e2e tests"

echo "Pausing MachineConfigPools in order to prevent reboots after enabling feature gate"
-oc patch machineconfigpool worker --type=merge --patch='{"spec":{"paused":true}}'
-oc patch machineconfigpool master --type=merge --patch='{"spec":{"paused":true}}'
+retry 3 5 oc patch machineconfigpool worker --type=merge --patch='{"spec":{"paused":true}}'
Member

NIT: Any conclusion on why retrying 3 times with a 5-second wait is sufficient here and in other places? I don't mind having more time, but I am just curious about the troubleshooting part of the flaky e2e :)

Member Author

just some random numbers 🤷🏼

@razo7
Member

razo7 commented Jul 13, 2025

/lgtm

> remove OCP version check (we run on OCP 4.14+ only now)

OCP 4.16+ since last week :)

@slintes
Member Author

slintes commented Jul 14, 2025

/hold cancel

@openshift-merge-bot openshift-merge-bot bot merged commit 2545fc1 into medik8s:main Jul 14, 2025
24 checks passed
@slintes
Member Author

slintes commented Jul 14, 2025

/cherry-pick release-0.9

@openshift-cherrypick-robot

@slintes: #373 failed to apply on top of branch "release-0.9":

Applying: Clean up test-e2e.sh by removing OCP version checks for MHC tests
Applying: Update .gitignore to exclude .history directory
Applying: Add retry mechanism to test-e2e.sh for MHC test commands
Applying: Add check for OpenShift CI in test-e2e.sh to skip MHC tests
Applying: Increase timeout and polling interval in e2e test for resource existence check
Using index info to reconstruct a base tree...
M	e2e/e2e_suite_test.go
Falling back to patching base and 3-way merge...
Auto-merging e2e/e2e_suite_test.go
CONFLICT (content): Merge conflict in e2e/e2e_suite_test.go
error: Failed to merge in the changes.
hint: Use 'git am --show-current-patch=diff' to see the failed patch
hint: When you have resolved this problem, run "git am --continue".
hint: If you prefer to skip this patch, run "git am --skip" instead.
hint: To restore the original branch and stop patching, run "git am --abort".
hint: Disable this message with "git config advice.mergeConflict false"
Patch failed at 0005 Increase timeout and polling interval in e2e test for resource existence check

In response to this:

/cherry-pick release-0.9

slintes pushed a commit to slintes/node-healthcheck-operator that referenced this pull request Jul 14, 2025
Fix flaky e2e

(cherry picked from commit 2545fc1)

4 participants