-
Notifications
You must be signed in to change notification settings - Fork 26
Fix flaky e2e #373
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Fix flaky e2e #373
Conversation
Signed-off-by: Marc Sluiter <[email protected]>
Signed-off-by: Marc Sluiter <[email protected]>
Skipping CI for Draft Pull Request. |
WalkthroughThe Changes
Sequence Diagram(s)sequenceDiagram
participant Tester
participant Script
participant oc
Tester->>Script: Run test-e2e.sh
Script->>Script: Define retry()
loop For each critical oc command
Script->>Script: retry(<retries>, <wait>, "oc ...")
alt Command succeeds
Script->>oc: Execute oc command
else Command fails
Script->>Script: Wait and retry
end
end
Script->>Tester: Continue with test execution
Possibly related PRs
Poem
📜 Recent review detailsConfiguration used: CodeRabbit UI 📒 Files selected for processing (1)
🚧 Files skipped from review as they are similar to previous changes (1)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (2)
✨ Finishing Touches
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. 🪧 TipsChatThere are 3 ways to chat with CodeRabbit:
SupportNeed help? Create a ticket on our support page for assistance with any issues or questions. Note: Be mindful of the bot's finite context window. It's strongly recommended to break down tasks such as reading entire modules into smaller chunks. For a focused discussion, use review comments to chat about specific files and their changes, instead of using the PR comments. CodeRabbit Commands (Invoked using PR comments)
Other keywords and placeholders
CodeRabbit Configuration File (
|
[APPROVALNOTIFIER] This PR is APPROVED This pull-request has been approved by: slintes The full list of commands accepted by this bot can be found here. The pull request process is described here
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing |
@coderabbitai review |
✅ Actions performedReview triggered.
|
/test 4.18-openshift-e2e |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 0
🔭 Outside diff range comments (1)
hack/test-e2e.sh (1)
5-7
: Addpipefail
to catch hidden errors in pipelinesWith
set -e
alone, a failing command in a pipeline is masked if a later command succeeds (e.g.,grep … | oc create …
).
Enablingpipefail
makes the script terminate on the first failing element, improving reliability—especially important now that this script manipulates cluster-level resources.-set -e +set -e -o pipefail
🧹 Nitpick comments (2)
hack/test-e2e.sh (2)
53-70
:retry
helper is solid but can be hardened
set -e
remains in effect inside the function; that’s fine for theuntil
condition but any failure in the body of the loop (e.g.,sleep
) would still abort the script. Addset +e
locally or document the expectation.- Consider jitter (
sleep $((wait + RANDOM % 3))
) to avoid thundering-herd effects when many jobs retry simultaneously in CI.- Echoing
$*
unquoted can mangle arguments that contain spaces; use"$*"
inside quotes for diagnostics.Not a blocker, but worth polishing.
72-80
: Flag inconsistency in--type
syntaxYou mix
--type=merge
(lines 73-74) with--type merge
(line 79). Both work, yet consistency aids greppability and reduces cognitive load. Pick one form—prefer the equal-sign variant used elsewhere.-retry 3 5 oc patch featuregate cluster --type merge --patch \ +retry 3 5 oc patch featuregate cluster --type=merge --patch \
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (2)
.gitignore
(1 hunks)hack/test-e2e.sh
(1 hunks)
🧰 Additional context used
🧠 Learnings (1)
hack/test-e2e.sh (2)
Learnt from: mshitrit
PR: medik8s/node-healthcheck-operator#365
File: controllers/resources/manager.go:319-364
Timestamp: 2025-05-28T07:55:11.390Z
Learning: In the node-healthcheck-operator HandleHealthyNode method, when calcCrDeletionDelay fails with an error, the intended behavior is to log the error and proceed with CR deletion (treating it as "no delay configured") rather than aborting reconciliation. This prevents the system from getting stuck when delay calculations fail due to issues like malformed annotations.
Learnt from: mshitrit
PR: medik8s/node-healthcheck-operator#365
File: controllers/resources/manager.go:319-335
Timestamp: 2025-05-28T08:18:35.543Z
Learning: In the node-healthcheck-operator HandleHealthyNode method, the UpdateStatusNodeDelayedHealthy call with unsafe type cast to *NodeHealthCheck is actually safe because HealthyDelayContextKey is only set for NodeHealthCheck controllers, not MachineHealthCheck controllers. This means shortestDelay will always be 0 for MachineHealthCheck, preventing the unsafe cast line from being reached.
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (2)
- GitHub Check: e2e-k8s
- GitHub Check: build-and-unit-test
🔇 Additional comments (2)
.gitignore (1)
27-28
: Addition looks good – local IDE history stays out of GitIgnoring the
.history
folder is a sensible default and doesn’t conflict with any existing patterns. 👍hack/test-e2e.sh (1)
95-108
: Long-living clusters will stay “forever un-upgradeable”Cleanup that re-enables upgrades is commented out. This is fine for ephemeral CI clusters but dangerous if someone runs the script against a shared or persistent environment.
Recommendations:
- Re-enable the cleanup by default; add
--no-cleanup
flag for local debugging.- Or, at minimum, print a loud warning before exit when the feature gate is still enabled.
Please verify this aligns with intended usage.
/test 4.18-openshift-e2e |
This update introduces a retry function to the test-e2e.sh script, allowing patch commands for MachineConfigPools and feature gates to be retried up to three times with a five-second wait between attempts. This change aims to improve the reliability of the e2e tests by handling transient failures more gracefully. Additionally, the logic for copying the SelfNodeRemediation template has been updated to use the retry mechanism. Signed-off-by: Marc Sluiter <[email protected]>
/test 4.18-openshift-e2e |
4.18 actual test failed, others show no occurrence of the retry... /test 4.18-openshift-e2e |
This update introduces a conditional check to the test-e2e.sh script that exits the script early if it is not running in an OpenShift CI environment. This change aims to prevent unnecessary execution of MachineHealthCheck tests outside of the OCP CI context, preventing failures on k8s environments. Signed-off-by: Marc Sluiter <[email protected]>
/test 4.18-openshift-e2e |
the fix worked in 4.19:
There is another issue obviously though.... |
…nce check This change updates the timeout from 2 seconds to 5 seconds and the polling interval from 100 milliseconds to 500 milliseconds in the e2e test suite. The adjustment aims to improve the reliability of the test by allowing more time for resources to become available before asserting their existence. Signed-off-by: Marc Sluiter <[email protected]>
/test 4.14-openshift-e2e |
/lgtm |
looks good now :) one more try /test 4.16-openshift-e2e |
…statements This update adds a health check for the cluster after running MachineHealthCheck tests, ensuring the cluster is healthy before proceeding. The echo statements have been reordered for better clarity, with the preparation message now appearing after the health check confirmation. Signed-off-by: Marc Sluiter <[email protected]>
This update modifies the resource existence check in the e2e test suite by moving the retrieval logic inside the Eventually function. It ensures that missing resources are created and adds a brief sleep to allow for resource availability before asserting their existence. This change aims to enhance the reliability of the tests by handling resource creation more effectively. Signed-off-by: Marc Sluiter <[email protected]>
/lgtm |
|
||
echo "Preparing MachineHealthCheck e2e tests" | ||
|
||
echo "Pausing MachineConfigPools in order to prevent reboots after enabling feature gate" | ||
oc patch machineconfigpool worker --type=merge --patch='{"spec":{"paused":true}}' | ||
oc patch machineconfigpool master --type=merge --patch='{"spec":{"paused":true}}' | ||
retry 3 5 oc patch machineconfigpool worker --type=merge --patch='{"spec":{"paused":true}}' |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
NIT: Any conclusion why retrying 3 times for 5 seconds is sufficient here and in other places? I don't mind having more time, but I am just curious about the troubleshooting part of the flaky e2e :)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
just some random numbers 🤷🏼
/lgtm
OCP 4.16+ since last week :) |
/hold cancel |
/cherry-pick release-0.9 |
@slintes: #373 failed to apply on top of branch "release-0.9":
In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. |
Fix flaky e2e (cherry picked from commit 2545fc1)
Why we need this PR
Commands for preparing MHC tests fail regularly which cuases flaky e2e test
Changes made
Which issue(s) this PR fixes
RHWA-100
Summary by CodeRabbit
Summary by CodeRabbit
Chores
.history
directory from version control.Refactor