The post-upgrade E2E test at post_upgrade_test.go#L101-L112 is flaky, most likely because of timing issues. The test has a 1-minute timeout, which should in principle be sufficient for the operations it validates; however, under certain conditions the test fails, apparently because reconciliation or related operations take longer than expected.
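For orientation, the failing assertion is essentially a condition polled under a hard 1-minute ceiling. A minimal sketch of that shape is below; the condition body is a self-contained placeholder, not the real check, which presumably inspects the upgraded ClusterExtension's status:

```go
package e2e

import (
	"testing"
	"time"

	"github.com/stretchr/testify/assert"
	"github.com/stretchr/testify/require"
)

// TestPostUpgradeTimeoutShape illustrates only the timing structure under
// discussion: a condition polled every second with a hard 1-minute ceiling.
// The placeholder condition flips to true after 10s so the sketch runs on its
// own; the real test queries the cluster instead.
func TestPostUpgradeTimeoutShape(t *testing.T) {
	start := time.Now()
	extensionLooksHealthy := func() bool {
		return time.Since(start) > 10*time.Second // stand-in for a cluster check
	}
	require.EventuallyWithT(t, func(ct *assert.CollectT) {
		assert.True(ct, extensionLooksHealthy(), "extension not reconciled yet")
	}, time.Minute, time.Second) // everything must pass within the 1-minute budget
}
```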
Expected Behavior
The test should complete successfully within the allocated 1-minute timeout.
Observed Behavior
The test sometimes fails, likely due to timing issues during the post-upgrade process.
Analysis
Without additional context, it appears that the 1-minute timeout should be sufficient to cover the following steps:
- Catalog cache repopulation (potentially async, initiated after the OLM upgrade).
- **Reconciliation of the cluster extension**, which may involve exponential backoff while the catalog cache repopulates.
- Upgrade process:
  - Resolving and running the upgrade after cache repopulation.
  - Dry-running the upgrade to generate the desired manifest.
  - Running preflight checks.
  - If the checks pass, running the actual upgrade (similar to a `helm upgrade` without waiting for object health/readiness; see the sketch after this list).
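For context on the last two sub-steps, the flow resembles a Helm upgrade run twice: first as a dry run to render the manifest that the preflight checks inspect, then for real with waiting disabled. A rough sketch with the Helm Go SDK follows; it illustrates the behaviour described above and is not operator-controller's actual applier code:

```go
package applier

import (
	"helm.sh/helm/v3/pkg/action"
	"helm.sh/helm/v3/pkg/chart"
	"helm.sh/helm/v3/pkg/release"
)

// applyUpgrade mirrors the flow described above: a dry run renders the desired
// manifest (input for the preflight checks), then the real upgrade runs with
// Wait disabled so it does not block on object health or readiness.
func applyUpgrade(cfg *action.Configuration, name string, ch *chart.Chart, vals map[string]interface{}) (*release.Release, error) {
	dry := action.NewUpgrade(cfg)
	dry.DryRun = true
	rendered, err := dry.Run(name, ch, vals)
	if err != nil {
		return nil, err
	}
	_ = rendered.Manifest // preflight checks would inspect this rendered manifest

	up := action.NewUpgrade(cfg)
	up.Wait = false // do not wait for object health/readiness
	return up.Run(name, ch, vals)
}
```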
A potential root cause is exponential backoff while the catalog cache repopulates: each extra failed reconciliation attempt lengthens the requeue delay, which can push the total past the 1-minute threshold; see the sketch below.
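To make the numbers concrete, here is a minimal sketch of how per-item exponential backoff accumulates, assuming client-go's default rate-limiter parameters (5ms base delay doubling per failure, capped at 1000s). Whether operator-controller keeps those defaults is an assumption, not something confirmed here:

```go
package main

import (
	"fmt"
	"time"
)

func main() {
	// Assumed parameters: client-go's default ItemExponentialFailureRateLimiter
	// starts at 5ms and doubles per failure, capped at 1000s. If the controller
	// uses different values, the retry counts shift but the shape is the same.
	base, maxDelay := 5*time.Millisecond, 1000*time.Second

	var total time.Duration
	delay := base
	for retry := 1; retry <= 16; retry++ {
		if delay > maxDelay {
			delay = maxDelay
		}
		total += delay
		fmt.Printf("retry %2d: backoff %-10v cumulative %v\n", retry, delay, total)
		delay *= 2
	}
	// Cumulative backoff crosses the 1-minute mark around retry 14, so a couple
	// of extra failed reconciles while the catalog cache repopulates is enough
	// to exceed the test's budget.
}
```

If the controller configures a larger base delay than this default, the 1-minute budget is exhausted after correspondingly fewer retries.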
Steps to Reproduce
1. Push a PR to trigger the test.
2. Before the first run finishes, push another change. With more than one GitHub Actions workflow running at the same time, the test fails often enough to observe the flake.
Suggested Next Steps
- Investigate whether exponential backoff during cache repopulation is causing the delay.
- Add logging to identify where the test spends its time and which step(s) contribute most to the timeout (see the sketch after this list).
- Consider increasing the timeout to accommodate edge cases, or optimizing the underlying operations.
- Determine whether additional retries or adjustments to the reconciliation backoff logic are needed.
- Check whether GitHub Actions runners share resources across the many VMs created (speculative).
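For the logging item above, one option is to wrap each phase of the check in a timestamped helper so the CI output shows where the minute goes. A minimal sketch, assuming a standard Go test; the phase names and helper are illustrative, not taken from post_upgrade_test.go:

```go
package e2e

import (
	"testing"
	"time"
)

// timed logs how long an individual phase of the post-upgrade check takes, so
// a flaky CI run shows which step ate the 1-minute budget. The phase names
// used below are hypothetical placeholders.
func timed(t *testing.T, phase string, fn func()) {
	t.Helper()
	start := time.Now()
	fn()
	t.Logf("%s took %v", phase, time.Since(start))
}

func TestClusterExtensionAfterOLMUpgrade_timing(t *testing.T) {
	timed(t, "wait for catalog cache repopulation", func() { /* poll catalog status */ })
	timed(t, "wait for ClusterExtension reconcile", func() { /* poll extension conditions */ })
	timed(t, "verify upgraded bundle is installed", func() { /* assert on status/resources */ })
}
```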
More info: https://redhat-internal.slack.com/archives/C06KP34REFJ/p1736260935491789