-
Notifications
You must be signed in to change notification settings - Fork 22
Add retry on executing oc commands; fix healthcheck condition #212
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
The cluster on slow VM's is not ready in enough amount of time. Try to execute command again instead of fail. Signed-off-by: Daniel Pawlik <[email protected]>
The deployment will fail if the altnernative domain would not be found.
WalkthroughAdded retry loops, standardized success checks to Ansible's Changes
Estimated code review effort🎯 3 (Moderate) | ⏱️ ~25 minutes
Possibly related PRs
Suggested reviewers
Poem
Pre-merge checks and finishing touches✅ Passed checks (3 passed)
✨ Finishing touches🧪 Generate unit tests (beta)
📜 Recent review detailsConfiguration used: CodeRabbit UI Review profile: CHILL Plan: Pro 📒 Files selected for processing (2)
🚧 Files skipped from review as they are similar to previous changes (1)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (3)
🔇 Additional comments (1)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 3
🧹 Nitpick comments (1)
ansible/roles/deploy-crc-cloud/tasks/console_route.yaml (1)
7-9: Standardize retry configuration across both tasks.The two tasks use inconsistent retry counts: Task 1 hardcodes
retries: 60while Task 2 uses the parameterizedretries: "{{ max_retries }}". For consistency and maintainability, consider parameterizing the first task as well.Apply this diff to align the retry configuration:
- name: Get route to console custom ansible.builtin.shell: | oc get route console-custom -n openshift-console register: _route_console_custom until: _route_console_custom is succeeded - retries: 60 + retries: "{{ max_retries }}" delay: "{{ retry_delay }}" changed_when: false
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (12)
ansible/roles/deploy-crc-cloud/tasks/console_route.yaml(1 hunks)ansible/roles/deploy-crc-cloud/tasks/create_certificate_and_patch_secret.yaml(1 hunks)ansible/roles/deploy-crc-cloud/tasks/dnsmasq.yaml(2 hunks)ansible/roles/deploy-crc-cloud/tasks/login.yaml(1 hunks)ansible/roles/deploy-crc-cloud/tasks/patch_api_server.yaml(1 hunks)ansible/roles/deploy-crc-cloud/tasks/patch_default_route.yaml(1 hunks)ansible/roles/deploy-crc-cloud/tasks/patch_ingress_config.yaml(1 hunks)ansible/roles/deploy-crc-cloud/tasks/patch_pull_secret.yaml(1 hunks)ansible/roles/deploy-crc-cloud/tasks/pubkey.yaml(1 hunks)ansible/roles/deploy-crc-cloud/tasks/replace_default_ca.yaml(1 hunks)ansible/roles/deploy-crc-cloud/tasks/set_credentials.yaml(1 hunks)ansible/roles/deploy-crc-cloud/tasks/wait_cluster_become_healthy.yaml(1 hunks)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (3)
- GitHub Check: build (macOS-latest, 1.20)
- GitHub Check: build (ubuntu-latest, 1.20)
- GitHub Check: build-and-push-image
🔇 Additional comments (15)
ansible/roles/deploy-crc-cloud/tasks/console_route.yaml (2)
16-16: Verify jsonpath extraction handles missing or empty host values.The jsonpath expression
'{.spec.host}'will succeed (exit code 0) even if the host field is empty or missing, returning an empty string. Since the task usesuntil: _console_spec_host is succeeded, it may consider the operation successful even when the route is not fully ready.Consider adding a fact validation step or adjusting the condition to ensure the extracted host is not empty.
Please verify that downstream tasks properly validate the
_console_spec_host.stdoutvalue to ensure it is not empty before using it.
16-16: Variablesretry_delayandmax_retriesare already properly defined.Both variables are defined in
ansible/roles/deploy-crc-cloud/defaults/main.yaml(max_retries: 20, retry_delay: 15) and will not cause runtime errors.ansible/roles/deploy-crc-cloud/tasks/patch_ingress_config.yaml (1)
21-31: Well-structured retry pattern with proper idempotence guards.The addition of
register,retries,delay,until, andchanged_when: falseestablishes a robust retry mechanism that prevents transient API server issues from causing deployment failures. Theis succeededfilter is the correct Ansible approach for checking task success.ansible/roles/deploy-crc-cloud/tasks/create_certificate_and_patch_secret.yaml (1)
15-25: Consistent retry semantics appropriately applied to secret creation.The retry mechanism is suitable for TLS secret creation operations, which may be sensitive to transient API server delays. The
changed_when: falsesuppression is correct since this task regenerates the YAML from command output on each iteration.ansible/roles/deploy-crc-cloud/tasks/patch_default_route.yaml (1)
3-14: Idempotent retry pattern correctly applied to route patch operation.The addition of retry semantics with
changed_when: falseensures the task handles transient failures gracefully without generating spurious change notifications.ansible/roles/deploy-crc-cloud/tasks/set_credentials.yaml (1)
24-43: Consistent retry pattern applied to two-step credential workflow.Both the generic secret creation and replacement operations properly use retry semantics with
changed_when: falseto ensure idempotent behavior. The use of--dry-run=clientsyntax (line 30) is correct and future-compatible.ansible/roles/deploy-crc-cloud/tasks/pubkey.yaml (1)
21-30: Retry semantics correctly applied to SSH key injection via machineconfig patch.The idempotent retry pattern is well-suited for SSH key patching operations, as Kubernetes API server responsiveness during initial cluster boot can be variable. The
changed_when: falseis appropriate since the patch command's idempotent behavior means repeated invocations don't necessitate change notifications.ansible/roles/deploy-crc-cloud/tasks/wait_cluster_become_healthy.yaml (1)
5-23: Critical healthcheck fix: now detects both component unhealthiness and connection failures.The expanded
untilcondition (lines 11–13) correctly addresses the stated PR objective by checking for both unhealthy components ('False') and API connection errors ('was refused'). This prevents the wait loop from exiting prematurely when the API server is unreachable, which was causing downstream oc command failures.Please confirm that
'was refused'is the correct error string that appears inoc get cooutput when the API server connection is refused. You may want to verify this against the actual oc CLI behavior or test output.ansible/roles/deploy-crc-cloud/tasks/patch_api_server.yaml (1)
3-12: Robust retry pattern for critical API server certificate configuration.The idempotent retry semantics with
changed_when: falseis well-suited for API server patching, as this operation requires reliable API availability and may fail during cluster initialization. The namedCertificates patch will now gracefully recover from transient failures.ansible/roles/deploy-crc-cloud/tasks/replace_default_ca.yaml (3)
34-44: Consistent retry pattern for client CA configmap creation.The registration and retry logic with
changed_when: falsefollows the established pattern and appropriately handles transient API failures during configmap creation.
46-55: Idempotent retry pattern for custom CA API server patch.The retry semantics with
changed_when: falseensures the client CA configuration is reliably applied without spurious change reporting.
57-67: Verify deprecated--dry-runflag syntax compatibility.Line 62 uses
--dry-runwithout an argument, which is the legacy syntax that has been superseded by--dry-run=clientor--dry-run=serverin modern Kubernetes/OpenShift versions. While the pattern (dry-run | replace) is idiomatic for create-or-replace operations, confirm that the target oc CLI version still supports this syntax to avoid future breakage.Please verify the minimum OpenShift/oc version required by this playbook and confirm that
--dry-runwithout an argument is still supported. Consider updating to--dry-run=clientfor forward compatibility:ansible.builtin.shell: > oc create configmap admin-kubeconfig-client-ca -n openshift-config --from-file=ca-bundle.crt={{ ca_name }}-ca.crt - --dry-run -o yaml | oc replace -f - + --dry-run=client -o yaml | oc replace -f -ansible/roles/deploy-crc-cloud/tasks/patch_pull_secret.yaml (3)
4-4: Verify handling of undefinedopenshift_pull_secretvariable.The when condition uses
openshift_pull_secret | length > 0, which will error if the variable is undefined. Ansible typically allows undefined variables in length comparisons, but this should be verified depending on your Ansible version and configuration (especially iferror_on_undefined_varsis enabled).Consider adding a definedness check if strict error handling is preferred:
when: openshift_pull_secret is defined and openshift_pull_secret | length > 0Otherwise, confirm this pattern is acceptable in your environment.
3-15: Good refactoring of retry semantics and task completion condition.The changes systematically improve the task's robustness by:
- Moving the when condition to the task level (standard Ansible practice)
- Adding explicit retry logic with configurable retry count and delay
- Using a strict success condition (
until: _pull_secret_patch is succeeded) instead of the previous (implied) lenient checkThis directly addresses the PR objective that the health check condition was passing too early, causing downstream oc command failures. The
no_log: trueremains appropriately positioned to protect secrets.
11-15: Retry logic is correctly configured with variables properly defined.The retry pattern is well-structured and addresses the PR objective by ensuring the task only succeeds when the oc command actually succeeds (exit code 0). The
until: _pull_secret_patch is succeededcorrectly uses Ansible's fact test to validate task success.Verification confirms:
max_retriesandretry_delayare properly defined indefaults/main.yaml(max_retries: 20, retry_delay: 15)- The retry pattern is consistently used across 9+ task files in the role, demonstrating a well-established pattern
- The
changed_when: falseplacement is correct for preventing false "changed" reports during retriesNo changes required.
4b45ac7 to
bcb97fb
Compare
During the check, cluster would restart openshift-apiserver, which might fit the condition, but in later steps it would not be accessible for executing other comments. Wait properly for cluster to be stable. Signed-off-by: Daniel Pawlik <[email protected]>
bcb97fb to
a7f022f
Compare
The health check condition was passing to early, which causes errors in other places where OC command was executed.
Also updated dnsmasq configuration when additional_domain was set.
Summary by CodeRabbit
New Features
Bug Fixes
Improvements
✏️ Tip: You can customize this high-level summary in your review settings.