Add retry on executing oc commands; fix healthcheck condition by danpawlik · Pull Request #212 · crc-org/crc-cloud

danpawlik · 2025-12-12T09:27:14Z

The health check condition was passing too early, which caused errors in other places where oc commands were executed.
Also update the dnsmasq configuration when additional_domain is set.

Summary by CodeRabbit

New Features
- Support for alternative domain entries in DNS configuration and ensured dnsmasq service is enabled.
Bug Fixes
- Improved cluster health checks to detect connection refusals.
- Suppressed false-positive change reports during repeated operations.
Improvements
- Broad retry/resilience added to many deployment steps with configurable delay.
- Standardized success checks and more reliable host extraction logic.

On slow VMs the cluster is not ready in time. Retry the command instead of failing. Signed-off-by: Daniel Pawlik <[email protected]>

The deployment fails if the alternative domain is not found.

coderabbitai · 2025-12-12T09:27:29Z

Walkthrough

Added retry loops, standardized success checks on Ansible's `is succeeded` test, and replaced fixed delays with {{ retry_delay }} across deploy-crc-cloud tasks; switched some JSON extraction from jq to jsonpath; updated dnsmasq conditional block handling and refined cluster health wait conditions.

Changes

Cohort / File(s)	Summary
Route & Login `ansible/roles/deploy-crc-cloud/tasks/console_route.yaml`, `ansible/roles/deploy-crc-cloud/tasks/login.yaml`	Switched success checks to `is succeeded`, replaced fixed `delay: 10` with `delay: "{{ retry_delay }}"`, added retry parameters; console route host extraction changed from `jq` to a `jsonpath` expression.
DNS (dnsmasq) `ansible/roles/deploy-crc-cloud/tasks/dnsmasq.yaml`	Added conditional `blockinfile` to insert alternative domain mappings when defined; restart now triggers on `_dnsmasq_conf.changed` or `_dnsmasq_alt_domain_conf.changed`; ensured `dnsmasq` service is enabled in the systemd task.
Certificate & Secret Management `ansible/roles/deploy-crc-cloud/tasks/create_certificate_and_patch_secret.yaml`, `ansible/roles/deploy-crc-cloud/tasks/set_credentials.yaml`	Added `register`, `retries`, `delay`, `until: <var> is succeeded`, and `changed_when: false` to TLS secret and htpasswd secret tasks to enable retries and suppress intermediate change reporting.
Pull Secret & Pubkey Patch `ansible/roles/deploy-crc-cloud/tasks/patch_pull_secret.yaml`, `ansible/roles/deploy-crc-cloud/tasks/pubkey.yaml`	Moved `when` into task body (for pull secret), preserved `no_log: true`, and added `register`, retry params, `until: ... is succeeded`, and `changed_when: false` to make patch operations retryable without spurious change marks.
Cluster Patches & Ingress `ansible/roles/deploy-crc-cloud/tasks/patch_api_server.yaml`, `ansible/roles/deploy-crc-cloud/tasks/patch_default_route.yaml`, `ansible/roles/deploy-crc-cloud/tasks/patch_ingress_config.yaml`	Introduced result registration, `retries`/`delay` using variables, `until: <var> is succeeded`, and `changed_when: false` so patch commands run in retry loops and avoid repeated change reporting.
Replace Default CA `ansible/roles/deploy-crc-cloud/tasks/replace_default_ca.yaml`	Added registers and retry loops with `until: ... is succeeded` and `changed_when: false` for configmap creation and apiserver patch steps used when replacing the default CA.
Cluster Health Wait `ansible/roles/deploy-crc-cloud/tasks/wait_cluster_become_healthy.yaml`	Expanded `until` to ensure neither "'False'" nor "'was refused'" appear in component status output; removed previous `failed_when` and retained `ignore_errors: true`.
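
As a rough sketch of the DNS (dnsmasq) row above, not the role's actual tasks: the file path, the variables `alt_domain_example` and `host_ip_example`, and the `address=` entries are assumptions made for illustration. Only the register names and the restart condition are taken from the summary, and `_dnsmasq_conf` is assumed to be registered by an earlier task that is not shown.

```yaml
# Illustrative only: path and variable names are hypothetical.
- name: Add alternative domain mappings to dnsmasq (sketch)
  ansible.builtin.blockinfile:
    path: /etc/dnsmasq.d/example-crc.conf   # hypothetical path
    create: true
    mode: "0644"
    marker: "# {mark} ANSIBLE MANAGED BLOCK (alternative domain)"
    block: |
      address=/apps.{{ alt_domain_example }}/{{ host_ip_example }}
      address=/api.{{ alt_domain_example }}/{{ host_ip_example }}
  when: alt_domain_example is defined and alt_domain_example | length > 0
  register: _dnsmasq_alt_domain_conf
  become: true

# _dnsmasq_conf is assumed to come from an earlier task that writes the
# base configuration. A skipped task still registers changed: false, so the
# second half of the condition is safe when no alternative domain is set.
- name: Restart and enable dnsmasq when either config changed (sketch)
  ansible.builtin.systemd:
    name: dnsmasq
    state: restarted
    enabled: true
  when: _dnsmasq_conf.changed or _dnsmasq_alt_domain_conf.changed
  become: true
```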

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

Review focus:
- console_route.yaml: verify jsonpath equivalence to prior jq across response shapes.
- dnsmasq.yaml: confirm combined restart triggers and enabled flag behave on all platforms.
- Tasks with changed_when: false: ensure final successful operations still surface appropriate change at play level.
- wait_cluster_become_healthy.yaml: confirm added "'was refused'" check doesn't mask actionable failures.

Possibly related PRs

Change retries and delay time for cluster and resource verification #203 — also adjusts retry/wait behavior and retry_delay usage, likely related to timing/retry strategy changes.

Suggested reviewers

sdatko

Poem

🐇 I hopped through tasks both bold and bright,
Retries in my pouch to fix the night.
Delay now dances, checks succeed with grace,
Jsonpath found the host in place —
A rabbit's cheer for resilient space! 🎉

Pre-merge checks and finishing touches

✅ Passed checks (3 passed)

Check name	Status	Explanation
Description Check	✅ Passed	Check skipped - CodeRabbit’s high-level summary is enabled.
Title check	✅ Passed	The title accurately captures the main objectives of the PR: adding retry logic to oc commands and fixing the healthcheck condition.
Docstring Coverage	✅ Passed	No functions found in the changed files to evaluate docstring coverage. Skipping docstring coverage check.

📜 Recent review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between bcb97fb and a7f022f.

📒 Files selected for processing (2)

ansible/roles/deploy-crc-cloud/tasks/dnsmasq.yaml (2 hunks)
ansible/roles/deploy-crc-cloud/tasks/wait_cluster_become_healthy.yaml (1 hunks)

🚧 Files skipped from review as they are similar to previous changes (1)

ansible/roles/deploy-crc-cloud/tasks/dnsmasq.yaml

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (3)

GitHub Check: build (macOS-latest, 1.20)
GitHub Check: build-and-push-image
GitHub Check: build (ubuntu-latest, 1.20)

🔇 Additional comments (1)

ansible/roles/deploy-crc-cloud/tasks/wait_cluster_become_healthy.yaml (1)

16-19: Success message is OK, but it may mislead if the probe didn’t actually validate anything
If you keep ignore_errors: true, the message becomes more trustworthy once the probe is gated on rc + non-empty output (above).

coderabbitai

Actionable comments posted: 3

🧹 Nitpick comments (1)

ansible/roles/deploy-crc-cloud/tasks/console_route.yaml (1)

7-9: Standardize retry configuration across both tasks.

The two tasks use inconsistent retry counts: Task 1 hardcodes retries: 60 while Task 2 uses the parameterized retries: "{{ max_retries }}". For consistency and maintainability, consider parameterizing the first task as well.

Apply this diff to align the retry configuration:
- name: Get route to console custom
  ansible.builtin.shell: |
    oc get route console-custom -n openshift-console
  register: _route_console_custom
  until: _route_console_custom is succeeded
-  retries: 60
+  retries: "{{ max_retries }}"
  delay: "{{ retry_delay }}"
  changed_when: false

📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 1c98532 and 4b45ac7.

📒 Files selected for processing (12)

ansible/roles/deploy-crc-cloud/tasks/console_route.yaml (1 hunks)
ansible/roles/deploy-crc-cloud/tasks/create_certificate_and_patch_secret.yaml (1 hunks)
ansible/roles/deploy-crc-cloud/tasks/dnsmasq.yaml (2 hunks)
ansible/roles/deploy-crc-cloud/tasks/login.yaml (1 hunks)
ansible/roles/deploy-crc-cloud/tasks/patch_api_server.yaml (1 hunks)
ansible/roles/deploy-crc-cloud/tasks/patch_default_route.yaml (1 hunks)
ansible/roles/deploy-crc-cloud/tasks/patch_ingress_config.yaml (1 hunks)
ansible/roles/deploy-crc-cloud/tasks/patch_pull_secret.yaml (1 hunks)
ansible/roles/deploy-crc-cloud/tasks/pubkey.yaml (1 hunks)
ansible/roles/deploy-crc-cloud/tasks/replace_default_ca.yaml (1 hunks)
ansible/roles/deploy-crc-cloud/tasks/set_credentials.yaml (1 hunks)
ansible/roles/deploy-crc-cloud/tasks/wait_cluster_become_healthy.yaml (1 hunks)

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (3)

GitHub Check: build (macOS-latest, 1.20)
GitHub Check: build (ubuntu-latest, 1.20)
GitHub Check: build-and-push-image

🔇 Additional comments (15)

ansible/roles/deploy-crc-cloud/tasks/console_route.yaml (2)

16-16: Verify jsonpath extraction handles missing or empty host values.

The jsonpath expression '{.spec.host}' will succeed (exit code 0) even if the host field is empty or missing, returning an empty string. Since the task uses until: _console_spec_host is succeeded, it may consider the operation successful even when the route is not fully ready.

Consider adding a fact validation step or adjusting the condition to ensure the extracted host is not empty.

Please verify that downstream tasks properly validate the _console_spec_host.stdout value to ensure it is not empty before using it.
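
One possible way to act on this, sketched under the assumption that the task keeps the `_console_spec_host` register name and queries the `console-custom` route shown in the nitpick above. The second `until` entry is the addition: it keeps retrying while the jsonpath output is empty.

```yaml
# Sketch only: the task layout is assumed, not copied from the role.
- name: Get console route host (sketch)
  ansible.builtin.shell: >
    oc get route console-custom -n openshift-console
    -o jsonpath='{.spec.host}'
  register: _console_spec_host
  until:
    # Both entries must hold; a list under `until` is AND-ed like `when`.
    - _console_spec_host is succeeded
    - _console_spec_host.stdout | default('') | trim | length > 0
  retries: "{{ max_retries }}"
  delay: "{{ retry_delay }}"
  changed_when: false
```

With this guard, an exit code of 0 with an empty host no longer ends the loop early.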

16-16: Variables retry_delay and max_retries are already properly defined.

Both variables are defined in ansible/roles/deploy-crc-cloud/defaults/main.yaml (max_retries: 20, retry_delay: 15) and will not cause runtime errors.

ansible/roles/deploy-crc-cloud/tasks/patch_ingress_config.yaml (1)

21-31: Well-structured retry pattern with proper idempotence guards.

The addition of register, retries, delay, until, and changed_when: false establishes a robust retry mechanism that prevents transient API server issues from causing deployment failures. The is succeeded test is the correct Ansible approach for checking task success.
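
For reference, the pattern described here has roughly the following shape. This is a minimal sketch: the `oc get clusteroperators` command and the `_example_result` register name are stand-ins, not the role's actual patch task, while `max_retries` and `retry_delay` are the role defaults mentioned elsewhere in this review.

```yaml
# Sketch of the retry pattern; command and register name are placeholders.
- name: Run an oc command with retries (sketch)
  ansible.builtin.shell: >
    oc get clusteroperators
  register: _example_result
  # Re-run the command until it exits successfully.
  until: _example_result is succeeded
  retries: "{{ max_retries }}"
  delay: "{{ retry_delay }}"
  # Never report "changed" for this task.
  changed_when: false
```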

ansible/roles/deploy-crc-cloud/tasks/create_certificate_and_patch_secret.yaml (1)

15-25: Consistent retry semantics appropriately applied to secret creation.

The retry mechanism is suitable for TLS secret creation operations, which may be sensitive to transient API server delays. The changed_when: false suppression is correct since this task regenerates the YAML from command output on each iteration.

ansible/roles/deploy-crc-cloud/tasks/patch_default_route.yaml (1)

3-14: Idempotent retry pattern correctly applied to route patch operation.

The addition of retry semantics with changed_when: false ensures the task handles transient failures gracefully without generating spurious change notifications.

ansible/roles/deploy-crc-cloud/tasks/set_credentials.yaml (1)

24-43: Consistent retry pattern applied to two-step credential workflow.

Both the generic secret creation and replacement operations properly use retry semantics with changed_when: false to ensure idempotent behavior. The use of --dry-run=client syntax (line 30) is correct and future-compatible.

ansible/roles/deploy-crc-cloud/tasks/pubkey.yaml (1)

21-30: Retry semantics correctly applied to SSH key injection via machineconfig patch.

The idempotent retry pattern is well-suited for SSH key patching operations, as Kubernetes API server responsiveness during initial cluster boot can be variable. The changed_when: false is appropriate since the patch command's idempotent behavior means repeated invocations don't necessitate change notifications.

ansible/roles/deploy-crc-cloud/tasks/wait_cluster_become_healthy.yaml (1)

5-23: Critical healthcheck fix: now detects both component unhealthiness and connection failures.

The expanded until condition (lines 11–13) correctly addresses the stated PR objective by checking for both unhealthy components ('False') and API connection errors ('was refused'). This prevents the wait loop from exiting prematurely when the API server is unreachable, which was causing downstream oc command failures.

Please confirm that 'was refused' is the correct error string that appears in oc get co output when the API server connection is refused. You may want to verify this against the actual oc CLI behavior or test output.
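
A hedged sketch of what such a wait task could look like. The jsonpath query and the `_cluster_health` register name are assumptions, not the role's code; the `'was refused'` string is matched against stderr here on the assumption that oc reports a refused connection there, which still needs the verification requested above.

```yaml
# Sketch only: query and register name are hypothetical.
- name: Wait for cluster operators to report Available (sketch)
  ansible.builtin.shell: >
    oc get co
    -o jsonpath='{.items[*].status.conditions[?(@.type=="Available")].status}'
  register: _cluster_health
  until:
    # Gate on exit code and non-empty output before inspecting the content.
    - _cluster_health.rc | default(1) == 0
    - _cluster_health.stdout | default('') | trim | length > 0
    - "'False' not in _cluster_health.stdout | default('')"
    - "'was refused' not in _cluster_health.stderr | default('')"
  retries: "{{ max_retries }}"
  delay: "{{ retry_delay }}"
  changed_when: false
```

Gating on the exit code first means an unreachable API server can never satisfy the condition simply because its output lacks the string 'False'.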

ansible/roles/deploy-crc-cloud/tasks/patch_api_server.yaml (1)

3-12: Robust retry pattern for critical API server certificate configuration.

The idempotent retry semantics with changed_when: false is well-suited for API server patching, as this operation requires reliable API availability and may fail during cluster initialization. The namedCertificates patch will now gracefully recover from transient failures.

ansible/roles/deploy-crc-cloud/tasks/replace_default_ca.yaml (3)

34-44: Consistent retry pattern for client CA configmap creation.

The registration and retry logic with changed_when: false follows the established pattern and appropriately handles transient API failures during configmap creation.

46-55: Idempotent retry pattern for custom CA API server patch.

The retry semantics with changed_when: false ensures the client CA configuration is reliably applied without spurious change reporting.

57-67: Verify deprecated --dry-run flag syntax compatibility.

Line 62 uses --dry-run without an argument, which is the legacy syntax that has been superseded by --dry-run=client or --dry-run=server in modern Kubernetes/OpenShift versions. While the pattern (dry-run | replace) is idiomatic for create-or-replace operations, confirm that the target oc CLI version still supports this syntax to avoid future breakage.

Please verify the minimum OpenShift/oc version required by this playbook and confirm that --dry-run without an argument is still supported. Consider updating to --dry-run=client for forward compatibility:
  ansible.builtin.shell: >
    oc create configmap admin-kubeconfig-client-ca
    -n openshift-config
    --from-file=ca-bundle.crt={{ ca_name }}-ca.crt
-   --dry-run -o yaml | oc replace -f -
+   --dry-run=client -o yaml | oc replace -f -

ansible/roles/deploy-crc-cloud/tasks/patch_pull_secret.yaml (3)

4-4: Verify handling of undefined openshift_pull_secret variable.

The when condition uses openshift_pull_secret | length > 0, which fails with an undefined-variable error if openshift_pull_secret is not defined anywhere (Ansible's default behavior; see error_on_undefined_vars). If the role defaults already define the variable, the current condition is fine.

Consider adding a definedness check if strict error handling is preferred:
when: openshift_pull_secret is defined and openshift_pull_secret | length > 0
Otherwise, confirm this pattern is acceptable in your environment.
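
An equivalent guard, as a sketch, uses the `default` filter so that an undefined variable is treated as an empty string. The debug task below is only a stand-in for the real patch task.

```yaml
# Sketch: debug stands in for the actual pull secret patch task.
- name: Act only when a pull secret was provided (sketch)
  ansible.builtin.debug:
    msg: "pull secret provided"
  when: openshift_pull_secret | default('') | length > 0
```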

3-15: Good refactoring of retry semantics and task completion condition.

The changes systematically improve the task's robustness by:

- Moving the when condition to the task level (standard Ansible practice)
- Adding explicit retry logic with configurable retry count and delay
- Using a strict success condition (until: _pull_secret_patch is succeeded) instead of the previous (implied) lenient check

This directly addresses the PR objective that the health check condition was passing too early, causing downstream oc command failures. The no_log: true remains appropriately positioned to protect secrets.

11-15: Retry logic is correctly configured with variables properly defined.

The retry pattern is well-structured and addresses the PR objective by ensuring the task only succeeds when the oc command actually succeeds (exit code 0). The until: _pull_secret_patch is succeeded condition correctly uses Ansible's succeeded test to validate task success.

Verification confirms:

- max_retries and retry_delay are properly defined in defaults/main.yaml (max_retries: 20, retry_delay: 15)
- The retry pattern is consistently used across 9+ task files in the role, demonstrating a well-established pattern
- The changed_when: false placement is correct for preventing false "changed" reports during retries

No changes required.

ansible/roles/deploy-crc-cloud/tasks/dnsmasq.yaml

ansible/roles/deploy-crc-cloud/tasks/login.yaml

During the check, the cluster may restart openshift-apiserver. The condition can be satisfied at that moment, but in later steps the API is not accessible for executing other commands. Wait properly for the cluster to be stable. Signed-off-by: Daniel Pawlik <[email protected]>

danpawlik added 2 commits December 12, 2025 09:02

Add retry, delay and until on executing oc commands

2a9d97d

On slow VMs the cluster is not ready in time. Retry the command instead of failing. Signed-off-by: Daniel Pawlik <[email protected]>

Add alternative domain into dnsmasq if set

cabdbf5

The deployment fails if the alternative domain is not found.

coderabbitai bot reviewed Dec 12, 2025

ansible/roles/deploy-crc-cloud/tasks/dnsmasq.yaml (resolved)

ansible/roles/deploy-crc-cloud/tasks/login.yaml (resolved)

danpawlik force-pushed the add-retry-delay-until branch from 4b45ac7 to bcb97fb on December 12, 2025 09:38

danpawlik force-pushed the add-retry-delay-until branch from bcb97fb to a7f022f on December 12, 2025 09:48

praveenkumar merged commit c92f180 into crc-org:main Dec 12, 2025
7 of 10 checks passed

danpawlik deleted the add-retry-delay-until branch December 12, 2025 11:04

praveenkumar merged 3 commits into crc-org:main from danpawlik:add-retry-delay-until