@danpawlik danpawlik (Collaborator) commented Dec 12, 2025

The health check condition was passing too early, which caused errors in later steps where oc commands were executed.
Also updated the dnsmasq configuration for the case where additional_domain is set.

Summary by CodeRabbit

  • New Features

    • Support for alternative domain entries in DNS configuration and ensured dnsmasq service is enabled.
  • Bug Fixes

    • Improved cluster health checks to detect connection refusals.
    • Suppressed false-positive change reports during repeated operations.
  • Improvements

    • Broad retry/resilience added to many deployment steps with configurable delay.
    • Standardized success checks and more reliable host extraction logic.

On slow VMs the cluster is not ready within the expected time.
Retry the command instead of failing.

Signed-off-by: Daniel Pawlik <[email protected]>
The deployment will fail if the alternative domain
is not found.
@coderabbitai coderabbitai bot commented Dec 12, 2025

Walkthrough

Added retry loops, standardized success checks on Ansible's is succeeded test, and replaced fixed delays with {{ retry_delay }} across the deploy-crc-cloud tasks; switched some JSON extraction from jq to jsonpath; updated dnsmasq conditional block handling; and refined the cluster health wait conditions.
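
As a rough illustration of the retry pattern the walkthrough describes, the sketch below wraps a read-only oc call in a retry loop. The task name, the command, and the registered variable `_example_probe` are hypothetical; only `max_retries` and `retry_delay` come from the role defaults quoted later in this review.

```yaml
# Illustrative sketch only, not copied from the role.
- name: Probe the API server, retrying while it is still starting
  ansible.builtin.shell: |
    oc get nodes
  register: _example_probe              # hypothetical variable name
  until: _example_probe is succeeded    # true once oc exits with rc 0
  retries: "{{ max_retries }}"          # number of retries on failure
  delay: "{{ retry_delay }}"            # seconds to wait between attempts
  changed_when: false                   # a read-only probe never changes state
```

Parameterizing `retries` and `delay` means a slow VM can be given more time by overriding two variables instead of editing every task.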

Changes

| Cohort / File(s) | Summary |
| --- | --- |
| Route & Login: ansible/roles/deploy-crc-cloud/tasks/console_route.yaml, ansible/roles/deploy-crc-cloud/tasks/login.yaml | Switched success checks to is succeeded, replaced fixed delay: 10 with delay: "{{ retry_delay }}", added retry parameters; console route host extraction changed from jq to a jsonpath expression. |
| DNS (dnsmasq): ansible/roles/deploy-crc-cloud/tasks/dnsmasq.yaml | Added conditional blockinfile to insert alternative domain mappings when defined; restart now triggers on _dnsmasq_conf.changed or _dnsmasq_alt_domain_conf.changed; ensured dnsmasq service is enabled in the systemd task. |
| Certificate & Secret Management: ansible/roles/deploy-crc-cloud/tasks/create_certificate_and_patch_secret.yaml, ansible/roles/deploy-crc-cloud/tasks/set_credentials.yaml | Added register, retries, delay, until: <var> is succeeded, and changed_when: false to TLS secret and htpasswd secret tasks to enable retries and suppress intermediate change reporting. |
| Pull Secret & Pubkey Patch: ansible/roles/deploy-crc-cloud/tasks/patch_pull_secret.yaml, ansible/roles/deploy-crc-cloud/tasks/pubkey.yaml | Moved when into task body (for pull secret), preserved no_log: true, and added register, retry params, until: ... is succeeded, and changed_when: false to make patch operations retryable without spurious change marks. |
| Cluster Patches & Ingress: ansible/roles/deploy-crc-cloud/tasks/patch_api_server.yaml, ansible/roles/deploy-crc-cloud/tasks/patch_default_route.yaml, ansible/roles/deploy-crc-cloud/tasks/patch_ingress_config.yaml | Introduced result registration, retries/delay using variables, until: <var> is succeeded, and changed_when: false so patch commands run in retry loops and avoid repeated change reporting. |
| Replace Default CA: ansible/roles/deploy-crc-cloud/tasks/replace_default_ca.yaml | Added registers and retry loops with until: ... is succeeded and changed_when: false for configmap creation and apiserver patch steps used when replacing the default CA. |
| Cluster Health Wait: ansible/roles/deploy-crc-cloud/tasks/wait_cluster_become_healthy.yaml | Expanded until to ensure neither "'False'" nor "'was refused'" appear in component status output; removed previous failed_when and retained ignore_errors: true. |
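
The dnsmasq row can be pictured roughly as follows. This is a sketch under stated assumptions, not the role's actual tasks: the config path, the variable name `alternative_domain`, and the IP address are hypothetical, and `_dnsmasq_conf` is assumed to be registered by an earlier task that is not shown.

```yaml
# Sketch only: path, variable name, and address are assumptions.
- name: Add alternative domain entries to the dnsmasq config
  become: true
  ansible.builtin.blockinfile:
    path: /etc/dnsmasq.d/example.conf
    marker: "# {mark} ANSIBLE MANAGED BLOCK alternative domain"
    block: |
      address=/apps.{{ alternative_domain }}/192.0.2.10
      address=/api.{{ alternative_domain }}/192.0.2.10
  register: _dnsmasq_alt_domain_conf
  when: alternative_domain is defined

- name: Restart and enable dnsmasq when either config file changed
  become: true
  ansible.builtin.systemd:
    name: dnsmasq
    state: restarted
    enabled: true
  # _dnsmasq_conf comes from an earlier task (not shown here).
  when: _dnsmasq_conf.changed or _dnsmasq_alt_domain_conf.changed
```

A skipped task still registers a result with `changed: false`, so the restart condition stays safe to evaluate when no alternative domain is defined.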

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

  • Review focus:
    • console_route.yaml: verify jsonpath equivalence to prior jq across response shapes.
    • dnsmasq.yaml: confirm combined restart triggers and enabled flag behave on all platforms.
    • Tasks with changed_when: false: ensure final successful operations still surface appropriate change at play level.
    • wait_cluster_become_healthy.yaml: confirm added "'was refused'" check doesn't mask actionable failures.
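
On the changed_when: false point, one possible alternative (a sketch, not something this PR does) is to derive the change status from the command output, since oc patch appends "(no change)" when the object was already in the desired state. The resource names and the variable `_example_patch` are hypothetical.

```yaml
# Sketch only: resource names and the registered variable are hypothetical.
- name: Patch a resource and report a change only when oc applied one
  ansible.builtin.shell: >
    oc patch configmap example-config -n example-namespace
    --type=merge -p '{"data":{"key":"value"}}'
  register: _example_patch
  until: _example_patch is succeeded
  retries: "{{ max_retries }}"
  delay: "{{ retry_delay }}"
  # oc prints "(no change)" when the patch did not modify the object.
  changed_when: "'(no change)' not in _example_patch.stdout"
```

This keeps the retry loop while letting the play recap still show a real change, at the cost of depending on the CLI's output wording.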

Suggested reviewers

  • sdatko

Poem

🐇 I hopped through tasks both bold and bright,
Retries in my pouch to fix the night.
Delay now dances, checks succeed with grace,
Jsonpath found the host in place —
A rabbit's cheer for resilient space! 🎉

Pre-merge checks and finishing touches

✅ Passed checks (3 passed)
| Check name | Status | Explanation |
| --- | --- | --- |
| Description Check | ✅ Passed | Check skipped - CodeRabbit’s high-level summary is enabled. |
| Title check | ✅ Passed | The title accurately captures the main objectives of the PR: adding retry logic to oc commands and fixing the healthcheck condition. |
| Docstring Coverage | ✅ Passed | No functions found in the changed files to evaluate docstring coverage. Skipping docstring coverage check. |

📜 Recent review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between bcb97fb and a7f022f.

📒 Files selected for processing (2)
  • ansible/roles/deploy-crc-cloud/tasks/dnsmasq.yaml (2 hunks)
  • ansible/roles/deploy-crc-cloud/tasks/wait_cluster_become_healthy.yaml (1 hunks)
🚧 Files skipped from review as they are similar to previous changes (1)
  • ansible/roles/deploy-crc-cloud/tasks/dnsmasq.yaml
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (3)
  • GitHub Check: build (macOS-latest, 1.20)
  • GitHub Check: build-and-push-image
  • GitHub Check: build (ubuntu-latest, 1.20)
🔇 Additional comments (1)
ansible/roles/deploy-crc-cloud/tasks/wait_cluster_become_healthy.yaml (1)

16-19: Success message is OK, but it may mislead if the probe didn’t actually validate anything
If you keep ignore_errors: true, the message becomes more trustworthy once the probe is gated on rc + non-empty output (above).
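
One way to make the success message depend on a real result is to gate it on the registered probe, roughly as sketched here. The variable name `_cluster_health` and the message text are assumptions, not taken from the file under review.

```yaml
# Sketch only: _cluster_health is a hypothetical registered variable.
- name: Report cluster health only when the probe really passed
  ansible.builtin.debug:
    msg: "Cluster is healthy"
  when:
    - _cluster_health is succeeded                 # false if the probe failed, even with ignore_errors
    - _cluster_health.stdout | trim | length > 0   # the probe returned some output
```

With ignore_errors: true the play continues after a failed probe, so without a guard like this the message would be printed either way.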


@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 3

🧹 Nitpick comments (1)
ansible/roles/deploy-crc-cloud/tasks/console_route.yaml (1)

7-9: Standardize retry configuration across both tasks.

The two tasks use inconsistent retry counts: Task 1 hardcodes retries: 60 while Task 2 uses the parameterized retries: "{{ max_retries }}". For consistency and maintainability, consider parameterizing the first task as well.

Apply this diff to align the retry configuration:

- name: Get route to console custom
  ansible.builtin.shell: |
    oc get route console-custom -n openshift-console
  register: _route_console_custom
  until: _route_console_custom is succeeded
-  retries: 60
+  retries: "{{ max_retries }}"
  delay: "{{ retry_delay }}"
  changed_when: false
📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 1c98532 and 4b45ac7.

📒 Files selected for processing (12)
  • ansible/roles/deploy-crc-cloud/tasks/console_route.yaml (1 hunks)
  • ansible/roles/deploy-crc-cloud/tasks/create_certificate_and_patch_secret.yaml (1 hunks)
  • ansible/roles/deploy-crc-cloud/tasks/dnsmasq.yaml (2 hunks)
  • ansible/roles/deploy-crc-cloud/tasks/login.yaml (1 hunks)
  • ansible/roles/deploy-crc-cloud/tasks/patch_api_server.yaml (1 hunks)
  • ansible/roles/deploy-crc-cloud/tasks/patch_default_route.yaml (1 hunks)
  • ansible/roles/deploy-crc-cloud/tasks/patch_ingress_config.yaml (1 hunks)
  • ansible/roles/deploy-crc-cloud/tasks/patch_pull_secret.yaml (1 hunks)
  • ansible/roles/deploy-crc-cloud/tasks/pubkey.yaml (1 hunks)
  • ansible/roles/deploy-crc-cloud/tasks/replace_default_ca.yaml (1 hunks)
  • ansible/roles/deploy-crc-cloud/tasks/set_credentials.yaml (1 hunks)
  • ansible/roles/deploy-crc-cloud/tasks/wait_cluster_become_healthy.yaml (1 hunks)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (3)
  • GitHub Check: build (macOS-latest, 1.20)
  • GitHub Check: build (ubuntu-latest, 1.20)
  • GitHub Check: build-and-push-image
🔇 Additional comments (15)
ansible/roles/deploy-crc-cloud/tasks/console_route.yaml (2)

16-16: Verify jsonpath extraction handles missing or empty host values.

The jsonpath expression '{.spec.host}' will succeed (exit code 0) even if the host field is empty or missing, returning an empty string. Since the task uses until: _console_spec_host is succeeded, it may consider the operation successful even when the route is not fully ready.

Consider adding a fact validation step or adjusting the condition to ensure the extracted host is not empty.

Please verify that downstream tasks properly validate the _console_spec_host.stdout value to ensure it is not empty before using it.
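
A minimal sketch of such a condition, assuming the task looks roughly like the one described above (the route name, namespace, jsonpath expression, and variable name are the ones quoted in this review; the rest is illustrative):

```yaml
# Sketch only: retries until oc succeeds AND the host field is non-empty.
- name: Get console route host, retrying until it is non-empty
  ansible.builtin.shell: |
    oc get route console-custom -n openshift-console -o jsonpath='{.spec.host}'
  register: _console_spec_host
  until: >-
    _console_spec_host is succeeded and
    _console_spec_host.stdout | trim | length > 0
  retries: "{{ max_retries }}"
  delay: "{{ retry_delay }}"
  changed_when: false
```

The extra length check keeps the loop running while the route exists but its host has not been populated yet.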


16-16: Variables retry_delay and max_retries are already properly defined.

Both variables are defined in ansible/roles/deploy-crc-cloud/defaults/main.yaml (max_retries: 20, retry_delay: 15) and will not cause runtime errors.

ansible/roles/deploy-crc-cloud/tasks/patch_ingress_config.yaml (1)

21-31: Well-structured retry pattern with proper idempotence guards.

The addition of register, retries, delay, until, and changed_when: false establishes a robust retry mechanism that prevents transient API server issues from causing deployment failures. The is succeeded test is the correct Ansible approach for checking task success.

ansible/roles/deploy-crc-cloud/tasks/create_certificate_and_patch_secret.yaml (1)

15-25: Consistent retry semantics appropriately applied to secret creation.

The retry mechanism is suitable for TLS secret creation operations, which may be sensitive to transient API server delays. The changed_when: false suppression is correct since this task regenerates the YAML from command output on each iteration.

ansible/roles/deploy-crc-cloud/tasks/patch_default_route.yaml (1)

3-14: Idempotent retry pattern correctly applied to route patch operation.

The addition of retry semantics with changed_when: false ensures the task handles transient failures gracefully without generating spurious change notifications.

ansible/roles/deploy-crc-cloud/tasks/set_credentials.yaml (1)

24-43: Consistent retry pattern applied to two-step credential workflow.

Both the generic secret creation and replacement operations properly use retry semantics with changed_when: false to ensure idempotent behavior. The use of --dry-run=client syntax (line 30) is correct and future-compatible.

ansible/roles/deploy-crc-cloud/tasks/pubkey.yaml (1)

21-30: Retry semantics correctly applied to SSH key injection via machineconfig patch.

The idempotent retry pattern is well-suited for SSH key patching operations, as Kubernetes API server responsiveness during initial cluster boot can be variable. The changed_when: false is appropriate since the patch command's idempotent behavior means repeated invocations don't necessitate change notifications.

ansible/roles/deploy-crc-cloud/tasks/wait_cluster_become_healthy.yaml (1)

5-23: Critical healthcheck fix: now detects both component unhealthiness and connection failures.

The expanded until condition (lines 11–13) correctly addresses the stated PR objective by checking for both unhealthy components ('False') and API connection errors ('was refused'). This prevents the wait loop from exiting prematurely when the API server is unreachable, which was causing downstream oc command failures.

Please confirm that 'was refused' is the correct error string that appears in oc get co output when the API server connection is refused. You may want to verify this against the actual oc CLI behavior or test output.
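
For illustration, a wait loop of this shape could look like the sketch below. The jsonpath query, the `2>&1` redirect, and the variable name `_cluster_health` are assumptions and may differ from the real task. The redirect matters because oc normally writes connection errors such as "The connection to the server ... was refused" to stderr, so a check against stdout only sees them if stderr is merged in.

```yaml
# Sketch only: the query and variable name are assumptions.
- name: Wait until all cluster operators report Available
  ansible.builtin.shell: |
    oc get co -o jsonpath='{range .items[*]}{.status.conditions[?(@.type=="Available")].status}{"\n"}{end}' 2>&1
  register: _cluster_health
  until: >-
    'False' not in _cluster_health.stdout and
    'was refused' not in _cluster_health.stdout
  retries: "{{ max_retries }}"
  delay: "{{ retry_delay }}"
  changed_when: false
  ignore_errors: true
```

Note that a purely negative check like this also passes on empty output, which is why an additional rc or non-empty test may be worth considering.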

ansible/roles/deploy-crc-cloud/tasks/patch_api_server.yaml (1)

3-12: Robust retry pattern for critical API server certificate configuration.

The idempotent retry semantics with changed_when: false is well-suited for API server patching, as this operation requires reliable API availability and may fail during cluster initialization. The namedCertificates patch will now gracefully recover from transient failures.

ansible/roles/deploy-crc-cloud/tasks/replace_default_ca.yaml (3)

34-44: Consistent retry pattern for client CA configmap creation.

The registration and retry logic with changed_when: false follows the established pattern and appropriately handles transient API failures during configmap creation.


46-55: Idempotent retry pattern for custom CA API server patch.

The retry semantics with changed_when: false ensures the client CA configuration is reliably applied without spurious change reporting.


57-67: Verify deprecated --dry-run flag syntax compatibility.

Line 62 uses --dry-run without an argument, which is the legacy syntax that has been superseded by --dry-run=client or --dry-run=server in modern Kubernetes/OpenShift versions. While the pattern (dry-run | replace) is idiomatic for create-or-replace operations, confirm that the target oc CLI version still supports this syntax to avoid future breakage.

Please verify the minimum OpenShift/oc version required by this playbook and confirm that --dry-run without an argument is still supported. Consider updating to --dry-run=client for forward compatibility:

  ansible.builtin.shell: >
    oc create configmap admin-kubeconfig-client-ca
    -n openshift-config
    --from-file=ca-bundle.crt={{ ca_name }}-ca.crt
-   --dry-run -o yaml | oc replace -f -
+   --dry-run=client -o yaml | oc replace -f -
ansible/roles/deploy-crc-cloud/tasks/patch_pull_secret.yaml (3)

4-4: Verify handling of undefined openshift_pull_secret variable.

The when condition uses openshift_pull_secret | length > 0, which raises an error if the variable is undefined. This is Ansible's default behavior (controlled by error_on_undefined_vars), so verify that the variable always has a value, for example an empty default, in your configuration.

Consider adding a definedness check if strict error handling is preferred:

when: openshift_pull_secret is defined and openshift_pull_secret | length > 0

Otherwise, confirm this pattern is acceptable in your environment.
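
An equivalent guard, shown here only as a sketch, is the default filter, which substitutes an empty string when the variable is undefined. The task body is a placeholder; only the variable name comes from the file under review.

```yaml
# Sketch only: the debug task stands in for the real patch command.
- name: Run only when a pull secret was provided
  ansible.builtin.debug:
    msg: "A pull secret is set and will be patched"
  when: openshift_pull_secret | default('') | length > 0
```

Both forms skip the task cleanly when the variable is missing or empty; which one reads better is a matter of taste.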


3-15: Good refactoring of retry semantics and task completion condition.

The changes systematically improve the task's robustness by:

  • Moving the when condition to the task level (standard Ansible practice)
  • Adding explicit retry logic with configurable retry count and delay
  • Using a strict success condition (until: _pull_secret_patch is succeeded) instead of the previous (implied) lenient check

This directly addresses the PR objective that the health check condition was passing too early, causing downstream oc command failures. The no_log: true remains appropriately positioned to protect secrets.


11-15: Retry logic is correctly configured with variables properly defined.

The retry pattern is well-structured and addresses the PR objective by ensuring the task only succeeds when the oc command actually succeeds (exit code 0). The until: _pull_secret_patch is succeeded condition correctly uses Ansible's succeeded test to validate task success.

Verification confirms:

  • max_retries and retry_delay are properly defined in defaults/main.yaml (max_retries: 20, retry_delay: 15)
  • The retry pattern is consistently used across 9+ task files in the role, demonstrating a well-established pattern
  • The changed_when: false placement is correct for preventing false "changed" reports during retries

No changes required.
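
For reference, the two defaults quoted above bound how long each retried task can wait. The snippet below restates them with the arithmetic spelled out; the comments are explanatory additions, not part of the file.

```yaml
# defaults/main.yaml (values as quoted in this review)
max_retries: 20   # retries per task when the until condition is not met
retry_delay: 15   # seconds to wait between attempts
# Worst case per task: about 20 x 15 s = 300 s of waiting between
# attempts, plus the run time of each oc invocation.
```

On slower VMs either value can be raised through the usual variable precedence, for example in group_vars or with extra vars.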

@danpawlik danpawlik force-pushed the add-retry-delay-until branch from 4b45ac7 to bcb97fb Compare December 12, 2025 09:38
During the check, the cluster may restart openshift-apiserver. The condition
can be satisfied at that moment, but in later steps the API would not be
accessible for executing other commands.
Wait properly for the cluster to be stable.

Signed-off-by: Daniel Pawlik <[email protected]>
@danpawlik danpawlik force-pushed the add-retry-delay-until branch from bcb97fb to a7f022f Compare December 12, 2025 09:48
@praveenkumar praveenkumar merged commit c92f180 into crc-org:main Dec 12, 2025
7 of 10 checks passed
@danpawlik danpawlik deleted the add-retry-delay-until branch December 12, 2025 11:04