Conversation


@danielpanzella danielpanzella commented Aug 29, 2025

Summary by CodeRabbit

  • New Features

    • Centralized, provider-aware load balancer controls (timeout, connection draining) with per-service enable/disable and provider-specific health-check annotations; added Azure ingress timeout/drain annotations and new LB health-check paths for services.
  • Chores

    • Replaced numeric probe ports with named ports, added startup probes, adjusted probe cadences and removed initial delays, increased termination grace periods, raised memory defaults for select components, added an HTTP shutdown grace config, and bumped chart versions.

… checks to better align with application behavior

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 1

♻️ Duplicate comments (1)
charts/operator-wandb/values.yaml (1)

1013-1051: anaconda2: port-name fix and startupProbe look good

Matches container port “http”; resolves prior mismatch.

🧹 Nitpick comments (5)
charts/operator-wandb/templates/global.yaml (1)

14-14: Parameterize HTTP shutdown grace; define global.httpShutdownGracePeriod

Add under global in charts/operator-wandb/values.yaml:

global:
  # HTTP shutdown grace period; should match pod terminationGracePeriodSeconds + preStop sleep
  httpShutdownGracePeriod: "15s"

Then update charts/operator-wandb/templates/global.yaml:

-  GORILLA_HTTP_SHUTDOWN_GRACE_PERIOD: "15s"
+  GORILLA_HTTP_SHUTDOWN_GRACE_PERIOD: {{ .Values.global.httpShutdownGracePeriod | quote }}

Run to verify consistency across charts:

rg -nP 'terminationGracePeriodSeconds|preStop|GORILLA_HTTP_SHUTDOWN_GRACE_PERIOD' -C2
charts/operator-wandb/values.yaml (4)

352-373: Startup probe cadence is very aggressive; consider a small timeout

periodSeconds: 1 with default timeoutSeconds: 1 can be chatty under load. Add timeoutSeconds: 2–5 to reduce false negatives during cold starts.

       startupProbe:
         httpGet:
           path: /ready
           port: api
-        periodSeconds: 1
+        timeoutSeconds: 2
+        periodSeconds: 1
         failureThreshold: 600

733-751: Add startupProbe.timeoutSeconds for consistency

Same rationale as api: a small timeout reduces flakiness.

       startupProbe:
         httpGet:
           path: /console/api/ready
           port: http
-        periodSeconds: 1
+        timeoutSeconds: 2
+        periodSeconds: 1

1441-1451: glue lacks a readinessProbe

Liveness and startup probes were added, but without a readinessProbe the pod receives traffic as soon as the startup probe succeeds. Consider adding a readinessProbe mirroring the liveness one.

       livenessProbe:
         httpGet:
           path: /healthz
           port: http
         periodSeconds: 1
         timeoutSeconds: 1
         successThreshold: 1
         failureThreshold: 3
+      readinessProbe:
+        httpGet:
+          path: /healthz
+          port: http
+        periodSeconds: 2
+        timeoutSeconds: 2
+        successThreshold: 1
+        failureThreshold: 3
       startupProbe:
         httpGet:
           path: /healthz
           port: http
         failureThreshold: 60
         periodSeconds: 1

1650-1667: Parquet probes: named ports and thresholds look sane

Optional: set readiness timeoutSeconds (e.g., 5) to match liveness behavior.

       readinessProbe:
         httpGet:
           path: /ready
           port: parquet
         periodSeconds: 5
+        timeoutSeconds: 5
         successThreshold: 2
         failureThreshold: 2
📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between be039e9 and 65f3d22.

⛔ Files ignored due to path filters (12)
  • test-configs/operator-wandb/__snapshots__/anaconda2.snap is excluded by !**/*.snap
  • test-configs/operator-wandb/__snapshots__/default.snap is excluded by !**/*.snap
  • test-configs/operator-wandb/__snapshots__/executor-enabled.snap is excluded by !**/*.snap
  • test-configs/operator-wandb/__snapshots__/fmb.snap is excluded by !**/*.snap
  • test-configs/operator-wandb/__snapshots__/runs-v2-bufstream.snap is excluded by !**/*.snap
  • test-configs/operator-wandb/__snapshots__/separate-pods-filemeta.snap is excluded by !**/*.snap
  • test-configs/operator-wandb/__snapshots__/separate-pods.snap is excluded by !**/*.snap
  • test-configs/operator-wandb/__snapshots__/snap-enable-backfill.snap is excluded by !**/*.snap
  • test-configs/operator-wandb/__snapshots__/url-encoded-password.snap is excluded by !**/*.snap
  • test-configs/operator-wandb/__snapshots__/user-defined-secrets.snap is excluded by !**/*.snap
  • test-configs/operator-wandb/__snapshots__/weave-trace-with-worker.snap is excluded by !**/*.snap
  • test-configs/operator-wandb/__snapshots__/weave-trace.snap is excluded by !**/*.snap
📒 Files selected for processing (3)
  • charts/operator-wandb/Chart.yaml (1 hunks)
  • charts/operator-wandb/templates/global.yaml (1 hunks)
  • charts/operator-wandb/values.yaml (18 hunks)
✅ Files skipped from review due to trivial changes (1)
  • charts/operator-wandb/Chart.yaml
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (30)
  • GitHub Check: Test Chart (v1.30.10, weave-trace)
  • GitHub Check: Test Chart (v1.30.10, url-encoded-password)
  • GitHub Check: Test Chart (v1.30.10, weave-trace-with-worker)
  • GitHub Check: Test Chart (v1.30.10, user-defined-secrets)
  • GitHub Check: Test Chart (v1.30.10, separate-pods-filemeta)
  • GitHub Check: Test Chart (v1.30.10, default)
  • GitHub Check: Test Chart (v1.30.10, separate-pods)
  • GitHub Check: Test Chart (v1.30.10, fmb)
  • GitHub Check: Test Chart (v1.31.6, weave-trace-with-worker)
  • GitHub Check: Test Chart (v1.30.10, anaconda2)
  • GitHub Check: Test Chart (v1.31.6, weave-trace)
  • GitHub Check: Test Chart (v1.31.6, url-encoded-password)
  • GitHub Check: Test Chart (v1.31.6, user-defined-secrets)
  • GitHub Check: Test Chart (v1.30.10, executor-enabled)
  • GitHub Check: Test Chart (v1.31.6, fmb)
  • GitHub Check: Test Chart (v1.31.6, anaconda2)
  • GitHub Check: Test Chart (v1.31.6, separate-pods-filemeta)
  • GitHub Check: Test Chart (v1.31.6, executor-enabled)
  • GitHub Check: Test Chart (v1.31.6, separate-pods)
  • GitHub Check: Test Chart (v1.32.2, weave-trace)
  • GitHub Check: Test Chart (v1.32.2, user-defined-secrets)
  • GitHub Check: Test Chart (v1.31.6, default)
  • GitHub Check: Test Chart (v1.32.2, weave-trace-with-worker)
  • GitHub Check: Test Chart (v1.32.2, anaconda2)
  • GitHub Check: Test Chart (v1.32.2, separate-pods-filemeta)
  • GitHub Check: Test Chart (v1.32.2, executor-enabled)
  • GitHub Check: Test Chart (v1.32.2, url-encoded-password)
  • GitHub Check: Test Chart (v1.32.2, separate-pods)
  • GitHub Check: Test Chart (v1.32.2, fmb)
  • GitHub Check: Test Chart (v1.32.2, default)
🔇 Additional comments (9)
charts/operator-wandb/values.yaml (9)

289-292: LB health check path for api looks good

Path set to /ready matches readiness probe and provider-agnostic schema.
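For reference, a minimal sketch of the per-service shape this refers to, assuming the loadBalancer.healthCheck key layout consumed by wandb-base (exact nesting in values.yaml may differ):

api:
  loadBalancer:
    healthCheck:
      path: /ready   # same path as the api readinessProbe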


706-709: Console LB health check path set appropriately

Aligns with console readiness endpoint.


1296-1299: frontend LB health check path set; good alignment

Matches service’s readiness endpoint.


1834-1834: Weave service: LB disabled explicitly

Explicit disable is clearer than relying on provider defaults.
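A minimal sketch of the explicit disable, assuming the same per-service loadBalancer block shape as the other services:

weave:
  loadBalancer:
    enabled: false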


1988-1988: weave-trace liveness failureThreshold: 10

Reasonable for transient spikes; no action.


1999-2000: weave-trace startupProbe: 5s/24 cycles (~2 min)

Good balance between boot time and failure responsiveness.
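Spelled out, that cadence looks like the following (the health path and port name are illustrative, not taken from the chart):

      startupProbe:
        httpGet:
          path: /health
          port: http
        periodSeconds: 5      # probe every 5 seconds
        failureThreshold: 24  # 5s x 24 = 120s (~2 min) before the container is restarted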


2010-2012: weave-trace LB health check path set

Matches app health endpoint; aligns with base LB schema.


553-571: No action required: shutdown budget covered
terminationGracePeriodSeconds defaults to 75 s in wandb-base (charts/wandb-base/values.yaml), exceeding preStop 25 s + GORILLA_HTTP_SHUTDOWN_GRACE_PERIOD 15 s (40 s total), leaving ~35 s for load-balancer drain.
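As a sketch of how those numbers compose on the pod spec (figures taken from this comment; the actual wandb-base template wiring may differ):

terminationGracePeriodSeconds: 75
lifecycle:
  preStop:
    exec:
      command: ["sleep", "25"]   # hold SIGTERM while the LB deregisters the endpoint
# plus GORILLA_HTTP_SHUTDOWN_GRACE_PERIOD: "15s" for in-flight requests,
# leaving 75 - (25 + 15) = 35s of headroom for load-balancer connection draining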


1928-1931: Validate cluster node memory capacity
Ensure your Kubernetes nodes have at least 48 GiB of allocatable memory to accommodate the limit and can absorb a 32 GiB request per pod without triggering OOM kills or scheduling failures.
Location: charts/operator-wandb/values.yaml:1928-1931
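For context, the raised defaults correspond to a resources block along these lines (a sketch; see values.yaml:1928-1931 for the exact keys and values):

resources:
  requests:
    memory: 32Gi
  limits:
    memory: 48Gi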

Comment on lines +199 to +204
  loadBalancer:
    timeoutSeconds: 120
    connectionDraining:
      enabled: true
      drainingTimeoutSeconds: 60

💡 Verification agent

🧩 Analysis chain

Key name mismatch: timeoutSeconds/drainingTimeoutSeconds won’t be consumed by base templates

Base expects timeoutSec/drainingTimeoutSec. Current keys will be ignored, breaking LB timeouts/connection draining.

Apply:

   loadBalancer:
-    timeoutSeconds: 120
+    timeoutSec: 120
     connectionDraining:
       enabled: true
-      drainingTimeoutSeconds: 60
+      drainingTimeoutSec: 60

Then verify no remaining mismatches:


🏁 Script executed:

#!/bin/bash
rg -nP '\btimeoutSeconds\b|drainingTimeoutSeconds\b'
rg -nP 'Values\.global\.loadBalancer\.(timeoutSec|drainingTimeoutSec)' -C2

Length of output: 24793


Update loadBalancer keys and template references to use timeoutSec/drainingTimeoutSec

  • In charts/operator-wandb/values.yaml (lines 199–204), rename
      timeoutSeconds: 120       → timeoutSec: 120  
      connectionDraining:  
        drainingTimeoutSeconds: 60  → drainingTimeoutSec: 60
  • In charts/operator-wandb/templates/ingress.yaml (lines 61, 63, 121, 123), change
      .Values.global.loadBalancer.timeoutSeconds → .Values.global.loadBalancer.timeoutSec
      .Values.global.loadBalancer.connectionDraining.drainingTimeoutSeconds → .Values.global.loadBalancer.connectionDraining.drainingTimeoutSec
  • In charts/wandb-base/templates/service.yaml (lines 18, 22, 45, 50), update
      $loadBalancer.healthCheck.timeoutSeconds → $loadBalancer.healthCheck.timeoutSec
      $loadBalancer.connectionDraining.drainingTimeoutSeconds → $loadBalancer.connectionDraining.drainingTimeoutSec

After applying, re-run your ripgrep to confirm no remaining timeoutSeconds or drainingTimeoutSeconds references.
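As a sketch of the corrected wiring in charts/operator-wandb/templates/ingress.yaml after the rename (the Azure AGIC annotation keys below are illustrative; keep whatever annotation names the chart already emits):

{{- if .Values.global.loadBalancer.timeoutSec }}
appgw.ingress.kubernetes.io/request-timeout: {{ .Values.global.loadBalancer.timeoutSec | quote }}
{{- end }}
{{- if .Values.global.loadBalancer.connectionDraining.enabled }}
appgw.ingress.kubernetes.io/connection-draining: "true"
appgw.ingress.kubernetes.io/connection-draining-timeout: {{ .Values.global.loadBalancer.connectionDraining.drainingTimeoutSec | quote }}
{{- end }}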

🤖 Prompt for AI Agents
In charts/operator-wandb/values.yaml lines 199–204, rename the loadBalancer keys
timeoutSeconds → timeoutSec and connectionDraining.drainingTimeoutSeconds →
connectionDraining.drainingTimeoutSec; then update the template references in
charts/operator-wandb/templates/ingress.yaml (around lines 61, 63, 121, 123)
replacing .Values.global.loadBalancer.timeoutSeconds →
.Values.global.loadBalancer.timeoutSec and
.Values.global.loadBalancer.connectionDraining.drainingTimeoutSeconds →
.Values.global.loadBalancer.connectionDraining.drainingTimeoutSec; also update
charts/wandb-base/templates/service.yaml (around lines 18, 22, 45, 50) replacing
$loadBalancer.healthCheck.timeoutSeconds → $loadBalancer.healthCheck.timeoutSec
and $loadBalancer.connectionDraining.drainingTimeoutSeconds →
$loadBalancer.connectionDraining.drainingTimeoutSec; after making these changes
run ripgrep to verify there are no remaining timeoutSeconds or
drainingTimeoutSeconds occurrences.
