fix: Update liveness/readiness/startup probes and loadbalancer healthchecks to better align with application behavior #478
base: main
Conversation
Actionable comments posted: 1
♻️ Duplicate comments (1)
charts/operator-wandb/values.yaml (1)
1013-1051: anaconda2: port-name fix and startupProbe look good
Matches container port “http”; resolves the prior mismatch.
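For context, a minimal sketch of the named-port pattern this refers to (port number and path are illustrative, not taken from the chart):

```yaml
containers:
  - name: anaconda2
    ports:
      - name: http          # the named container port
        containerPort: 8082 # illustrative number
    startupProbe:
      httpGet:
        path: /healthz      # illustrative path
        port: http          # must match the containerPort name above
```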
🧹 Nitpick comments (5)
charts/operator-wandb/templates/global.yaml (1)
14-14: Parameterize HTTP shutdown grace; define global.httpShutdownGracePeriod
Add under global in charts/operator-wandb/values.yaml:
```yaml
global:
  # HTTP shutdown grace period; should match pod terminationGracePeriodSeconds + preStop sleep
  httpShutdownGracePeriod: "15s"
```

Then update charts/operator-wandb/templates/global.yaml:
```diff
- GORILLA_HTTP_SHUTDOWN_GRACE_PERIOD: "15s"
+ GORILLA_HTTP_SHUTDOWN_GRACE_PERIOD: {{ .Values.global.httpShutdownGracePeriod | quote }}
```

Run to verify consistency across charts:
```bash
rg -nP 'terminationGracePeriodSeconds|preStop|GORILLA_HTTP_SHUTDOWN_GRACE_PERIOD' -C2
```

charts/operator-wandb/values.yaml (4)
352-373: Startup probe cadence is very aggressive; consider a small timeout
periodSeconds: 1 with the default timeoutSeconds: 1 can be chatty under load. Add timeoutSeconds: 2–5 to reduce false negatives during cold starts.
```diff
 startupProbe:
   httpGet:
     path: /ready
     port: api
-  periodSeconds: 1
+  timeoutSeconds: 2
+  periodSeconds: 1
   failureThreshold: 600
```
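For reference, how the suggested numbers combine into a startup budget (a sketch; only the probe timing fields are shown):

```yaml
startupProbe:
  periodSeconds: 1      # probe every second
  timeoutSeconds: 2     # 2s per attempt instead of the 1s default
  failureThreshold: 600 # 600 x 1s = 10 minutes of startup budget
```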
733-751: Add startupProbe.timeoutSeconds for consistency
Same rationale as api: a small timeout reduces flakiness.
```diff
 startupProbe:
   httpGet:
     path: /console/api/ready
     port: http
-  periodSeconds: 1
+  timeoutSeconds: 2
+  periodSeconds: 1
```
1441-1451: glue lacks a readinessProbe
You added liveness and startup probes, but with no readiness probe traffic may arrive as soon as startup completes. Consider adding a readinessProbe mirroring liveness.
```diff
 livenessProbe:
   httpGet:
     path: /healthz
     port: http
   periodSeconds: 1
   timeoutSeconds: 1
   successThreshold: 1
   failureThreshold: 3
+readinessProbe:
+  httpGet:
+    path: /healthz
+    port: http
+  periodSeconds: 2
+  timeoutSeconds: 2
+  successThreshold: 1
+  failureThreshold: 3
 startupProbe:
   httpGet:
     path: /healthz
     port: http
   failureThreshold: 60
   periodSeconds: 1
```
1650-1667: Parquet probes: named ports and thresholds look sane
Optional: set readiness timeoutSeconds (e.g., 5) to match liveness behavior.
```diff
 readinessProbe:
   httpGet:
     path: /ready
     port: parquet
   periodSeconds: 5
+  timeoutSeconds: 5
   successThreshold: 2
   failureThreshold: 2
```
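Worth noting how these thresholds play out in time (rough estimates; only the timing fields shown):

```yaml
readinessProbe:
  periodSeconds: 5     # probe every 5s
  successThreshold: 2  # 2 consecutive passes (~10s) before the pod is Ready
  failureThreshold: 2  # 2 consecutive failures (~10s) before it is pulled from endpoints
```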
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
⛔ Files ignored due to path filters (12)
- test-configs/operator-wandb/__snapshots__/anaconda2.snap is excluded by !**/*.snap
- test-configs/operator-wandb/__snapshots__/default.snap is excluded by !**/*.snap
- test-configs/operator-wandb/__snapshots__/executor-enabled.snap is excluded by !**/*.snap
- test-configs/operator-wandb/__snapshots__/fmb.snap is excluded by !**/*.snap
- test-configs/operator-wandb/__snapshots__/runs-v2-bufstream.snap is excluded by !**/*.snap
- test-configs/operator-wandb/__snapshots__/separate-pods-filemeta.snap is excluded by !**/*.snap
- test-configs/operator-wandb/__snapshots__/separate-pods.snap is excluded by !**/*.snap
- test-configs/operator-wandb/__snapshots__/snap-enable-backfill.snap is excluded by !**/*.snap
- test-configs/operator-wandb/__snapshots__/url-encoded-password.snap is excluded by !**/*.snap
- test-configs/operator-wandb/__snapshots__/user-defined-secrets.snap is excluded by !**/*.snap
- test-configs/operator-wandb/__snapshots__/weave-trace-with-worker.snap is excluded by !**/*.snap
- test-configs/operator-wandb/__snapshots__/weave-trace.snap is excluded by !**/*.snap
📒 Files selected for processing (3)
- charts/operator-wandb/Chart.yaml (1 hunks)
- charts/operator-wandb/templates/global.yaml (1 hunks)
- charts/operator-wandb/values.yaml (18 hunks)
✅ Files skipped from review due to trivial changes (1)
- charts/operator-wandb/Chart.yaml
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (30)
- GitHub Check: Test Chart (v1.30.10, weave-trace)
- GitHub Check: Test Chart (v1.30.10, url-encoded-password)
- GitHub Check: Test Chart (v1.30.10, weave-trace-with-worker)
- GitHub Check: Test Chart (v1.30.10, user-defined-secrets)
- GitHub Check: Test Chart (v1.30.10, separate-pods-filemeta)
- GitHub Check: Test Chart (v1.30.10, default)
- GitHub Check: Test Chart (v1.30.10, separate-pods)
- GitHub Check: Test Chart (v1.30.10, fmb)
- GitHub Check: Test Chart (v1.31.6, weave-trace-with-worker)
- GitHub Check: Test Chart (v1.30.10, anaconda2)
- GitHub Check: Test Chart (v1.31.6, weave-trace)
- GitHub Check: Test Chart (v1.31.6, url-encoded-password)
- GitHub Check: Test Chart (v1.31.6, user-defined-secrets)
- GitHub Check: Test Chart (v1.30.10, executor-enabled)
- GitHub Check: Test Chart (v1.31.6, fmb)
- GitHub Check: Test Chart (v1.31.6, anaconda2)
- GitHub Check: Test Chart (v1.31.6, separate-pods-filemeta)
- GitHub Check: Test Chart (v1.31.6, executor-enabled)
- GitHub Check: Test Chart (v1.31.6, separate-pods)
- GitHub Check: Test Chart (v1.32.2, weave-trace)
- GitHub Check: Test Chart (v1.32.2, user-defined-secrets)
- GitHub Check: Test Chart (v1.31.6, default)
- GitHub Check: Test Chart (v1.32.2, weave-trace-with-worker)
- GitHub Check: Test Chart (v1.32.2, anaconda2)
- GitHub Check: Test Chart (v1.32.2, separate-pods-filemeta)
- GitHub Check: Test Chart (v1.32.2, executor-enabled)
- GitHub Check: Test Chart (v1.32.2, url-encoded-password)
- GitHub Check: Test Chart (v1.32.2, separate-pods)
- GitHub Check: Test Chart (v1.32.2, fmb)
- GitHub Check: Test Chart (v1.32.2, default)
🔇 Additional comments (9)
charts/operator-wandb/values.yaml (9)
289-292: LB health check path for api looks good
Path set to /ready matches the readiness probe and the provider-agnostic schema.
706-709: Console LB health check path set appropriately
Aligns with the console readiness endpoint.
1296-1299: frontend LB health check path set; good alignment
Matches the service’s readiness endpoint.
1834-1834: Weave service: LB disabled explicitly
Explicit disable is clearer than relying on provider defaults.
1988-1988: weave-trace liveness failureThreshold: 10
Reasonable for transient spikes; no action.
1999-2000: weave-trace startupProbe: 5s/24 cycles (~2 min)
Good balance between boot time and failure responsiveness.
2010-2012: weave-trace LB health check path set
Matches the app health endpoint; aligns with the base LB schema.
553-571: No action required: shutdown budget covered
terminationGracePeriodSeconds defaults to 75 s in wandb-base (charts/wandb-base/values.yaml), exceeding preStop 25 s + GORILLA_HTTP_SHUTDOWN_GRACE_PERIOD 15 s (40 s total), leaving ~35 s for load-balancer drain.
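A sketch of the timeline those numbers imply (container name is illustrative; values are as cited above):

```yaml
spec:
  terminationGracePeriodSeconds: 75   # wandb-base default
  containers:
    - name: api                       # illustrative name
      lifecycle:
        preStop:
          exec:
            command: ["sleep", "25"]  # preStop sleep: 25s
# 75s grace - 25s preStop - 15s GORILLA_HTTP_SHUTDOWN_GRACE_PERIOD ≈ 35s of drain headroom
```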
1928-1931: Validate cluster node memory capacity
Ensure your Kubernetes nodes have at least 48 GiB of allocatable memory to cover the limit, and that the scheduler can place a 32 GiB request per pod, to avoid OOM kills and scheduling failures.
Location: charts/operator-wandb/values.yaml:1928-1931
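The shape of the block being flagged, assuming standard Kubernetes resources syntax (values as cited in the comment):

```yaml
resources:
  requests:
    memory: 32Gi   # the scheduler must find a node with 32Gi free
  limits:
    memory: 48Gi   # the container is OOM-killed beyond this
```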
```yaml
loadBalancer:
  timeoutSeconds: 120
  connectionDraining:
    enabled: true
    drainingTimeoutSeconds: 60
```
💡 Verification agent
🧩 Analysis chain
Key name mismatch: timeoutSeconds/drainingTimeoutSeconds won’t be consumed by base templates
Base expects timeoutSec/drainingTimeoutSec. Current keys will be ignored, breaking LB timeouts/connection draining.
Apply:
```diff
 loadBalancer:
-  timeoutSeconds: 120
+  timeoutSec: 120
   connectionDraining:
     enabled: true
-    drainingTimeoutSeconds: 60
+    drainingTimeoutSec: 60
```

Then verify no remaining mismatches:
🏁 Script executed:
```bash
#!/bin/bash
rg -nP '\btimeoutSeconds\b|drainingTimeoutSeconds\b'
rg -nP 'Values\.global\.loadBalancer\.(timeoutSec|drainingTimeoutSec)' -C2
```
rg -nP 'Values\.global\.loadBalancer\.(timeoutSec|drainingTimeoutSec)' -C2Length of output: 24793
Update loadBalancer keys and template references to use timeoutSec/drainingTimeoutSec
- In charts/operator-wandb/values.yaml (lines 199–204), rename
  timeoutSeconds: 120 → timeoutSec: 120
  connectionDraining:
    drainingTimeoutSeconds: 60 → drainingTimeoutSec: 60
- In charts/operator-wandb/templates/ingress.yaml (lines 61, 63, 121, 123), change
  .Values.global.loadBalancer.timeoutSeconds → .Values.global.loadBalancer.timeoutSec
  .Values.global.loadBalancer.connectionDraining.drainingTimeoutSeconds → .Values.global.loadBalancer.connectionDraining.drainingTimeoutSec
- In charts/wandb-base/templates/service.yaml (lines 18, 22, 45, 50), update
  $loadBalancer.healthCheck.timeoutSeconds → $loadBalancer.healthCheck.timeoutSec
  $loadBalancer.connectionDraining.drainingTimeoutSeconds → $loadBalancer.connectionDraining.drainingTimeoutSec
After applying, re-run your ripgrep to confirm no remaining timeoutSeconds or drainingTimeoutSeconds references.
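For background on why the Sec spelling matters: these key names appear to mirror GKE’s BackendConfig CRD, where the fields are spec.timeoutSec and spec.connectionDraining.drainingTimeoutSec. A sketch of such a resource, assuming the base templates target that API (metadata name is illustrative):

```yaml
apiVersion: cloud.google.com/v1
kind: BackendConfig
metadata:
  name: wandb-backend-config  # illustrative
spec:
  timeoutSec: 120             # backend service timeout
  connectionDraining:
    drainingTimeoutSec: 60    # connection draining window
```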
🤖 Prompt for AI Agents
In charts/operator-wandb/values.yaml lines 199–204, rename the loadBalancer keys
timeoutSeconds → timeoutSec and connectionDraining.drainingTimeoutSeconds →
connectionDraining.drainingTimeoutSec; then update the template references in
charts/operator-wandb/templates/ingress.yaml (around lines 61, 63, 121, 123)
replacing .Values.global.loadBalancer.timeoutSeconds →
.Values.global.loadBalancer.timeoutSec and
.Values.global.loadBalancer.connectionDraining.drainingTimeoutSeconds →
.Values.global.loadBalancer.connectionDraining.drainingTimeoutSec; also update
charts/wandb-base/templates/service.yaml (around lines 18, 22, 45, 50) replacing
$loadBalancer.healthCheck.timeoutSeconds → $loadBalancer.healthCheck.timeoutSec
and $loadBalancer.connectionDraining.drainingTimeoutSeconds →
$loadBalancer.connectionDraining.drainingTimeoutSec; after making these changes
run ripgrep to verify there are no remaining timeoutSeconds or
drainingTimeoutSeconds occurrences.
Summary by CodeRabbit
New Features
Chores