Conversation


@danielpanzella danielpanzella commented Aug 29, 2025

Summary by CodeRabbit

  • New Features

    • Centralized, provider-aware load balancer controls (timeout, connection draining) with per-service enable/disable and provider-specific health-check annotations; added Azure ingress timeout/drain annotations and new LB health-check paths for services.
  • Chores

    • Replaced numeric probe ports with named ports, added startup probes, adjusted probe cadences and removed initial delays, increased termination grace periods, raised memory defaults for select components, added an HTTP shutdown grace config, and bumped chart versions.

… checks to better align with application behavior

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 1

♻️ Duplicate comments (1)
charts/operator-wandb/values.yaml (1)

1013-1051: anaconda2: port-name fix and startupProbe look good

Matches container port “http”; resolves prior mismatch.

🧹 Nitpick comments (5)
charts/operator-wandb/templates/global.yaml (1)

14-14: Parameterize HTTP shutdown grace; define global.httpShutdownGracePeriod

Add under global in charts/operator-wandb/values.yaml:

global:
  # HTTP shutdown grace period; should match pod terminationGracePeriodSeconds + preStop sleep
  httpShutdownGracePeriod: "15s"

Then update charts/operator-wandb/templates/global.yaml:

-  GORILLA_HTTP_SHUTDOWN_GRACE_PERIOD: "15s"
+  GORILLA_HTTP_SHUTDOWN_GRACE_PERIOD: {{ .Values.global.httpShutdownGracePeriod | quote }}

Run to verify consistency across charts:

rg -nP 'terminationGracePeriodSeconds|preStop|GORILLA_HTTP_SHUTDOWN_GRACE_PERIOD' -C2
charts/operator-wandb/values.yaml (4)

352-373: Startup probe cadence is very aggressive; consider a small timeout

periodSeconds: 1 with default timeoutSeconds: 1 can be chatty under load. Add timeoutSeconds: 2–5 to reduce false negatives during cold starts.

       startupProbe:
         httpGet:
           path: /ready
           port: api
-        periodSeconds: 1
+        timeoutSeconds: 2
+        periodSeconds: 1
         failureThreshold: 600

733-751: Add startupProbe.timeoutSeconds for consistency

Same rationale as api: a small timeout reduces flakiness.

       startupProbe:
         httpGet:
           path: /console/api/ready
           port: http
-        periodSeconds: 1
+        timeoutSeconds: 2
+        periodSeconds: 1

1441-1451: glue lacks a readinessProbe

Liveness and startup probes were added, but without a readinessProbe the pod receives traffic as soon as the startup probe succeeds. Consider adding a readinessProbe mirroring the liveness one.

       livenessProbe:
         httpGet:
           path: /healthz
           port: http
         periodSeconds: 1
         timeoutSeconds: 1
         successThreshold: 1
         failureThreshold: 3
+      readinessProbe:
+        httpGet:
+          path: /healthz
+          port: http
+        periodSeconds: 2
+        timeoutSeconds: 2
+        successThreshold: 1
+        failureThreshold: 3
       startupProbe:
         httpGet:
           path: /healthz
           port: http
         failureThreshold: 60
         periodSeconds: 1

1650-1667: Parquet probes: named ports and thresholds look sane

Optional: set readiness timeoutSeconds (e.g., 5) to match liveness behavior.

       readinessProbe:
         httpGet:
           path: /ready
           port: parquet
         periodSeconds: 5
+        timeoutSeconds: 5
         successThreshold: 2
         failureThreshold: 2
📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between be039e9 and 65f3d22.

⛔ Files ignored due to path filters (12)
  • test-configs/operator-wandb/__snapshots__/anaconda2.snap is excluded by !**/*.snap
  • test-configs/operator-wandb/__snapshots__/default.snap is excluded by !**/*.snap
  • test-configs/operator-wandb/__snapshots__/executor-enabled.snap is excluded by !**/*.snap
  • test-configs/operator-wandb/__snapshots__/fmb.snap is excluded by !**/*.snap
  • test-configs/operator-wandb/__snapshots__/runs-v2-bufstream.snap is excluded by !**/*.snap
  • test-configs/operator-wandb/__snapshots__/separate-pods-filemeta.snap is excluded by !**/*.snap
  • test-configs/operator-wandb/__snapshots__/separate-pods.snap is excluded by !**/*.snap
  • test-configs/operator-wandb/__snapshots__/snap-enable-backfill.snap is excluded by !**/*.snap
  • test-configs/operator-wandb/__snapshots__/url-encoded-password.snap is excluded by !**/*.snap
  • test-configs/operator-wandb/__snapshots__/user-defined-secrets.snap is excluded by !**/*.snap
  • test-configs/operator-wandb/__snapshots__/weave-trace-with-worker.snap is excluded by !**/*.snap
  • test-configs/operator-wandb/__snapshots__/weave-trace.snap is excluded by !**/*.snap
📒 Files selected for processing (3)
  • charts/operator-wandb/Chart.yaml (1 hunks)
  • charts/operator-wandb/templates/global.yaml (1 hunks)
  • charts/operator-wandb/values.yaml (18 hunks)
✅ Files skipped from review due to trivial changes (1)
  • charts/operator-wandb/Chart.yaml
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (30)
  • GitHub Check: Test Chart (v1.30.10, weave-trace)
  • GitHub Check: Test Chart (v1.30.10, url-encoded-password)
  • GitHub Check: Test Chart (v1.30.10, weave-trace-with-worker)
  • GitHub Check: Test Chart (v1.30.10, user-defined-secrets)
  • GitHub Check: Test Chart (v1.30.10, separate-pods-filemeta)
  • GitHub Check: Test Chart (v1.30.10, default)
  • GitHub Check: Test Chart (v1.30.10, separate-pods)
  • GitHub Check: Test Chart (v1.30.10, fmb)
  • GitHub Check: Test Chart (v1.31.6, weave-trace-with-worker)
  • GitHub Check: Test Chart (v1.30.10, anaconda2)
  • GitHub Check: Test Chart (v1.31.6, weave-trace)
  • GitHub Check: Test Chart (v1.31.6, url-encoded-password)
  • GitHub Check: Test Chart (v1.31.6, user-defined-secrets)
  • GitHub Check: Test Chart (v1.30.10, executor-enabled)
  • GitHub Check: Test Chart (v1.31.6, fmb)
  • GitHub Check: Test Chart (v1.31.6, anaconda2)
  • GitHub Check: Test Chart (v1.31.6, separate-pods-filemeta)
  • GitHub Check: Test Chart (v1.31.6, executor-enabled)
  • GitHub Check: Test Chart (v1.31.6, separate-pods)
  • GitHub Check: Test Chart (v1.32.2, weave-trace)
  • GitHub Check: Test Chart (v1.32.2, user-defined-secrets)
  • GitHub Check: Test Chart (v1.31.6, default)
  • GitHub Check: Test Chart (v1.32.2, weave-trace-with-worker)
  • GitHub Check: Test Chart (v1.32.2, anaconda2)
  • GitHub Check: Test Chart (v1.32.2, separate-pods-filemeta)
  • GitHub Check: Test Chart (v1.32.2, executor-enabled)
  • GitHub Check: Test Chart (v1.32.2, url-encoded-password)
  • GitHub Check: Test Chart (v1.32.2, separate-pods)
  • GitHub Check: Test Chart (v1.32.2, fmb)
  • GitHub Check: Test Chart (v1.32.2, default)
🔇 Additional comments (9)
charts/operator-wandb/values.yaml (9)

289-292: LB health check path for api looks good

Path set to /ready matches readiness probe and provider-agnostic schema.
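For reference, a minimal sketch of the per-service shape this refers to, assuming the loadBalancer.healthCheck key layout consumed by wandb-base (exact nesting in values.yaml may differ):

api:
  loadBalancer:
    healthCheck:
      path: /ready   # same path as the api readinessProbe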


706-709: Console LB health check path set appropriately

Aligns with console readiness endpoint.


1296-1299: frontend LB health check path set; good alignment

Matches service’s readiness endpoint.


1834-1834: Weave service: LB disabled explicitly

Explicit disable is clearer than relying on provider defaults.
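A minimal sketch of the explicit disable, assuming the same per-service loadBalancer block shape as the other services:

weave:
  loadBalancer:
    enabled: false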


1988-1988: weave-trace liveness failureThreshold: 10

Reasonable for transient spikes; no action.


1999-2000: weave-trace startupProbe: 5s/24 cycles (~2 min)

Good balance between boot time and failure responsiveness.
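Spelled out, that cadence looks like the following (the health path and port name are illustrative, not taken from the chart):

      startupProbe:
        httpGet:
          path: /health
          port: http
        periodSeconds: 5      # probe every 5 seconds
        failureThreshold: 24  # 5s x 24 = 120s (~2 min) before the container is restarted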


2010-2012: weave-trace LB health check path set

Matches app health endpoint; aligns with base LB schema.


553-571: No action required: shutdown budget covered
terminationGracePeriodSeconds defaults to 75 s in wandb-base (charts/wandb-base/values.yaml), exceeding preStop 25 s + GORILLA_HTTP_SHUTDOWN_GRACE_PERIOD 15 s (40 s total), leaving ~35 s for load-balancer drain.
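As a sketch of how those numbers compose on the pod spec (figures taken from this comment; the actual wandb-base template wiring may differ):

terminationGracePeriodSeconds: 75
lifecycle:
  preStop:
    exec:
      command: ["sleep", "25"]   # hold SIGTERM while the LB deregisters the endpoint
# plus GORILLA_HTTP_SHUTDOWN_GRACE_PERIOD: "15s" for in-flight requests,
# leaving 75 - (25 + 15) = 35s of headroom for load-balancer connection draining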


1928-1931: Validate cluster node memory capacity
Ensure your Kubernetes nodes have at least 48 GiB of allocatable memory to accommodate the limit and can absorb a 32 GiB request per pod without triggering OOM kills or scheduling failures.
Location: charts/operator-wandb/values.yaml:1928-1931
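For context, the raised defaults correspond to a resources block along these lines (a sketch; see values.yaml:1928-1931 for the exact keys and values):

resources:
  requests:
    memory: 32Gi
  limits:
    memory: 48Gi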

Comment on lines +199 to +204
  loadBalancer:
    timeoutSeconds: 120
    connectionDraining:
      enabled: true
      drainingTimeoutSeconds: 60

💡 Verification agent

🧩 Analysis chain

Key name mismatch: timeoutSeconds/drainingTimeoutSeconds won’t be consumed by base templates

Base expects timeoutSec/drainingTimeoutSec. Current keys will be ignored, breaking LB timeouts/connection draining.

Apply:

   loadBalancer:
-    timeoutSeconds: 120
+    timeoutSec: 120
     connectionDraining:
       enabled: true
-      drainingTimeoutSeconds: 60
+      drainingTimeoutSec: 60

Then verify no remaining mismatches:


🏁 Script executed:

#!/bin/bash
rg -nP '\btimeoutSeconds\b|drainingTimeoutSeconds\b'
rg -nP 'Values\.global\.loadBalancer\.(timeoutSec|drainingTimeoutSec)' -C2

Length of output: 24793


Update loadBalancer keys and template references to use timeoutSec/drainingTimeoutSec

  • In charts/operator-wandb/values.yaml (lines 199–204), rename
      timeoutSeconds: 120       → timeoutSec: 120  
      connectionDraining:  
        drainingTimeoutSeconds: 60  → drainingTimeoutSec: 60
  • In charts/operator-wandb/templates/ingress.yaml (lines 61, 63, 121, 123), change
      .Values.global.loadBalancer.timeoutSeconds → .Values.global.loadBalancer.timeoutSec
      .Values.global.loadBalancer.connectionDraining.drainingTimeoutSeconds → .Values.global.loadBalancer.connectionDraining.drainingTimeoutSec
  • In charts/wandb-base/templates/service.yaml (lines 18, 22, 45, 50), update
      $loadBalancer.healthCheck.timeoutSeconds → $loadBalancer.healthCheck.timeoutSec
      $loadBalancer.connectionDraining.drainingTimeoutSeconds → $loadBalancer.connectionDraining.drainingTimeoutSec

After applying, re-run your ripgrep to confirm no remaining timeoutSeconds or drainingTimeoutSeconds references.
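As a sketch of the corrected wiring in charts/operator-wandb/templates/ingress.yaml after the rename (the Azure AGIC annotation keys below are illustrative; keep whatever annotation names the chart already emits):

{{- if .Values.global.loadBalancer.timeoutSec }}
appgw.ingress.kubernetes.io/request-timeout: {{ .Values.global.loadBalancer.timeoutSec | quote }}
{{- end }}
{{- if .Values.global.loadBalancer.connectionDraining.enabled }}
appgw.ingress.kubernetes.io/connection-draining: "true"
appgw.ingress.kubernetes.io/connection-draining-timeout: {{ .Values.global.loadBalancer.connectionDraining.drainingTimeoutSec | quote }}
{{- end }}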

🤖 Prompt for AI Agents
In charts/operator-wandb/values.yaml lines 199–204, rename the loadBalancer keys
timeoutSeconds → timeoutSec and connectionDraining.drainingTimeoutSeconds →
connectionDraining.drainingTimeoutSec; then update the template references in
charts/operator-wandb/templates/ingress.yaml (around lines 61, 63, 121, 123)
replacing .Values.global.loadBalancer.timeoutSeconds →
.Values.global.loadBalancer.timeoutSec and
.Values.global.loadBalancer.connectionDraining.drainingTimeoutSeconds →
.Values.global.loadBalancer.connectionDraining.drainingTimeoutSec; also update
charts/wandb-base/templates/service.yaml (around lines 18, 22, 45, 50) replacing
$loadBalancer.healthCheck.timeoutSeconds → $loadBalancer.healthCheck.timeoutSec
and $loadBalancer.connectionDraining.drainingTimeoutSeconds →
$loadBalancer.connectionDraining.drainingTimeoutSec; after making these changes
run ripgrep to verify there are no remaining timeoutSeconds or
drainingTimeoutSeconds occurrences.
