feat: vllm data parallelism multi-node support in operator #3595
Conversation
Force-pushed from 076f85f to fa5f9ee
Walkthrough

Centralizes multinode flag injection into a shared utility; SGLang delegates to it. VLLM adds conditional Ray vs. data-parallel injection, world-size/GPU parsing, and a new UpdateContainer parameter for leader resolution. CRDs remove dynamoComponent/dynamoTag and dynamoGraph. Tests and docs updated for the new flows.
Sequence Diagram(s)

```mermaid
sequenceDiagram
    autonumber
    participant Op as Operator
    participant V as VLLMBackend
    participant M as MultinodeDeployer
    participant U as Utils (injection)
    Op->>V: UpdateContainer(container, nodes, role, comp, svcName, multinodeDeployer)
    V->>V: getWorldSize(), getContainerGPUs(), getFlagValue()
    alt Ray-based distributed launch
        V->>M: ResolveLeaderHostname()
        M-->>V: headAddress
        V->>U: injectFlagsIntoContainerCommand(... Ray flags ..., needsShell=true)
        U-->>V: container updated
    else Data-parallel launch
        V->>V: compute start-rank, dp size/local, RPC port
        V->>U: injectFlagsIntoContainerCommand(... data-parallel flags ..., needsShell=maybe)
        U-->>V: container updated
    else No injection needed
        V-->>V: leave args unchanged
    end
    V-->>Op: updated container
```

```mermaid
sequenceDiagram
    autonumber
    participant Op as Operator
    participant S as SGLangBackend
    participant U as Utils (injection)
    Op->>S: UpdateContainer(container, ...)
    S->>U: injectFlagsIntoContainerCommand(container, flags, needsShell, "sglang")
    U-->>S: container updated
    S-->>Op: updated container
```
Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~60 minutes

Pre-merge checks: ❌ 2 warnings, ✅ 1 passed
Walkthrough

CRD schemas were updated to remove certain public fields and adjust descriptions. Operator logic refactors introduce shared flag-injection utilities, simplify SGLang handling, and add VLLM multinode launch paths (Ray-based and data-parallel), with corresponding tests. Documentation was updated to explain multinode modes, probe behavior, and ports.
Sequence Diagram(s)

```mermaid
sequenceDiagram
    autonumber
    participant User as Deployment Spec
    participant Operator as Operator (VLLM Backend)
    participant Node as Pod/Container
    participant Utils as Flag Injection Utils
    Note over Operator: UpdateContainer for VLLM
    User->>Operator: Provide args, role, resources
    Operator->>Operator: Compute worldSize, GPUsPerNode
    alt Ray-based multinode
        Operator->>Node: Leader: prepend "ray start --head" to command
        Operator->>Node: Worker: prepend "ray start --address=<leader>"
    else Data-parallel multinode
        Operator->>Utils: injectFlagsIntoContainerCommand(..., "--dp-address ... --dp-size-local ... --rpc-port 13445 --start-rank <n>", needsShell)
        Utils-->>Operator: Container updated
    else Single-node / no-op
        Operator->>Operator: Log no injected flags
    end
    Note over Node: Probes adjusted per role/mode (as documented)
```

```mermaid
sequenceDiagram
    autonumber
    participant Backend as Backend (SGLang/VLLM)
    participant Utils as injectFlagsIntoContainerCommand
    participant K8s as corev1.Container
    Backend->>Utils: Inject flags into container (flags, needsShell)
    alt Python entrypoint
        alt needsShell
            Utils->>K8s: Wrap with ["sh","-c"] and exec-preserving string
        else direct
            Utils->>K8s: Append split flags to Args
        end
    else Non-Python command
        Utils->>K8s: Regex-inject flags into first python ... sglang occurrence within Args
    end
```
Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~60 minutes

Pre-merge checks: ❌ 2 warnings, ✅ 1 passed
Actionable comments posted: 3
🧹 Nitpick comments (9)

deploy/cloud/operator/internal/dynamo/backend_sglang.go (1)

Lines 18-27: Remove unused helper (and import) to avoid dead code. `isPythonCommand` is no longer referenced; drop it and the `regexp` import to keep the file lean.

```diff
-import (
-	"fmt"
-	"regexp"
+import (
+	"fmt"
 	"github.com/ai-dynamo/dynamo/deploy/cloud/operator/api/v1alpha1"
 	corev1 "k8s.io/api/core/v1"
 	"sigs.k8s.io/controller-runtime/pkg/log"
 )
@@
-// isPythonCommand checks if the command is a Python interpreter
-func isPythonCommand(cmd string) bool {
-	if cmd == "python" || cmd == "python3" {
-		return true
-	}
-	// Match python with version numbers like python3.11, python2.7, etc.
-	// Also support absolute paths like /usr/bin/python3.8, /opt/python/bin/python3.11
-	matched, _ := regexp.MatchString(`^(.*/)?(python\d*(\.\d+)*)$`, cmd)
-	return matched
-}
```

docs/kubernetes/multinode-deployment.md (2)
Lines 191-202: Use headings instead of emphasis for section titles (MD036). Replace bolded mode titles with proper subheadings for readability and lint compliance.

```diff
-**1. Ray-Based Mode (Tensor/Pipeline Parallelism across nodes)**
+#### Ray-Based Mode (Tensor/Pipeline Parallelism across nodes)
```
Lines 203-216: Same here: convert the emphasized title to a heading (MD036). Switch to a subheading for Data Parallel Mode.

```diff
-**2. Data Parallel Mode (Multiple model instances across nodes)**
+#### Data Parallel Mode (Multiple model instances across nodes)
```

deploy/cloud/operator/internal/dynamo/backend_vllm_test.go (3)
Lines 93-99: Fix tautological assertion; snapshot args before calling UpdateContainer. The current check compares the mutated slice to itself and will always pass.

```diff
-	// Call UpdateContainer
-	backend.UpdateContainer(tt.initialContainer, tt.numberOfNodes, tt.role, tt.component, "test-service", tt.multinodeDeployer)
-
-	if tt.expectNotModified {
-		// Args should not have changed
-		g.Expect(tt.initialContainer.Args).To(gomega.Equal(tt.initialContainer.Args))
+	// Snapshot args before mutation
+	priorArgs := append([]string{}, tt.initialContainer.Args...)
+	backend.UpdateContainer(tt.initialContainer, tt.numberOfNodes, tt.role, tt.component, "test-service", tt.multinodeDeployer)
+	if tt.expectNotModified {
+		g.Expect(tt.initialContainer.Args).To(gomega.Equal(priorArgs))
```
Lines 37-45: Leader probes should not be asserted as removed; initialize probes to verify behavior. Leader probes remain active per design, so the test should not mark expectProbesRemoved for the leader. Also set non-nil probes to actually test removal for workers.

```diff
-	name: "multinode leader prepends ray start --head",
+	name: "multinode leader prepends ray start --head (probes unchanged)",
@@
-	expectProbesRemoved: true,
+	expectProbesRemoved: false,
```

Follow-up: initialize Liveness/Readiness/Startup probes in worker cases to validate removal.
Lines 252-264: Add a test for non-divisible GPUs/world size (rounding semantics). When GPUs_per_node is not divisible by TP×PP, dataParallelSizeLocal truncates. Add a case to document the expected behavior (or enforce divisibility).

deploy/cloud/operator/internal/dynamo/backend_vllm.go (3)
Lines 31-36: Fix misleading comment (leader probes aren't removed). The code removes probes only for workers; update the comment to avoid confusion.

```diff
-	// Remove probes for multinode worker and leader
+	// Remove probes for multinode workers
```
Lines 90-99: Validate GPU/parallelism divisibility before DP injection. dataParallelSizeLocal uses integer division; if GPUs_per_node is not divisible by TP×PP, the local size truncates and can underutilize GPUs. Warn or fail fast.

```diff
-	dataParallelSizeLocal := getContainerGPUs(container) / getWorldSize(container)
+	containerGPUs := getContainerGPUs(container)
+	worldSize := getWorldSize(container)
+	if worldSize == 0 {
+		worldSize = 1
+	}
+	if containerGPUs%worldSize != 0 {
+		log.Log.WithName("vllm-backend").Info("GPUs_per_node not divisible by TP×PP; truncating data-parallel workers per node",
+			"gpusPerNode", containerGPUs, "worldSize", worldSize)
+	}
+	dataParallelSizeLocal := containerGPUs / worldSize
```
Lines 131-143: Consider supporting the `--flag=value` style. getFlagValue only handles `--flag value`; many CLIs also accept `--flag=value`, and supporting both improves robustness.

```diff
 	for i, arg := range container.Args {
 		if arg == flag && (i+1 < len(container.Args)) {
 			flagValue, err := strconv.ParseInt(container.Args[i+1], 10, 64)
 			if err != nil {
 				continue
 			}
 			return flagValue
+		}
+		// Support --flag=value form
+		if strings.HasPrefix(arg, flag+"=") {
+			val := strings.TrimPrefix(arg, flag+"=")
+			if v, err := strconv.ParseInt(val, 10, 64); err == nil {
+				return v
+			}
 		}
 	}
```
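A standalone helper along these lines illustrates the dual-form parsing the review suggests. This is a sketch, not the repository's actual code; the name `getFlagValue` and the `int64` return type are taken from the review context, and the zero-value-on-absence behavior is an assumption:

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// getFlagValue scans args for an integer-valued flag, accepting both
// "--flag value" and "--flag=value" forms. It returns 0 when the flag
// is absent or its value does not parse as an integer.
func getFlagValue(args []string, flag string) int64 {
	for i, arg := range args {
		// "--flag value" form: the value is the next element.
		if arg == flag && i+1 < len(args) {
			if v, err := strconv.ParseInt(args[i+1], 10, 64); err == nil {
				return v
			}
			continue
		}
		// "--flag=value" form: the value follows '=' in the same element.
		if strings.HasPrefix(arg, flag+"=") {
			if v, err := strconv.ParseInt(strings.TrimPrefix(arg, flag+"="), 10, 64); err == nil {
				return v
			}
		}
	}
	return 0
}

func main() {
	args := []string{"--tensor-parallel-size", "4", "--pipeline-parallel-size=2"}
	fmt.Println(getFlagValue(args, "--tensor-parallel-size"))   // 4 (space form)
	fmt.Println(getFlagValue(args, "--pipeline-parallel-size")) // 2 (equals form)
	fmt.Println(getFlagValue(args, "--data-parallel-size"))     // 0 (absent)
}
```

Handling both forms in one loop keeps the caller-facing behavior identical regardless of how the user formatted their vLLM CLI arguments.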
📜 Review details

Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro

📒 Files selected for processing (10)

- deploy/cloud/helm/crds/templates/nvidia.com_dynamocomponentdeployments.yaml (0 hunks)
- deploy/cloud/helm/crds/templates/nvidia.com_dynamographdeployments.yaml (2 hunks)
- deploy/cloud/operator/config/crd/bases/nvidia.com_dynamocomponentdeployments.yaml (0 hunks)
- deploy/cloud/operator/config/crd/bases/nvidia.com_dynamographdeployments.yaml (2 hunks)
- deploy/cloud/operator/docs/footer.md (2 hunks)
- deploy/cloud/operator/internal/dynamo/backend_sglang.go (1 hunk)
- deploy/cloud/operator/internal/dynamo/backend_vllm.go (2 hunks)
- deploy/cloud/operator/internal/dynamo/backend_vllm_test.go (9 hunks)
- deploy/cloud/operator/internal/dynamo/utils.go (1 hunk)
- docs/kubernetes/multinode-deployment.md (1 hunk)
💤 Files with no reviewable changes (2)
- deploy/cloud/helm/crds/templates/nvidia.com_dynamocomponentdeployments.yaml
- deploy/cloud/operator/config/crd/bases/nvidia.com_dynamocomponentdeployments.yaml
🧰 Additional context used
🧠 Learnings (1)
📚 Learning: 2025-07-25T22:34:11.384Z
Learnt from: nnshah1
PR: ai-dynamo/dynamo#2124
File: components/backends/vllm/deploy/disagg.yaml:54-60
Timestamp: 2025-07-25T22:34:11.384Z
Learning: In vLLM worker deployments, startup probes (with longer periods and higher failure thresholds like periodSeconds: 10, failureThreshold: 60) are used to handle the slow model loading startup phase, while liveness probes are intentionally kept aggressive (periodSeconds: 5, failureThreshold: 1) for quick failure detection once the worker is operational. This pattern separates startup concerns from operational health monitoring in GPU-heavy workloads.
Applied to files:
deploy/cloud/operator/docs/footer.md
🧬 Code graph analysis (3)

deploy/cloud/operator/internal/dynamo/backend_vllm.go (2)

- deploy/cloud/operator/internal/dynamo/graph.go (4): Role (559-559), MultinodeDeployer (615-619), RoleLeader (562-562), RoleWorker (563-563)
- deploy/cloud/operator/internal/consts/consts.go (1): KubeResourceGPUNvidia (47-47)

deploy/cloud/operator/internal/dynamo/utils.go (1)

- benchmarks/profiler/utils/config.py (1): Container (49-54)

deploy/cloud/operator/internal/dynamo/backend_vllm_test.go (3)

- deploy/cloud/operator/internal/dynamo/backend_vllm.go (1): VLLMPort (15-15)
- deploy/cloud/operator/internal/dynamo/graph.go (5): Role (559-559), MultinodeDeployer (615-619), RoleMain (564-564), RoleLeader (562-562), RoleWorker (563-563)
- deploy/cloud/operator/api/v1alpha1/dynamocomponentdeployment_types.go (1): DynamoComponentDeploymentSharedSpec (48-106)
🪛 markdownlint-cli2 (0.18.1)
docs/kubernetes/multinode-deployment.md
191-191: Emphasis used instead of a heading
(MD036, no-emphasis-as-heading)
203-203: Emphasis used instead of a heading
(MD036, no-emphasis-as-heading)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (2)
- GitHub Check: Validate PR title and add label
- GitHub Check: Build and Test - dynamo
🔇 Additional comments (7)

deploy/cloud/helm/crds/templates/nvidia.com_dynamographdeployments.yaml (2)

- Lines 69-70: Deployment-scoped env description reads correctly. The revised wording matches the deployment-level semantics and clarifies override scope.
- Lines 10386-10387: Concise services description LGTM. The new phrasing is clear without altering schema semantics.

deploy/cloud/operator/config/crd/bases/nvidia.com_dynamographdeployments.yaml (2)

- Lines 69-70: Documentation wording is consistent. The updated description correctly reflects deployment-wide env scope, matching the rest of the spec.
- Lines 10386-10387: Services description looks good. Wording is clearer without altering behavior.

deploy/cloud/operator/docs/footer.md (2)

- Lines 98-107: Mode selection conditions look correct (world_size vs GPUs). Docs align with code paths (Ray when TP×PP > GPUs; DP when TP×PP×DP > GPUs). LGTM.
- Lines 204-206: Confirm whether port 13445 needs explicit containerPort/service exposure. If DP coordination requires inbound connections to the leader, ensure:
  - container.ports includes 13445 (optional for Pod IP access, required if exposed via a Service).
  - Any Service definitions include this port when needed.

deploy/cloud/operator/internal/dynamo/backend_sglang.go (1)

- Line 62: Good refactor: centralizing flag injection. Using injectFlagsIntoContainerCommand removes duplication and keeps behavior consistent across backends.
Actionable comments posted: 1

♻️ Duplicate comments (1)

deploy/cloud/operator/internal/dynamo/utils.go (1)

Lines 31-61: Data-parallel injection still misses `/usr/bin/env python` launches. When the image uses an env wrapper (`command: ["/usr/bin/env", "python3"]`, `args: ["-m", "vllm.entrypoints.api_server", …]`), the first branch rejects it (`isPythonCommand("/usr/bin/env")` → false) and the fallback scans each arg in isolation. Because the interpreter and module tokens live in different slice entries, the regex never fires and all distributed flags are dropped, breaking the multinode DP flow for that common launch pattern, despite the earlier review note on the same gap. Please treat env/python pairs as Python (or join command+args before attempting injection) so `/usr/bin/env python* …` receives the flags.
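One way to close the env-wrapper gap described above is to check the token after `env` as well. This is an illustrative sketch, not the repository's code; `isPythonLaunch` is a hypothetical name, and the regex mirrors the pattern quoted in the earlier review comment:

```go
package main

import (
	"fmt"
	"regexp"
)

// Matches "python", "python3", "python3.11", and absolute paths such as
// "/usr/bin/python3" (same pattern the review quotes from the repo).
var pythonRe = regexp.MustCompile(`^(.*/)?(python\d*(\.\d+)*)$`)

// isPythonCommand reports whether a single command token looks like a
// Python interpreter.
func isPythonCommand(cmd string) bool {
	return pythonRe.MatchString(cmd)
}

// isPythonLaunch additionally treats the common env-wrapper form
// ["/usr/bin/env", "python3", ...] as a Python launch by inspecting the
// token that follows "env".
func isPythonLaunch(command []string) bool {
	if len(command) == 0 {
		return false
	}
	if isPythonCommand(command[0]) {
		return true
	}
	if (command[0] == "env" || command[0] == "/usr/bin/env") && len(command) > 1 {
		return isPythonCommand(command[1])
	}
	return false
}

func main() {
	fmt.Println(isPythonLaunch([]string{"/usr/bin/env", "python3"}))  // true
	fmt.Println(isPythonLaunch([]string{"python3.11", "-m", "vllm"})) // true
	fmt.Println(isPythonLaunch([]string{"sh", "-c"}))                 // false
}
```

The alternative fix the comment mentions, joining command and args into one string before regex injection, would also work, at the cost of re-splitting afterwards.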
🧹 Nitpick comments (1)

deploy/cloud/operator/docs/footer.md (1)

Lines 97-106: Simplify redundant probe behavior documentation. Both Ray-Based Mode and Data Parallel Mode specify identical probe behavior (worker nodes have all probes removed, leader nodes keep all probes active). Consider consolidating this into a single statement to reduce redundancy and improve readability.

Apply this diff to simplify the documentation:

```diff
 #### VLLM Backend
-The operator automatically selects between two deployment modes based on parallelism configuration:
-
-**Ray-Based Mode** (when `world_size > GPUs_per_node`):
-- **Worker nodes**: All probes (liveness, readiness, startup) are removed
-- **Leader nodes**: All probes remain active
-
-**Data Parallel Mode** (when `world_size × data_parallel_size > GPUs_per_node`):
+The operator automatically selects between Ray-Based Mode (when `world_size > GPUs_per_node`) and Data Parallel Mode (when `world_size × data_parallel_size > GPUs_per_node`) based on parallelism configuration. For both modes:
+
 - **Worker nodes**: All probes (liveness, readiness, startup) are removed
 - **Leader nodes**: All probes remain active
```
📜 Review details
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (10)

- deploy/cloud/helm/crds/templates/nvidia.com_dynamocomponentdeployments.yaml (0 hunks)
- deploy/cloud/helm/crds/templates/nvidia.com_dynamographdeployments.yaml (2 hunks)
- deploy/cloud/operator/config/crd/bases/nvidia.com_dynamocomponentdeployments.yaml (0 hunks)
- deploy/cloud/operator/config/crd/bases/nvidia.com_dynamographdeployments.yaml (2 hunks)
- deploy/cloud/operator/docs/footer.md (2 hunks)
- deploy/cloud/operator/internal/dynamo/backend_sglang.go (1 hunk)
- deploy/cloud/operator/internal/dynamo/backend_vllm.go (2 hunks)
- deploy/cloud/operator/internal/dynamo/backend_vllm_test.go (9 hunks)
- deploy/cloud/operator/internal/dynamo/utils.go (1 hunk)
- docs/kubernetes/multinode-deployment.md (1 hunk)
💤 Files with no reviewable changes (2)
- deploy/cloud/helm/crds/templates/nvidia.com_dynamocomponentdeployments.yaml
- deploy/cloud/operator/config/crd/bases/nvidia.com_dynamocomponentdeployments.yaml
🧰 Additional context used
🧠 Learnings (1)
📚 Learning: 2025-07-25T22:34:11.384Z
Learnt from: nnshah1
PR: ai-dynamo/dynamo#2124
File: components/backends/vllm/deploy/disagg.yaml:54-60
Timestamp: 2025-07-25T22:34:11.384Z
Learning: In vLLM worker deployments, startup probes (with longer periods and higher failure thresholds like periodSeconds: 10, failureThreshold: 60) are used to handle the slow model loading startup phase, while liveness probes are intentionally kept aggressive (periodSeconds: 5, failureThreshold: 1) for quick failure detection once the worker is operational. This pattern separates startup concerns from operational health monitoring in GPU-heavy workloads.
Applied to files:
deploy/cloud/operator/docs/footer.md
🧬 Code graph analysis (3)

deploy/cloud/operator/internal/dynamo/backend_vllm.go (2)

- deploy/cloud/operator/internal/dynamo/graph.go (4): Role (559-559), MultinodeDeployer (615-619), RoleLeader (562-562), RoleWorker (563-563)
- deploy/cloud/operator/internal/consts/consts.go (1): KubeResourceGPUNvidia (47-47)

deploy/cloud/operator/internal/dynamo/backend_vllm_test.go (5)

- deploy/cloud/operator/internal/dynamo/backend_vllm.go (1): VLLMPort (15-15)
- deploy/cloud/operator/internal/dynamo/graph.go (6): Role (559-559), MultinodeDeployer (615-619), RoleMain (564-564), Resources (82-87), RoleLeader (562-562), RoleWorker (563-563)
- deploy/cloud/operator/api/v1alpha1/dynamocomponentdeployment_types.go (1): DynamoComponentDeploymentSharedSpec (48-106)
- deploy/cloud/operator/internal/dynamo/grove.go (1): GroveMultinodeDeployer (23-25)
- deploy/cloud/operator/internal/dynamo/lws.go (1): LWSMultinodeDeployer (5-7)

deploy/cloud/operator/internal/dynamo/utils.go (1)

- benchmarks/profiler/utils/config.py (1): Container (49-54)
🪛 markdownlint-cli2 (0.18.1)
docs/kubernetes/multinode-deployment.md
191-191: Emphasis used instead of a heading
(MD036, no-emphasis-as-heading)
203-203: Emphasis used instead of a heading
(MD036, no-emphasis-as-heading)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
- GitHub Check: Build and Test - dynamo
🔇 Additional comments (5)

deploy/cloud/operator/config/crd/bases/nvidia.com_dynamographdeployments.yaml (2)

- Lines 69-70: Env description clarification looks good. The updated wording cleanly scopes shared env vars versus service overrides. Thanks for tightening the docs.
- Lines 10386-10387: Service description update is clear. The refreshed text plainly states the field's role in the deployment spec. No concerns here.

deploy/cloud/operator/docs/footer.md (1)

- Lines 204-205: Documentation aligns with PR objectives. The addition of the Data Parallel RPC Port (13445) and clarification of the Ray Head Port usage correctly reflect the new multinode data-parallel support for VLLM.

deploy/cloud/operator/internal/dynamo/backend_sglang.go (2)

- Line 62: Refactoring improves maintainability. Delegating flag injection to the shared injectFlagsIntoContainerCommand utility aligns with the PR objective of centralizing multinode shell interpolation logic and reduces code duplication.
- Lines 19-27: Retain isPythonCommand; it's still in use: calls appear in backend_trtllm.go (line 112) and utils.go (line 31), and it is exercised by backend_sglang_test.go. Likely an incorrect or invalid review comment.
```go
backend.UpdateContainer(tt.initialContainer, tt.numberOfNodes, tt.role, &v1alpha1.DynamoComponentDeploymentSharedSpec{}, "test-service", tt.multinodeDeployer)

if !reflect.DeepEqual(tt.initialContainer.Args, tt.expectedArgs) {
	t.Errorf("UpdateContainer() args = %v, want %v", tt.initialContainer.Args, tt.expectedArgs)
}

expectedCommand := tt.initialContainer.Command
if !reflect.DeepEqual(tt.initialContainer.Command, expectedCommand) {
	t.Errorf("UpdateContainer() should preserve shell command, got: %v, want: %v", tt.initialContainer.Command, expectedCommand)
}
```
Shell-command equality check is a no-op

You snapshot `expectedCommand` after calling `UpdateContainer`, so it aliases the mutated slice; the subsequent comparison can never fail. Copy the slice before invoking the backend to actually assert command preservation.
```diff
-    backend.UpdateContainer(tt.initialContainer, tt.numberOfNodes, tt.role, &v1alpha1.DynamoComponentDeploymentSharedSpec{}, "test-service", tt.multinodeDeployer)
-
-    if !reflect.DeepEqual(tt.initialContainer.Args, tt.expectedArgs) {
+    originalCommand := append([]string{}, tt.initialContainer.Command...)
+
+    backend.UpdateContainer(tt.initialContainer, tt.numberOfNodes, tt.role, &v1alpha1.DynamoComponentDeploymentSharedSpec{}, "test-service", tt.multinodeDeployer)
+
+    if !reflect.DeepEqual(tt.initialContainer.Args, tt.expectedArgs) {
         t.Errorf("UpdateContainer() args = %v, want %v", tt.initialContainer.Args, tt.expectedArgs)
     }
-    expectedCommand := tt.initialContainer.Command
-    if !reflect.DeepEqual(tt.initialContainer.Command, expectedCommand) {
-        t.Errorf("UpdateContainer() should preserve shell command, got: %v, want: %v", tt.initialContainer.Command, expectedCommand)
+    if !reflect.DeepEqual(tt.initialContainer.Command, originalCommand) {
+        t.Errorf("UpdateContainer() should preserve shell command, got: %v, want: %v", tt.initialContainer.Command, originalCommand)
     }
```
📝 Committable suggestion

‼️ IMPORTANT: Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
```go
// Snapshot the original command before calling UpdateContainer
originalCommand := append([]string{}, tt.initialContainer.Command...)

backend.UpdateContainer(tt.initialContainer, tt.numberOfNodes, tt.role, &v1alpha1.DynamoComponentDeploymentSharedSpec{}, "test-service", tt.multinodeDeployer)

if !reflect.DeepEqual(tt.initialContainer.Args, tt.expectedArgs) {
	t.Errorf("UpdateContainer() args = %v, want %v", tt.initialContainer.Args, tt.expectedArgs)
}

// Verify that the command slice was preserved
if !reflect.DeepEqual(tt.initialContainer.Command, originalCommand) {
	t.Errorf("UpdateContainer() should preserve shell command, got: %v, want: %v", tt.initialContainer.Command, originalCommand)
}
```
🤖 Prompt for AI Agents

```text
In deploy/cloud/operator/internal/dynamo/backend_vllm_test.go around lines 175
to 184, the test captures expectedCommand after calling UpdateContainer which
aliases the mutated slice, making the comparison a no-op; copy the initial
container.Command slice (e.g., make a new slice and copy contents) before
invoking backend.UpdateContainer, then call UpdateContainer and compare the
post-call container.Command to the copied expected slice to properly assert
command preservation.
```
Overview:
Previously, only support for vllm TP multi-node existed within the operator using Ray. This PR adds support for the case where data parallelism ranks span across multiple nodes.
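The resulting selection logic, Ray when a single replica's world size (TP × PP) exceeds one container's GPUs, data-parallel flag injection when the DP ranks collectively (TP × PP × DP) do, can be sketched roughly as follows. All names here are illustrative, not the operator's actual identifiers:

```go
package main

import "fmt"

type launchMode int

const (
	singleNode launchMode = iota
	rayMultinode
	dataParallelMultinode
)

// selectLaunchMode sketches the decision rule: if one model replica's world
// size (TP × PP) exceeds the GPUs available in a single container, the
// replica itself must span nodes, so Ray is used; otherwise, if the total
// GPU demand across data-parallel ranks (TP × PP × DP) exceeds the
// container's GPUs, DP coordination flags are injected; else no multinode
// handling is needed.
func selectLaunchMode(tp, pp, dp, gpusPerNode int64) launchMode {
	worldSize := tp * pp
	switch {
	case worldSize > gpusPerNode:
		return rayMultinode
	case worldSize*dp > gpusPerNode:
		return dataParallelMultinode
	default:
		return singleNode
	}
}

func main() {
	fmt.Println(selectLaunchMode(16, 1, 1, 8)) // TP spans nodes -> Ray
	fmt.Println(selectLaunchMode(4, 1, 4, 8))  // DP ranks span nodes -> DP flags
	fmt.Println(selectLaunchMode(4, 1, 2, 8))  // fits on one node -> no-op
}
```

Note the ordering matters: the Ray check must come first, since a replica that cannot fit on one node needs a cross-node runtime regardless of its DP degree.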
Details:

- deploy/cloud/operator/internal/dynamo/backend_sglang.go: moves the shell-interpolation logic into utils.go so that it can also be leveraged by backend_vllm.go.
- deploy/cloud/operator/internal/dynamo/backend_vllm.go: if world size (TP × PP) > container GPU count, use the Ray launch strategy; if world size across DP ranks (TP × PP × DP) > container GPU count, inject data-parallel flags for multi-node coordination.

Where should the reviewer start?
Related Issues: (use one of the action keywords Closes / Fixes / Resolves / Relates to)
Summary by CodeRabbit

- New Features
- Documentation
- Refactor
- Tests