
Conversation


@GuanLuo GuanLuo commented Oct 14, 2025

Signed-off-by: Guan Luo [email protected]

Overview:

To support Triton Inference Server <-> Dynamo migration, this PR enhances the Dynamo gRPC frontend to return output data in raw_output_contents, as expected by the Triton gRPC client.

Note that the Triton gRPC client always sends and receives data via the raw_input_contents / raw_output_contents fields. For completeness, the Dynamo gRPC frontend mirrors the input format: output data is returned in raw_output_contents if and only if input data was received in raw_input_contents.

Details:

Where should the reviewer start?

Related Issues: (use one of the action keywords Closes / Fixes / Resolves / Relates to)

Summary by CodeRabbit

  • New Features

    • Added support for returning raw tensor output contents, enabling easier integration with tools that consume byte-level results.
    • Unified behavior across standard and streaming tensor inference for consistent outputs.
  • Refactor

    • Improved tensor inference response handling to reliably propagate output-format preferences, enhancing correctness across pipelines.
  • Tests

    • Added comprehensive tensor-model tests, including dtype validation and raw-output verification, improving reliability and preventing regressions.

@GuanLuo GuanLuo requested a review from a team as a code owner October 14, 2025 01:46

copy-pr-bot bot commented Oct 14, 2025

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

@github-actions github-actions bot added the fix label Oct 14, 2025

coderabbitai bot commented Oct 14, 2025

Walkthrough

Introduces ExtendedNvCreateTensorResponse to wrap NvCreateTensorResponse and carry a to_raw_output_contents flag. Updates model_infer and model_stream_infer flows to use the wrapper, adjusts TryFrom conversions, adds two helper methods on ModelInferResponse, and extends tests to validate raw output contents and tensor dtypes.

Changes

Tensor response wrapper and conversions — lib/llm/src/grpc/service/tensor.rs
Adds ExtendedNvCreateTensorResponse { response, to_raw_output_contents }. Replaces TryFrom<NvCreateTensorResponse> with TryFrom<ExtendedNvCreateTensorResponse> for ModelInferResponse and ModelStreamInferResponse. Adds ModelInferResponse::add_raw_output_contents and ::fill_last_tensor_contents. Updates construction logic to branch on to_raw_output_contents.

KServe service integration — lib/llm/src/grpc/service/kserve.rs
Propagates to_raw_output_contents from the request (based on raw_input_contents presence). Wraps NvCreateTensorResponse into ExtendedNvCreateTensorResponse in both model_infer and model_stream_infer paths before converting to responses. Existing non-tensor/OpenAI paths unchanged.

Tests for tensor flows — lib/llm/tests/kserve_service.rs
Adds TestPort::TensorModelTypes (8996) and fixture int_input. Introduces test_tensor_infer_dtypes. Extends validate_tensor_response to accept an expected_raw_outputs map and asserts raw_output_contents when provided. Updates callers accordingly.

Sequence Diagram(s)

sequenceDiagram
  autonumber
  actor Client
  participant KServe as KServeService
  participant Backend as Tensor Backend
  participant Wrap as ExtendedNvCreateTensorResponse
  participant Build as Response Builder

  Client->>KServe: ModelInfer (tensor) request
  Note over Client,KServe: Request may include raw_input_contents
  KServe->>Backend: CreateTensor / Infer
  Backend-->>KServe: NvCreateTensorResponse
  KServe->>Wrap: Wrap + set to_raw_output_contents
  Note right of Wrap: to_raw_output_contents = raw_input_contents.present()
  KServe->>Build: TryFrom(ExtendedNvCreateTensorResponse)
  alt to_raw_output_contents == true
    Build->>Build: add_raw_output_contents(...)
  else to_raw_output_contents == false
    Build->>Build: fill_last_tensor_contents(...)
  end
  Build-->>Client: ModelInferResponse
sequenceDiagram
  autonumber
  actor Client
  participant KServe as KServeService
  participant Stream as Stream Loop
  participant Wrap as ExtendedNvCreateTensorResponse
  participant Build as Stream Response Builder

  Client->>KServe: ModelStreamInfer (tensor)
  loop for each chunk
    KServe->>Stream: receive tensor chunk
    Stream-->>KServe: NvCreateTensorResponse (chunk)
    KServe->>Wrap: Wrap + set to_raw_output_contents
    KServe->>Build: TryFrom(ExtendedNvCreateTensorResponse)
    Build-->>Client: ModelStreamInferResponse (chunk)
  end

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

Poem

A rabbit wraps a tensor tight,
With flags to guide its streaming flight.
Raw bytes hop through with careful sense,
Or nestle in the last contents.
Tests nibble dtypes, munch by munch—
All carrots green, a tidy lunch. 🥕🐇

Pre-merge checks

❌ Failed checks (2 warnings)

  • Description Check — ⚠️ Warning: The PR description uses the required template headings but leaves the Details and "Where should the reviewer start" sections empty and includes a Signed-off-by line outside the template, so it does not yet document the specific code changes or guide the reviewer to key files. Resolution: populate the Details section with a summary of the specific code changes, fill in the "Where should the reviewer start" section with key file paths, and move the Signed-off-by line to the commit metadata rather than the PR description.
  • Docstring Coverage — ⚠️ Warning: Docstring coverage is 71.43%, below the required threshold of 80.00%. You can run @coderabbitai generate docstrings to improve docstring coverage.

✅ Passed checks (1 passed)

  • Title Check — ✅ Passed: The title succinctly describes the core change of enhancing the gRPC frontend to return output in the raw content field for Triton client compatibility and aligns with the PR's main objective without extraneous details.



@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 0

🧹 Nitpick comments (1)
lib/llm/tests/kserve_service.rs (1)

268-280: Align tensor shape with provided contents.

int_input defaults to three elements but reports a shape of [1], which can hide mismatches during validation. Consider deriving the shape from the fixture payload to keep expectations consistent.

         inference::model_infer_request::InferInputTensor {
             name: "int_input".into(),
             datatype: "UINT32".into(),
-            shape: vec![1],
+            shape: vec![input.len() as i64],
             contents: Some(inference::InferTensorContents {
                 uint_contents: input,
                 ..Default::default()
             }),
📜 Review details

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between cd2389b and ba0afad.

📒 Files selected for processing (3)
  • lib/llm/src/grpc/service/kserve.rs (4 hunks)
  • lib/llm/src/grpc/service/tensor.rs (3 hunks)
  • lib/llm/tests/kserve_service.rs (8 hunks)
🧰 Additional context used
🧬 Code graph analysis (3)
lib/llm/src/grpc/service/tensor.rs (2)
lib/llm/src/grpc/service/openai.rs (3)
  • try_from (194-319)
  • try_from (325-374)
  • try_from (380-391)
lib/llm/src/protocols/tensor.rs (1)
  • len (80-95)
lib/llm/src/grpc/service/kserve.rs (2)
lib/llm/src/grpc/service/tensor.rs (4)
  • tensor_response_stream (52-128)
  • try_from (199-246)
  • try_from (469-496)
  • try_from (616-627)
lib/llm/src/protocols/tensor.rs (1)
  • from_annotated_stream (249-292)
lib/llm/tests/kserve_service.rs (3)
lib/llm/src/discovery/model_manager.rs (3)
  • default (51-53)
  • default (345-351)
  • new (57-66)
lib/llm/src/model_card.rs (2)
  • with_name_only (190-196)
  • name (223-225)
lib/llm/src/protocols/tensor.rs (2)
  • data_type (97-112)
  • len (80-95)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (5)
  • GitHub Check: clippy (launch/dynamo-run)
  • GitHub Check: clippy (.)
  • GitHub Check: clippy (lib/runtime/examples)
  • GitHub Check: clippy (lib/bindings/python)
  • GitHub Check: Build and Test - dynamo
🔇 Additional comments (2)
lib/llm/src/grpc/service/kserve.rs (1)

198-213: Nice mirror of raw input semantics.

Capturing to_raw_output_contents once and threading it through the extended tensor response cleanly aligns our behavior with Triton clients that always operate on raw buffers. This keeps the typed path untouched while guaranteeing raw outputs only when the request asked for them.
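The flag derivation praised here amounts to a one-line predicate. A minimal sketch, assuming a request shape like KServe's ModelInferRequest — the function name is illustrative, not the crate's actual API:

```rust
// Hypothetical stand-in for the KServe ModelInferRequest; only the field
// relevant to the raw-output decision is shown.
struct ModelInferRequest {
    raw_input_contents: Vec<Vec<u8>>,
}

// Mirror Triton semantics: return raw outputs iff the client sent raw inputs.
fn wants_raw_output(req: &ModelInferRequest) -> bool {
    !req.raw_input_contents.is_empty()
}

fn main() {
    let typed = ModelInferRequest { raw_input_contents: vec![] };
    let raw = ModelInferRequest { raw_input_contents: vec![vec![0u8; 4]] };
    assert!(!wants_raw_output(&typed));
    assert!(wants_raw_output(&raw));
}
```

Computing this once per request, before entering the (possibly streaming) response path, guarantees every chunk of a stream uses the same output format.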

lib/llm/src/grpc/service/tensor.rs (1)

500-524: Solid raw-output encoder.

The per-variant handling here preserves Triton’s little-endian expectations and keeps BYTES tensors length-prefixed, so clients reading raw_output_contents will get a faithful mirror of what they sent.
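The two encoding conventions mentioned — little-endian layout for fixed-width dtypes and length-prefixed elements for BYTES — can be sketched as below. These helper names are illustrative, not the crate's actual API:

```rust
// Fixed-width dtypes are laid out as contiguous little-endian elements,
// matching Triton's raw tensor representation.
fn encode_u32_le(values: &[u32]) -> Vec<u8> {
    values.iter().flat_map(|v| v.to_le_bytes()).collect()
}

// BYTES elements are length-prefixed: a 4-byte little-endian length,
// then the element's bytes, repeated per element.
fn encode_bytes(elements: &[&[u8]]) -> Vec<u8> {
    let mut out = Vec::new();
    for e in elements {
        out.extend_from_slice(&(e.len() as u32).to_le_bytes());
        out.extend_from_slice(e);
    }
    out
}

fn main() {
    assert_eq!(encode_u32_le(&[1, 256]), vec![1, 0, 0, 0, 0, 1, 0, 0]);
    assert_eq!(encode_bytes(&[b"ab".as_slice()]), vec![2, 0, 0, 0, b'a', b'b']);
}
```

Because the same conventions are used on the input side, a client that round-trips a tensor through raw_input_contents and raw_output_contents sees byte-identical payloads.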

Signed-off-by: Guan Luo <[email protected]>

GuanLuo commented Oct 14, 2025

/ok to test ec12389



GuanLuo commented Oct 14, 2025

/ok to test 7b964c0
