
Conversation


@GuanLuo GuanLuo commented Oct 14, 2025

Signed-off-by: Guan Luo [email protected]

Overview:

To support Triton Inference Server <-> Dynamo migration, this PR enhances the Dynamo gRPC frontend to return output data in raw_output_contents, as expected by the Triton gRPC client.

Note that the Triton gRPC client always sends and receives data via the raw_input_contents / raw_output_contents fields. For completeness, the Dynamo gRPC frontend mirrors the input format: output data is returned in raw_output_contents if and only if input data was received in raw_input_contents.

Details:

Where should the reviewer start?

Related Issues: (use one of the action keywords Closes / Fixes / Resolves / Relates to)

Summary by CodeRabbit

  • New Features

    • Added support for returning raw tensor output contents, enabling easier integration with tools that consume byte-level results.
    • Unified behavior across standard and streaming tensor inference for consistent outputs.
  • Refactor

    • Improved tensor inference response handling to reliably propagate output-format preferences, enhancing correctness across pipelines.
  • Tests

    • Added comprehensive tensor-model tests, including dtype validation and raw-output verification, improving reliability and preventing regressions.

@GuanLuo GuanLuo requested a review from a team as a code owner October 14, 2025 01:46

copy-pr-bot bot commented Oct 14, 2025

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

@github-actions github-actions bot added the fix label Oct 14, 2025

coderabbitai bot commented Oct 14, 2025

Walkthrough

Introduces ExtendedNvCreateTensorResponse to wrap NvCreateTensorResponse and carry a to_raw_output_contents flag. Updates model_infer and model_stream_infer flows to use the wrapper, adjusts TryFrom conversions, adds two helper methods on ModelInferResponse, and extends tests to validate raw output contents and tensor dtypes.

Changes

Tensor response wrapper and conversions — lib/llm/src/grpc/service/tensor.rs
Adds ExtendedNvCreateTensorResponse { response, to_raw_output_contents }. Replaces TryFrom<NvCreateTensorResponse> with TryFrom<ExtendedNvCreateTensorResponse> for ModelInferResponse and ModelStreamInferResponse. Adds ModelInferResponse::add_raw_output_contents and ::fill_last_tensor_contents. Updates construction logic to branch on to_raw_output_contents.

KServe service integration — lib/llm/src/grpc/service/kserve.rs
Propagates to_raw_output_contents from the request (based on raw_input_contents presence). Wraps NvCreateTensorResponse into ExtendedNvCreateTensorResponse in both model_infer and model_stream_infer paths before converting to responses. Existing non-tensor/OpenAI paths unchanged.

Tests for tensor flows — lib/llm/tests/kserve_service.rs
Adds TestPort::TensorModelTypes (8996) and fixture int_input. Introduces test_tensor_infer_dtypes. Extends validate_tensor_response to accept an expected_raw_outputs map and asserts raw_output_contents when provided. Updates callers accordingly.

Sequence Diagram(s)

sequenceDiagram
  autonumber
  actor Client
  participant KServe as KServeService
  participant Backend as Tensor Backend
  participant Wrap as ExtendedNvCreateTensorResponse
  participant Build as Response Builder

  Client->>KServe: ModelInfer (tensor) request
  Note over Client,KServe: Request may include raw_input_contents
  KServe->>Backend: CreateTensor / Infer
  Backend-->>KServe: NvCreateTensorResponse
  KServe->>Wrap: Wrap + set to_raw_output_contents
  Note right of Wrap: to_raw_output_contents = raw_input_contents.present()
  KServe->>Build: TryFrom(ExtendedNvCreateTensorResponse)
  alt to_raw_output_contents == true
    Build->>Build: add_raw_output_contents(...)
  else to_raw_output_contents == false
    Build->>Build: fill_last_tensor_contents(...)
  end
  Build-->>Client: ModelInferResponse
sequenceDiagram
  autonumber
  actor Client
  participant KServe as KServeService
  participant Stream as Stream Loop
  participant Wrap as ExtendedNvCreateTensorResponse
  participant Build as Stream Response Builder

  Client->>KServe: ModelStreamInfer (tensor)
  loop for each chunk
    KServe->>Stream: receive tensor chunk
    Stream-->>KServe: NvCreateTensorResponse (chunk)
    KServe->>Wrap: Wrap + set to_raw_output_contents
    KServe->>Build: TryFrom(ExtendedNvCreateTensorResponse)
    Build-->>Client: ModelStreamInferResponse (chunk)
  end

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

Poem

A rabbit wraps a tensor tight,
With flags to guide its streaming flight.
Raw bytes hop through with careful sense,
Or nestle in the last contents.
Tests nibble dtypes, munch by munch—
All carrots green, a tidy lunch. 🥕🐇

Pre-merge checks

❌ Failed checks (2 warnings)

  • Description Check — ⚠️ Warning: The PR description uses the required template headings but leaves the Details and "Where should the reviewer start" sections empty and includes a Signed-off-by line outside the template, so it does not yet document the specific code changes or guide the reviewer to key files. Resolution: populate the Details section with a summary of the specific code changes, fill in the "Where should the reviewer start" section with key file paths, and move the Signed-off-by line to the commit metadata rather than the PR description.
  • Docstring Coverage — ⚠️ Warning: Docstring coverage is 71.43%, below the required threshold of 80.00%. You can run @coderabbitai generate docstrings to improve docstring coverage.

✅ Passed checks (1 passed)

  • Title Check — ✅ Passed: The title succinctly describes the core change of enhancing the gRPC frontend to return output in the raw content field for Triton client compatibility and aligns with the PR's main objective without extraneous details.



@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 0

🧹 Nitpick comments (1)
lib/llm/tests/kserve_service.rs (1)

268-280: Align tensor shape with provided contents.

int_input defaults to three elements but reports a shape of [1], which can hide mismatches during validation. Consider deriving the shape from the fixture payload to keep expectations consistent.

         inference::model_infer_request::InferInputTensor {
             name: "int_input".into(),
             datatype: "UINT32".into(),
-            shape: vec![1],
+            shape: vec![input.len() as i64],
             contents: Some(inference::InferTensorContents {
                 uint_contents: input,
                 ..Default::default()
             }),
📜 Review details

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between cd2389b and ba0afad.

📒 Files selected for processing (3)
  • lib/llm/src/grpc/service/kserve.rs (4 hunks)
  • lib/llm/src/grpc/service/tensor.rs (3 hunks)
  • lib/llm/tests/kserve_service.rs (8 hunks)
🧰 Additional context used
🧬 Code graph analysis (3)
lib/llm/src/grpc/service/tensor.rs (2)
lib/llm/src/grpc/service/openai.rs (3)
  • try_from (194-319)
  • try_from (325-374)
  • try_from (380-391)
lib/llm/src/protocols/tensor.rs (1)
  • len (80-95)
lib/llm/src/grpc/service/kserve.rs (2)
lib/llm/src/grpc/service/tensor.rs (4)
  • tensor_response_stream (52-128)
  • try_from (199-246)
  • try_from (469-496)
  • try_from (616-627)
lib/llm/src/protocols/tensor.rs (1)
  • from_annotated_stream (249-292)
lib/llm/tests/kserve_service.rs (3)
lib/llm/src/discovery/model_manager.rs (3)
  • default (51-53)
  • default (345-351)
  • new (57-66)
lib/llm/src/model_card.rs (2)
  • with_name_only (190-196)
  • name (223-225)
lib/llm/src/protocols/tensor.rs (2)
  • data_type (97-112)
  • len (80-95)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (5)
  • GitHub Check: clippy (launch/dynamo-run)
  • GitHub Check: clippy (.)
  • GitHub Check: clippy (lib/runtime/examples)
  • GitHub Check: clippy (lib/bindings/python)
  • GitHub Check: Build and Test - dynamo
🔇 Additional comments (2)
lib/llm/src/grpc/service/kserve.rs (1)

198-213: Nice mirror of raw input semantics.

Capturing to_raw_output_contents once and threading it through the extended tensor response cleanly aligns our behavior with Triton clients that always operate on raw buffers. This keeps the typed path untouched while guaranteeing raw outputs only when the request asked for them.
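The flag derivation praised here amounts to a one-line predicate. A minimal sketch, assuming a request shape like KServe's ModelInferRequest — the function name is illustrative, not the crate's actual API:

```rust
// Hypothetical stand-in for the KServe ModelInferRequest; only the field
// relevant to the raw-output decision is shown.
struct ModelInferRequest {
    raw_input_contents: Vec<Vec<u8>>,
}

// Mirror Triton semantics: return raw outputs iff the client sent raw inputs.
fn wants_raw_output(req: &ModelInferRequest) -> bool {
    !req.raw_input_contents.is_empty()
}

fn main() {
    let typed = ModelInferRequest { raw_input_contents: vec![] };
    let raw = ModelInferRequest { raw_input_contents: vec![vec![0u8; 4]] };
    assert!(!wants_raw_output(&typed));
    assert!(wants_raw_output(&raw));
}
```

Computing this once per request, before entering the (possibly streaming) response path, guarantees every chunk of a stream uses the same output format.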

lib/llm/src/grpc/service/tensor.rs (1)

500-524: Solid raw-output encoder.

The per-variant handling here preserves Triton’s little-endian expectations and keeps BYTES tensors length-prefixed, so clients reading raw_output_contents will get a faithful mirror of what they sent.
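The two encoding conventions mentioned — little-endian layout for fixed-width dtypes and length-prefixed elements for BYTES — can be sketched as below. These helper names are illustrative, not the crate's actual API:

```rust
// Fixed-width dtypes are laid out as contiguous little-endian elements,
// matching Triton's raw tensor representation.
fn encode_u32_le(values: &[u32]) -> Vec<u8> {
    values.iter().flat_map(|v| v.to_le_bytes()).collect()
}

// BYTES elements are length-prefixed: a 4-byte little-endian length,
// then the element's bytes, repeated per element.
fn encode_bytes(elements: &[&[u8]]) -> Vec<u8> {
    let mut out = Vec::new();
    for e in elements {
        out.extend_from_slice(&(e.len() as u32).to_le_bytes());
        out.extend_from_slice(e);
    }
    out
}

fn main() {
    assert_eq!(encode_u32_le(&[1, 256]), vec![1, 0, 0, 0, 0, 1, 0, 0]);
    assert_eq!(encode_bytes(&[b"ab".as_slice()]), vec![2, 0, 0, 0, b'a', b'b']);
}
```

Because the same conventions are used on the input side, a client that round-trips a tensor through raw_input_contents and raw_output_contents sees byte-identical payloads.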

Signed-off-by: Guan Luo <[email protected]>

GuanLuo commented Oct 14, 2025

/ok to test ec12389



GuanLuo commented Oct 14, 2025

/ok to test 7b964c0
