-
Notifications
You must be signed in to change notification settings - Fork 641
fix: enhance gRPC frontend to return output in raw content field for Triton client compatibility #3600
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
…Triton client compatibility Signed-off-by: Guan Luo <[email protected]>
WalkthroughIntroduces ExtendedNvCreateTensorResponse to wrap NvCreateTensorResponse and carry a to_raw_output_contents flag. Updates model_infer and model_stream_infer flows to use the wrapper, adjusts TryFrom conversions, adds two helper methods on ModelInferResponse, and extends tests to validate raw output contents and tensor dtypes. Changes
Sequence Diagram(s)sequenceDiagram
autonumber
actor Client
participant KServe as KServeService
participant Backend as Tensor Backend
participant Wrap as ExtendedNvCreateTensorResponse
participant Build as Response Builder
Client->>KServe: ModelInfer (tensor) request
Note over Client,KServe: Request may include raw_input_contents
KServe->>Backend: CreateTensor / Infer
Backend-->>KServe: NvCreateTensorResponse
KServe->>Wrap: Wrap + set to_raw_output_contents
Note right of Wrap: to_raw_output_contents = raw_input_contents.present()
KServe->>Build: TryFrom(ExtendedNvCreateTensorResponse)
alt to_raw_output_contents == true
Build->>Build: add_raw_output_contents(...)
else to_raw_output_contents == false
Build->>Build: fill_last_tensor_contents(...)
end
Build-->>Client: ModelInferResponse
sequenceDiagram
autonumber
actor Client
participant KServe as KServeService
participant Stream as Stream Loop
participant Wrap as ExtendedNvCreateTensorResponse
participant Build as Stream Response Builder
Client->>KServe: ModelStreamInfer (tensor)
loop for each chunk
KServe->>Stream: receive tensor chunk
Stream-->>KServe: NvCreateTensorResponse (chunk)
KServe->>Wrap: Wrap + set to_raw_output_contents
KServe->>Build: TryFrom(ExtendedNvCreateTensorResponse)
Build-->>Client: ModelStreamInferResponse (chunk)
end
Estimated code review effort🎯 3 (Moderate) | ⏱️ ~25 minutes Poem
Pre-merge checks❌ Failed checks (2 warnings)
✅ Passed checks (1 passed)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 0
🧹 Nitpick comments (1)
lib/llm/tests/kserve_service.rs (1)
268-280
: Align tensor shape with provided contents.
int_input
defaults to three elements but reports a shape of[1]
, which can hide mismatches during validation. Consider deriving the shape from the fixture payload to keep expectations consistent.inference::model_infer_request::InferInputTensor { name: "int_input".into(), datatype: "UINT32".into(), - shape: vec![1], + shape: vec![input.len() as i64], contents: Some(inference::InferTensorContents { uint_contents: input, ..Default::default() }),
📜 Review details
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (3)
lib/llm/src/grpc/service/kserve.rs
(4 hunks)lib/llm/src/grpc/service/tensor.rs
(3 hunks)lib/llm/tests/kserve_service.rs
(8 hunks)
🧰 Additional context used
🧬 Code graph analysis (3)
lib/llm/src/grpc/service/tensor.rs (2)
lib/llm/src/grpc/service/openai.rs (3)
try_from
(194-319)try_from
(325-374)try_from
(380-391)lib/llm/src/protocols/tensor.rs (1)
len
(80-95)
lib/llm/src/grpc/service/kserve.rs (2)
lib/llm/src/grpc/service/tensor.rs (4)
tensor_response_stream
(52-128)try_from
(199-246)try_from
(469-496)try_from
(616-627)lib/llm/src/protocols/tensor.rs (1)
from_annotated_stream
(249-292)
lib/llm/tests/kserve_service.rs (3)
lib/llm/src/discovery/model_manager.rs (3)
default
(51-53)default
(345-351)new
(57-66)lib/llm/src/model_card.rs (2)
with_name_only
(190-196)name
(223-225)lib/llm/src/protocols/tensor.rs (2)
data_type
(97-112)len
(80-95)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (5)
- GitHub Check: clippy (launch/dynamo-run)
- GitHub Check: clippy (.)
- GitHub Check: clippy (lib/runtime/examples)
- GitHub Check: clippy (lib/bindings/python)
- GitHub Check: Build and Test - dynamo
🔇 Additional comments (2)
lib/llm/src/grpc/service/kserve.rs (1)
198-213
: Nice mirror of raw input semantics.Capturing
to_raw_output_contents
once and threading it through the extended tensor response cleanly aligns our behavior with Triton clients that always operate on raw buffers. This keeps the typed path untouched while guaranteeing raw outputs only when the request asked for them.lib/llm/src/grpc/service/tensor.rs (1)
500-524
: Solid raw-output encoder.The per-variant handling here preserves Triton’s little-endian expectations and keeps BYTES tensors length-prefixed, so clients reading
raw_output_contents
will get a faithful mirror of what they sent.
Signed-off-by: Guan Luo <[email protected]>
/ok to test ec12389 |
1 similar comment
/ok to test ec12389 |
…roper-response-for-data-type
/ok to test 7b964c0 |
Signed-off-by: Guan Luo [email protected]
Overview:
To support Triton Inference Server <-> Dynamo migration, this PR enhance Dynamo gRPC frontend to return output data in raw_output_contents as expected by Triton gRPC client.
Note that although Triton gRPC client always send / receive data via raw_input_contents / raw_output_contents field. For completeness, Dynamo gRPC frontend will return data in the same way as the input data, i.e. output data will be returned in raw_output_contents if and only if input data is received in raw_input_contents
Details:
Where should the reviewer start?
Related Issues: (use one of the action keywords Closes / Fixes / Resolves / Relates to)
Summary by CodeRabbit
New Features
Refactor
Tests