-
Notifications
You must be signed in to change notification settings - Fork 641
feat: trtllm conditional disaggregation #3640
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
feat: trtllm conditional disaggregation #3640
Conversation
👋 Hi shpgy-shpgy! Thank you for contributing to ai-dynamo/dynamo. Just a reminder: The 🚀 |
WalkthroughAdds engine-specific disaggregation_mode to runtime config, introduces environment-driven conditional disaggregation for short-prefill in DecodeHandler and falls back to local prefill on errors, changes handler_base to log instead of raising when disaggregated params are missing in Decode, and gates KV-router worker logit insertion by an optional ISL threshold. Changes
Sequence Diagram(s)sequenceDiagram
autonumber
actor Client
participant DecodeHandler
participant RemotePrefill as Remote Prefill
participant LocalPrefill as Local Prefill
Client->>DecodeHandler: generate(request)
DecodeHandler->>DecodeHandler: read env/config (conditional disaggregation, threshold)
DecodeHandler->>DecodeHandler: compute ISL token count
alt conditional enabled AND ISL <= threshold
DecodeHandler->>LocalPrefill: prefill locally (short prefill path)
LocalPrefill-->>DecodeHandler: local prefill result
DecodeHandler-->>Client: continue decode with local prefill
else
DecodeHandler->>RemotePrefill: prefill remotely (single-response path)
RemotePrefill-->>DecodeHandler: response (or error)
alt remote returned error
DecodeHandler->>LocalPrefill: fallback to local prefill
LocalPrefill-->>DecodeHandler: local prefill result
DecodeHandler-->>Client: continue decode with local prefill
else
DecodeHandler-->>Client: continue decode with remote prefill (propagate disaggregated_params if present)
end
end
sequenceDiagram
autonumber
participant Router as KV Router
participant Worker[i] as Worker[i]
Note over Router: If KV_ROUTER_USE_ISL_THRESHOLD=true<br/>use KV_ROUTER_ISL_THRESHOLD
loop for each worker i
Router->>Worker[i]: read runtime data (disaggregation_mode)
Router->>Router: evaluate ISL tokens vs threshold and worker mode
alt worker passes gating
Router->>Router: insert worker logit
else
Router->>Router: skip worker logit
end
end
Router->>Router: update max_logit, compute softmax, select
Estimated code review effort🎯 3 (Moderate) | ⏱️ ~25 minutes Poem
Pre-merge checks❌ Failed checks (2 warnings)
✅ Passed checks (1 passed)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
7db771a
to
79eac2a
Compare
79eac2a
to
fca518c
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 3
🧹 Nitpick comments (3)
components/src/dynamo/trtllm/request_handlers/handlers.py (3)
4-7
: Remove unusedasyncio
import.The
asyncio
import on line 7 is not used in the visible code changes. Remove it to keep imports clean, unless it's used elsewhere in the file not shown in this diff.Run this script to verify
asyncio
usage:#!/bin/bash # Check if asyncio is used in handlers.py rg -n '\basyncio\.' components/src/dynamo/trtllm/request_handlers/handlers.py
263-271
: Clarify the short prefill local handling path.When
isl_tokens <= threshold
and conditional disaggregation is enabled, the code logs "Short prefill, handled locally" but doesn't explicitly skip the remote prefill call—it's implied by the else block. Consider adding an explicit early continuation or restructuring for clarity.Consider this more explicit structure:
if isl_tokens <= threshold and use_conditional_disaggregation: # Short prefill, handled locally - skip remote prefill logging.info("Short prefill (isl_tokens=%d <= threshold=%d), handled locally", isl_tokens, threshold) # Continue to local generation below else: # Long prefill, route to remote prefill worker async for res in self.remote_prefill(request, context): prefill_response = res response_count += 1 if response_count > 1: raise ValueError("Prefill response should be generated only once.")
271-271
: Use a custom exception class for single-response violations.The static analysis tool flags this line for specifying a long message directly in the exception. Consider defining a custom exception class to encapsulate this validation logic.
Based on static analysis hints, consider this refactor:
class PrefillResponseError(Exception): """Raised when prefill returns multiple responses when only one is expected.""" def __init__(self): super().__init__("Prefill response should be generated only once.")Then use:
raise PrefillResponseError()
📜 Review details
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (4)
components/src/dynamo/trtllm/main.py
(2 hunks)components/src/dynamo/trtllm/request_handlers/handler_base.py
(1 hunks)components/src/dynamo/trtllm/request_handlers/handlers.py
(4 hunks)lib/llm/src/kv_router/scheduler.rs
(2 hunks)
🧰 Additional context used
🧬 Code graph analysis (1)
components/src/dynamo/trtllm/request_handlers/handlers.py (1)
components/src/dynamo/trtllm/request_handlers/handler_base.py (3)
RequestHandlerConfig
(59-76)DisaggregationStrategy
(53-55)generate_locally
(151-309)
🪛 GitHub Actions: Pre Merge Validation of (ai-dynamo/dynamo/refs/pull/3640/merge) by shpgy-shpgy.
components/src/dynamo/trtllm/request_handlers/handlers.py
[error] 1-1: isort: files were modified by this hook during pre-commit
[error] 1-1: black: reformatted components/src/dynamo/trtllm/request_handlers/handlers.py
components/src/dynamo/trtllm/request_handlers/handler_base.py
[error] 1-1: black: reformatted components/src/dynamo/trtllm/request_handlers/handler_base.py
components/src/dynamo/trtllm/main.py
[error] 1-1: isort: files were modified by this hook during pre-commit
[error] 1-1: ruff: Found 1 error (1 fixed, 0 remaining) during pre-commit
🪛 Ruff (0.14.0)
components/src/dynamo/trtllm/request_handlers/handlers.py
271-271: Avoid specifying long messages outside the exception class
(TRY003)
components/src/dynamo/trtllm/main.py
10-10: Redefinition of unused json
from line 5
Remove definition: json
(F811)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (2)
- GitHub Check: clippy (.)
- GitHub Check: Build and Test - dynamo
🔇 Additional comments (6)
components/src/dynamo/trtllm/main.py (1)
329-331
: LGTM! Engine-specific disaggregation mode configuration.The disaggregation_mode is correctly serialized as JSON and stored in the runtime configuration, enabling per-worker mode detection in the scheduler.
components/src/dynamo/trtllm/request_handlers/handler_base.py (1)
207-208
: LGTM! Graceful degradation for missing disaggregated params.The change from raising a
ValueError
to logging an informational message enables Decode workers to fall back to local prefill when disaggregated params are missing. This aligns with the conditional disaggregation feature, allowing flexible routing based on request characteristics.components/src/dynamo/trtllm/request_handlers/handlers.py (3)
195-207
: LGTM! Clear environment variable handling with fallbacks.The environment variable reading logic is well-structured with proper error handling and fallback to config or default values. The warning message for invalid threshold values helps with debugging.
255-260
: Verify prefill_tokens assignment and token_ids type coverage
- No occurrences of setting
prefill_tokens
in the codebase—request.get("prefill_tokens")
always falls back to 0.isinstance(token_ids, (list, tuple))
excludes other sequence types (e.g. numpy.ndarray, torch.Tensor).
Confirm how upstream populates these fields and extend the logic to cover all expected input types.
280-287
: Review silent fallback for remote prefill errorsThe code now logs errors and always falls back to local prefill instead of propagating failures to clients, masking every error case. Confirm this aligns with intended UX and SLAs. Consider emitting a metric/counter for prefill failures and selectively surfacing critical errors (e.g. auth/authz) while silencing capacity or timeout issues.
lib/llm/src/kv_router/scheduler.rs (1)
542-542
: max_logit update is unused
max_logit
is never read after line 542; downstream logic usesworker_logits
directly.
1c8d03a
to
9382b3b
Compare
155d865
to
368c0c0
Compare
Signed-off-by: shpgy-shpgy <[email protected]>
Signed-off-by: shpgy-shpgy <[email protected]>
Signed-off-by: shpgy-shpgy <[email protected]>
Signed-off-by: shpgy-shpgy <[email protected]>
Signed-off-by: shpgy-shpgy <[email protected]>
Signed-off-by: shpgy-shpgy <[email protected]>
368c0c0
to
e4fdd1a
Compare
format format Signed-off-by: shpgy-shpgy <[email protected]> format Signed-off-by: shpgy-shpgy <[email protected]> format Signed-off-by: shpgy-shpgy <[email protected]> format format Signed-off-by: shpgy-shpgy <[email protected]> format Signed-off-by: shpgy-shpgy <[email protected]>
e4fdd1a
to
a46acac
Compare
Overview:
Add support for Conditional Disaggregation to TensorRT-LLM.
Route to PD worker or non-PD worker depending on input request length.
Where should the reviewer start?
components/src/dynamo/trtllm/request_handlers/handlers.py
lib/llm/src/kv_router/scheduler.rs
Summary by CodeRabbit
New Features
Bug Fixes