fix: avoid offload redundant prefill blocks | fix cuda graph hanging #3632
base: main
Conversation
Signed-off-by: Ziqi Fan <[email protected]>
Walkthrough

Adds an info log in KvConnectorLeader::build_connector_metadata and updates the Slot trait method apply_scheduler_output to use num_computed_tokens (renamed from _num_computed_tokens), with implementations adjusted to reference it and with added instrumentation around current_position and evaluated_blocks.

Estimated code review effort: 🎯 2 (Simple) | ⏱️ ~10 minutes
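To make the rename concrete, here is a minimal sketch of the change the walkthrough describes, using a hypothetical `DemoSlot` stand-in (the real `Slot` trait and `VllmConnectorSlot` have more parameters and state than shown): the previously ignored `_num_computed_tokens` parameter becomes a used `num_computed_tokens` that advances `current_position` and `evaluated_blocks`.

```rust
// Sketch only: hypothetical simplified trait, assuming a block size of 16.
trait Slot {
    fn apply_scheduler_output(
        &mut self,
        tokens: &[u32],
        block_ids: &[usize],
        num_computed_tokens: usize, // was `_num_computed_tokens`, now read by impls
        num_scheduled_tokens: usize,
    );
}

struct DemoSlot {
    current_position: usize,
    evaluated_blocks: usize,
}

impl Slot for DemoSlot {
    fn apply_scheduler_output(
        &mut self,
        _tokens: &[u32],
        _block_ids: &[usize],
        num_computed_tokens: usize,
        _num_scheduled_tokens: usize,
    ) {
        // Advance monotonically so already-computed (cached) tokens are
        // accounted for and never re-evaluated.
        self.current_position = self.current_position.max(num_computed_tokens);
        self.evaluated_blocks = self.evaluated_blocks.max(num_computed_tokens / 16);
    }
}
```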
Pre-merge checks

❌ Failed checks (3 warnings)
/ok to test 9dea6e5
Actionable comments posted: 1
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
lib/bindings/python/rust/llm/block_manager/vllm/connector/leader/slot.rs (1)
478-625: Add Rust unit tests for apply_scheduler_output in slot.rs

Integration tests exist in lib/bindings/python/tests/test_kvbm_vllm_integration.py, but please add Rust unit tests in slot.rs covering:

- current_position does not regress when num_computed_tokens < current_position
- evaluated_blocks does not decrease and redundant blocks are not offloaded twice
- Redundant offloading is prevented under repeated scheduler applications
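The requested properties can be sketched as follows. This is a hypothetical stand-in, not the real VllmConnectorSlot: `SlotState`, `apply`, and `offloadable` are illustrative names, and a block size of 16 is assumed.

```rust
// Hypothetical model of the invariants the review asks to test.
struct SlotState {
    current_position: usize,
    evaluated_blocks: usize,
}

impl SlotState {
    /// Advance monotonically with num_computed_tokens; never regress.
    fn apply(&mut self, num_computed_tokens: usize) {
        self.current_position = self.current_position.max(num_computed_tokens);
        self.evaluated_blocks = self.evaluated_blocks.max(num_computed_tokens / 16);
    }

    /// Blocks newly eligible for offload since the last evaluation;
    /// a block already counted is never offloaded twice.
    fn offloadable(&mut self, total_full_blocks: usize) -> usize {
        let fresh = total_full_blocks.saturating_sub(self.evaluated_blocks);
        self.evaluated_blocks += fresh;
        fresh
    }
}
```

Actual unit tests would exercise `apply_scheduler_output` directly with token and block-id inputs, but the assertions would mirror these three invariants.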
🧹 Nitpick comments (5)
lib/bindings/python/rust/llm/block_manager/vllm/connector/leader.rs (1)

389-392: Consider using debug! level instead of info! for internal state logging.

This log tracks internal scheduler parameters and appears to be debugging instrumentation rather than a significant operational event. Info-level logs should be reserved for important state changes visible to operators.

Apply this diff to adjust the log level:

```diff
-        tracing::info!(
+        tracing::debug!(
             "Applying scheduler output for num_computed_tokens: {}, scheduled_tokens: {}",
             new_req.num_computed_tokens,
             scheduled_tokens
         );
```

lib/bindings/python/rust/llm/block_manager/vllm/connector/leader/slot.rs (4)
485-491: Consider using debug! level for internal parameter logging.

This log tracks function parameters and internal state, which is typically debug-level information rather than info-level.

Apply this diff:

```diff
-        tracing::info!(
+        tracing::debug!(
             "apply_scheduler_output: tokens: {}, block_ids: {}, num_computed_tokens: {}, num_scheduled_tokens: {}",
             tokens.len(),
             block_ids.len(),
             num_computed_tokens,
             num_scheduled_tokens
         );
```
493-497: Reduce log level to debug! for internal state tracking.

This log captures internal state before updates, which is debug-level information.

Apply this diff:

```diff
-        tracing::info!(
+        tracing::debug!(
             "before advancing current_position and evaluated_blocks: {}, {}",
             self.current_position,
             self.evaluated_blocks
         );
```
514-518: Reduce log level to debug! or trace! for internal state tracking.

Multiple info-level logs throughout the function track internal state and calculations. These should be debug-level to avoid log noise in production.

Apply this diff to all these log statements:

```diff
-        tracing::info!(
+        tracing::debug!(
```

Affected lines: 514, 528, 542, 547, 611

Also applies to: 528-533, 542-547, 611-611
1320-1323: Consider using debug! level for offload request logging.

While offload operations are significant, logging every offload request at info level may create excessive log volume. Consider using debug level unless this is intentionally a high-visibility operation.

Apply this diff:

```diff
-        tracing::info!(
+        tracing::debug!(
             "Processing offload request for request_id: {}, operation_id: {}, blocks: {:?}",
             request_id,
             operation_id,
             offload_req.block_ids,
         );
```
📜 Review details
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (2)

- lib/bindings/python/rust/llm/block_manager/vllm/connector/leader.rs (1 hunks)
- lib/bindings/python/rust/llm/block_manager/vllm/connector/leader/slot.rs (7 hunks)
🧰 Additional context used
🧬 Code graph analysis (2)
lib/bindings/python/rust/llm/block_manager/vllm/connector/leader.rs (1)

- lib/bindings/python/rust/llm/block_manager/vllm.rs (1): num_computed_tokens (113-118)

lib/bindings/python/rust/llm/block_manager/vllm/connector/leader/slot.rs (3)

- lib/bindings/python/rust/llm/block_manager/vllm.rs (1): num_computed_tokens (113-118)
- lib/llm/src/mocker/sequence.rs (1): len (103-105)
- lib/llm/src/tokens.rs (1): total_tokens (804-807)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (7)
- GitHub Check: trtllm (arm64)
- GitHub Check: vllm (arm64)
- GitHub Check: sglang
- GitHub Check: clippy (launch/dynamo-run)
- GitHub Check: clippy (lib/bindings/python)
- GitHub Check: clippy (.)
- GitHub Check: Build and Test - dynamo
🔇 Additional comments (1)
lib/bindings/python/rust/llm/block_manager/vllm/connector/leader/slot.rs (1)
101-107: No action required for apply_scheduler_output: the sole Slot implementation (VllmConnectorSlot) in this module has been updated to the new parameter names; there are no other impls to adjust.
Signed-off-by: Ziqi Fan <[email protected]>
/ok to test 63e4bd6
Can you clarify this a bit, please?
Let me explain offline.
Overview:

By advancing current_position and evaluated_blocks to the correct position, this PR avoids offloading redundant prefill blocks.

Before this PR, when sending the same request twice (e.g. ISL 196, OSL 50), the expected behavior is: in the first request we offload 12 blocks (16 x 12 = 192), and no blocks should be offloaded in the second request, because all 12 blocks would be a prefix cache hit and we do not offload decode blocks for vLLM.

However, the actual behavior is that in the second request we would offload 3 blocks, because we do not advance current_position and evaluated_blocks with num_computed_tokens accordingly, so current_position and evaluated_blocks start at 0.

closes DIS-824
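The prefill arithmetic above can be sketched as follows. This is a simplified model, not the actual connector code: `offload_count` is a hypothetical helper, a block size of 16 is assumed, and the model deliberately ignores the decode-side interaction that produces the observed 3-block count before the fix.

```rust
// Seeding evaluated_blocks from num_computed_tokens (the fix) yields zero
// offloads on a full prefix-cache-hit resend; seeding from 0 (the old
// behavior) would re-offload already-offloaded prefill blocks.
fn offload_count(isl: usize, num_computed_tokens: usize, block_size: usize) -> usize {
    let full_blocks = isl / block_size; // e.g. 196 / 16 = 12 full prefill blocks
    let evaluated_blocks = num_computed_tokens / block_size; // the fix: start from cached tokens
    full_blocks.saturating_sub(evaluated_blocks)
}
```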