
Conversation

ziqifan617
Contributor

@ziqifan617 ziqifan617 commented Oct 15, 2025

Overview:

By advancing current_position and evaluated_blocks to the correct position, this PR will

  • avoid offloading redundant prefill blocks
  • fix CUDA graph hanging

Before this PR, when sending the same request twice (e.g. ISL 196, OSL 50), the expected behavior is: in the first request we offload 12 blocks (16 x 12 = 192), and in the second request no blocks should be offloaded, since all 12 blocks would be prefix cache hits and we do not offload decode blocks for vLLM.

However, the actual behavior is that we offload 3 blocks in the second request, since we do not advance current_position and evaluated_blocks by num_computed_tokens; instead, both start at 0.
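
To make the fix concrete, here is a minimal, self-contained sketch of the intended bookkeeping. This is an illustrative model, not the actual slot.rs code: the struct, method signature, and return value are assumptions; only the names current_position, evaluated_blocks, and num_computed_tokens come from this PR.

    // Illustrative model: seed the counters from num_computed_tokens so that
    // prefix-cache-hit blocks are never re-evaluated for offload.
    struct SlotState {
        block_size: usize,
        current_position: usize,
        evaluated_blocks: usize,
    }

    impl SlotState {
        // Returns the number of newly completed blocks that become offload candidates.
        fn apply_scheduler_output(&mut self, num_computed_tokens: usize, num_scheduled_tokens: usize) -> usize {
            // The fix: advance past tokens the engine already has cached. Without this,
            // both counters start at 0 on the second request and cached blocks are
            // offloaded again.
            self.current_position = self.current_position.max(num_computed_tokens);
            self.evaluated_blocks = self.evaluated_blocks.max(num_computed_tokens / self.block_size);

            self.current_position += num_scheduled_tokens;
            let full_blocks = self.current_position / self.block_size;
            let newly_evaluated = full_blocks.saturating_sub(self.evaluated_blocks);
            self.evaluated_blocks = full_blocks;
            newly_evaluated
        }
    }

    fn main() {
        // First request, ISL 196, block size 16: 12 full blocks (16 x 12 = 192) to offload.
        let mut slot = SlotState { block_size: 16, current_position: 0, evaluated_blocks: 0 };
        assert_eq!(slot.apply_scheduler_output(0, 196), 12);

        // Second identical request: 192 tokens are prefix cache hits and only 4 tokens are
        // scheduled, so no new full block is completed and nothing is offloaded.
        let mut slot = SlotState { block_size: 16, current_position: 0, evaluated_blocks: 0 };
        assert_eq!(slot.apply_scheduler_output(192, 4), 0);
    }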

closes DIS-824

Summary by CodeRabbit

  • Refactor

    • Renamed a parameter in a public interface for clarity and consistency; custom integrations implementing this interface may need minor updates.
  • Chores

    • Added an info-level log before applying scheduler output, reporting computed and scheduled token counts to aid observability.

@ziqifan617 ziqifan617 requested a review from a team as a code owner October 15, 2025 03:04

copy-pr-bot bot commented Oct 15, 2025

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

Contributor

coderabbitai bot commented Oct 15, 2025

Walkthrough

Adds an info log in KvConnectorLeader::build_connector_metadata and updates the Slot trait method apply_scheduler_output to use num_computed_tokens (renaming from _num_computed_tokens), with implementations adjusted to reference it and add related instrumentation around current_position and evaluated_blocks.

Changes

Leader connector logging
lib/bindings/python/rust/llm/block_manager/vllm/connector/leader.rs
Inserted an info log before applying scheduler output, logging num_computed_tokens and scheduled_tokens.

Slot trait param rename and instrumentation
lib/bindings/python/rust/llm/block_manager/vllm/connector/leader/slot.rs
Renamed the third parameter of Slot::apply_scheduler_output to num_computed_tokens (from _num_computed_tokens), with no type or position changes; updated implementations to use it and added instrumentation/logs around current_position and evaluated_blocks updates.
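
For reference, a standalone sketch of the renamed trait method. The parameter types and overall signature here are assumptions; only the rename from _num_computed_tokens to num_computed_tokens and the parameter names that appear in the new logs are taken from this PR.

    // Hypothetical shape of the trait after the rename; types are illustrative.
    pub trait Slot {
        fn apply_scheduler_output(
            &mut self,
            tokens: &[u32],
            block_ids: &[u64],
            // Previously `_num_computed_tokens` (accepted but unused); the rename
            // signals that implementations now read this value.
            num_computed_tokens: usize,
            num_scheduled_tokens: usize,
        );
    }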

Estimated code review effort

🎯 2 (Simple) | ⏱️ ~10 minutes

Poem

I tap my paws on silent logs, tick-tick, hooray!
Tokens counted, schedules set, we’re on our way.
A name unmasked—no underscore to hide—
Slots now speak with measured pride.
Hippity-hop, the ledger’s bright,
Bytes and burrows align just right.

Pre-merge checks

❌ Failed checks (3 warnings)
Docstring Coverage: ⚠️ Warning
Docstring coverage is 66.67%, which is below the required threshold of 80.00%. Resolution: run @coderabbitai generate docstrings to improve docstring coverage.

Title Check: ⚠️ Warning
The pull request title refers to avoiding the offload of redundant prefill blocks and fixing CUDA graph hanging, but the actual changes introduce logging in KvConnectorLeader and rename a trait parameter in apply_scheduler_output, so the title does not accurately summarize the implemented modifications. Resolution: update the title to describe the main changes, for example “chore: rename apply_scheduler_output parameter and add info logging of computed and scheduled tokens in KvConnectorLeader”.

Description Check: ⚠️ Warning
The pull request description includes an Overview and a related issue link but omits the required Details and “Where should the reviewer start” sections from the repository template, and the Related Issues entry does not follow the prescribed “closes GitHub issue: #xxx” format. Resolution: populate the Details section with a concise summary of the code changes, add a “Where should the reviewer start” section listing key files or functions to inspect, and reference actual GitHub issue numbers in the required “closes GitHub issue: #123” format.

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.


Comment @coderabbitai help to get the list of available commands and usage tips.

@ziqifan617
Contributor Author

/ok to test 9dea6e5

Contributor

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 1

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
lib/bindings/python/rust/llm/block_manager/vllm/connector/leader/slot.rs (1)

478-625: Add Rust unit tests for apply_scheduler_output in slot.rs
Integration tests exist in lib/bindings/python/tests/test_kvbm_vllm_integration.py, but please add Rust unit tests in slot.rs covering the points below (a sketch follows the list):

  • current_position does not regress when num_computed_tokens < current_position
  • evaluated_blocks does not decrease and redundant blocks are not offloaded twice
  • Redundant offloading is prevented under repeated scheduler applications
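
A self-contained sketch of what such an invariant test could look like. The SlotCounters model is a stand-in for illustration only; a real test would exercise VllmConnectorSlot::apply_scheduler_output in slot.rs, whose exact constructor and signature are not shown here.

    // Toy model mirroring the counter behavior described in the PR.
    struct SlotCounters {
        block_size: usize,
        current_position: usize,
        evaluated_blocks: usize,
    }

    impl SlotCounters {
        // Returns the number of newly completed blocks that become offload candidates.
        fn apply(&mut self, num_computed_tokens: usize, num_scheduled_tokens: usize) -> usize {
            self.current_position = self.current_position.max(num_computed_tokens);
            self.evaluated_blocks = self.evaluated_blocks.max(num_computed_tokens / self.block_size);
            self.current_position += num_scheduled_tokens;
            let full_blocks = self.current_position / self.block_size;
            let newly_evaluated = full_blocks.saturating_sub(self.evaluated_blocks);
            self.evaluated_blocks = full_blocks;
            newly_evaluated
        }
    }

    #[cfg(test)]
    mod tests {
        use super::*;

        #[test]
        fn counters_never_regress_and_blocks_are_not_reoffloaded() {
            let mut slot = SlotCounters { block_size: 16, current_position: 0, evaluated_blocks: 0 };
            // First pass over 196 prefill tokens evaluates 12 full blocks.
            assert_eq!(slot.apply(0, 196), 12);

            // Re-applying scheduler output with a smaller num_computed_tokens must not
            // move either counter backwards or produce new offload candidates.
            assert_eq!(slot.apply(64, 0), 0);
            assert!(slot.current_position >= 196);
            assert!(slot.evaluated_blocks >= 12);
        }
    }
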
🧹 Nitpick comments (5)
lib/bindings/python/rust/llm/block_manager/vllm/connector/leader.rs (1)

389-392: Consider using debug! level instead of info! for internal state logging.

This log tracks internal scheduler parameters and appears to be debugging instrumentation rather than a significant operational event. Info-level logs should be reserved for important state changes visible to operators.

Apply this diff to adjust the log level:

-            tracing::info!(
+            tracing::debug!(
                 "Applying scheduler output for num_computed_tokens: {}, scheduled_tokens: {}",
                 new_req.num_computed_tokens, scheduled_tokens
             );
lib/bindings/python/rust/llm/block_manager/vllm/connector/leader/slot.rs (4)

485-491: Consider using debug! level for internal parameter logging.

This log tracks function parameters and internal state, which is typically debug-level information rather than info-level.

Apply this diff:

-        tracing::info!(
+        tracing::debug!(
             "apply_scheduler_output: tokens: {}, block_ids: {}, num_computed_tokens: {}, num_scheduled_tokens: {}",
             tokens.len(),
             block_ids.len(),
             num_computed_tokens,
             num_scheduled_tokens
         );

493-497: Reduce log level to debug! for internal state tracking.

This log captures internal state before updates, which is debug-level information.

Apply this diff:

-        tracing::info!(
+        tracing::debug!(
             "before advancing current_position and evaluated_blocks: {}, {}",
             self.current_position,
             self.evaluated_blocks
         );

514-518: Reduce log level to debug! or trace! for internal state tracking.

Multiple info-level logs throughout the function track internal state and calculations. These should be debug-level to avoid log noise in production.

Apply this diff to all these log statements:

-        tracing::info!(
+        tracing::debug!(

Affected lines: 514, 528, 542, 547, 611

Also applies to: 528-533, 542-547, 611-611


1320-1323: Consider using debug! level for offload request logging.

While offload operations are significant, logging every offload request at info level may create excessive log volume. Consider using debug level unless this is intentionally a high-visibility operation.

Apply this diff:

-    tracing::info!(
+    tracing::debug!(
         "Processing offload request for request_id: {}, operation_id: {}, blocks: {:?}",
         request_id, operation_id, offload_req.block_ids,
     );
📜 Review details

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 6a1391e and 9dea6e5.

📒 Files selected for processing (2)
  • lib/bindings/python/rust/llm/block_manager/vllm/connector/leader.rs (1 hunks)
  • lib/bindings/python/rust/llm/block_manager/vllm/connector/leader/slot.rs (7 hunks)
🧰 Additional context used
🧬 Code graph analysis (2)
lib/bindings/python/rust/llm/block_manager/vllm/connector/leader.rs (1)
lib/bindings/python/rust/llm/block_manager/vllm.rs (1)
  • num_computed_tokens (113-118)
lib/bindings/python/rust/llm/block_manager/vllm/connector/leader/slot.rs (3)
lib/bindings/python/rust/llm/block_manager/vllm.rs (1)
  • num_computed_tokens (113-118)
lib/llm/src/mocker/sequence.rs (1)
  • len (103-105)
lib/llm/src/tokens.rs (1)
  • total_tokens (804-807)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (7)
  • GitHub Check: trtllm (arm64)
  • GitHub Check: vllm (arm64)
  • GitHub Check: sglang
  • GitHub Check: clippy (launch/dynamo-run)
  • GitHub Check: clippy (lib/bindings/python)
  • GitHub Check: clippy (.)
  • GitHub Check: Build and Test - dynamo
🔇 Additional comments (1)
lib/bindings/python/rust/llm/block_manager/vllm/connector/leader/slot.rs (1)

101-107: No action required for apply_scheduler_output: the sole Slot implementation (VllmConnectorSlot) in this module has been updated to the new parameter names; there are no other impls to adjust.

Signed-off-by: Ziqi Fan <[email protected]>
@ziqifan617
Contributor Author

/ok to test 63e4bd6

@ziqifan617 ziqifan617 changed the title from "fix: avoid offload redundent prefill blocks | fix cuda graph hanging" to "fix: avoid offload redundant prefill blocks | fix cuda graph hanging" on Oct 15, 2025
@oandreeva-nv
Contributor

> However, the actual behavior is that we offload 3 blocks in the second request, since we do not advance current_position and evaluated_blocks by num_computed_tokens; instead, both start at 0.

Can you clarify this a bit, please?

@ziqifan617
Contributor Author

> However, the actual behavior is that we offload 3 blocks in the second request, since we do not advance current_position and evaluated_blocks by num_computed_tokens; instead, both start at 0.
>
> Can you clarify this a bit, please?

let me explain offline
