fix: avoid offload redundant prefill blocks | fix cuda graph hanging #3632
base: main
Conversation
Signed-off-by: Ziqi Fan <[email protected]>
Walkthrough

Adds an info log in KvConnectorLeader::build_connector_metadata and updates the Slot trait method apply_scheduler_output to use num_computed_tokens (renamed from _num_computed_tokens), with implementations adjusted to reference it and with added instrumentation around current_position and evaluated_blocks.

Estimated code review effort: 🎯 2 (Simple) | ⏱️ ~10 minutes
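To make the rename concrete, here is a minimal sketch of the change the walkthrough describes, using a hypothetical `DemoSlot` stand-in (the real `Slot` trait and `VllmConnectorSlot` have more parameters and state than shown): the previously ignored `_num_computed_tokens` parameter becomes a used `num_computed_tokens` that advances `current_position` and `evaluated_blocks`.

```rust
// Sketch only: hypothetical simplified trait, assuming a block size of 16.
trait Slot {
    fn apply_scheduler_output(
        &mut self,
        tokens: &[u32],
        block_ids: &[usize],
        num_computed_tokens: usize, // was `_num_computed_tokens`, now read by impls
        num_scheduled_tokens: usize,
    );
}

struct DemoSlot {
    current_position: usize,
    evaluated_blocks: usize,
}

impl Slot for DemoSlot {
    fn apply_scheduler_output(
        &mut self,
        _tokens: &[u32],
        _block_ids: &[usize],
        num_computed_tokens: usize,
        _num_scheduled_tokens: usize,
    ) {
        // Advance monotonically so already-computed (cached) tokens are
        // accounted for and never re-evaluated.
        self.current_position = self.current_position.max(num_computed_tokens);
        self.evaluated_blocks = self.evaluated_blocks.max(num_computed_tokens / 16);
    }
}
```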
Pre-merge checks

❌ Failed checks (3 warnings)
/ok to test 9dea6e5
Actionable comments posted: 1
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
lib/bindings/python/rust/llm/block_manager/vllm/connector/leader/slot.rs (1)
478-625: Add Rust unit tests for apply_scheduler_output in slot.rs

Integration tests exist in lib/bindings/python/tests/test_kvbm_vllm_integration.py, but please add Rust unit tests in slot.rs covering:

- current_position does not regress when num_computed_tokens < current_position
- evaluated_blocks does not decrease and redundant blocks are not offloaded twice
- Redundant offloading is prevented under repeated scheduler applications
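The requested properties can be sketched as follows. This is a hypothetical stand-in, not the real VllmConnectorSlot: `SlotState`, `apply`, and `offloadable` are illustrative names, and a block size of 16 is assumed.

```rust
// Hypothetical model of the invariants the review asks to test.
struct SlotState {
    current_position: usize,
    evaluated_blocks: usize,
}

impl SlotState {
    /// Advance monotonically with num_computed_tokens; never regress.
    fn apply(&mut self, num_computed_tokens: usize) {
        self.current_position = self.current_position.max(num_computed_tokens);
        self.evaluated_blocks = self.evaluated_blocks.max(num_computed_tokens / 16);
    }

    /// Blocks newly eligible for offload since the last evaluation;
    /// a block already counted is never offloaded twice.
    fn offloadable(&mut self, total_full_blocks: usize) -> usize {
        let fresh = total_full_blocks.saturating_sub(self.evaluated_blocks);
        self.evaluated_blocks += fresh;
        fresh
    }
}
```

Actual unit tests would exercise `apply_scheduler_output` directly with token and block-id inputs, but the assertions would mirror these three invariants.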
🧹 Nitpick comments (5)
lib/bindings/python/rust/llm/block_manager/vllm/connector/leader.rs (1)

389-392: Consider using debug! level instead of info! for internal state logging.

This log tracks internal scheduler parameters and appears to be debugging instrumentation rather than a significant operational event. Info-level logs should be reserved for important state changes visible to operators.

Apply this diff to adjust the log level:

```diff
-        tracing::info!(
+        tracing::debug!(
             "Applying scheduler output for num_computed_tokens: {}, scheduled_tokens: {}",
             new_req.num_computed_tokens,
             scheduled_tokens
         );
```

lib/bindings/python/rust/llm/block_manager/vllm/connector/leader/slot.rs (4)
485-491: Consider using debug! level for internal parameter logging.

This log tracks function parameters and internal state, which is typically debug-level information rather than info-level.

Apply this diff:

```diff
-        tracing::info!(
+        tracing::debug!(
             "apply_scheduler_output: tokens: {}, block_ids: {}, num_computed_tokens: {}, num_scheduled_tokens: {}",
             tokens.len(),
             block_ids.len(),
             num_computed_tokens,
             num_scheduled_tokens
         );
```
493-497: Reduce log level to debug! for internal state tracking.

This log captures internal state before updates, which is debug-level information.

Apply this diff:

```diff
-        tracing::info!(
+        tracing::debug!(
             "before advancing current_position and evaluated_blocks: {}, {}",
             self.current_position,
             self.evaluated_blocks
         );
```
514-518: Reduce log level to debug! or trace! for internal state tracking.

Multiple info-level logs throughout the function track internal state and calculations. These should be debug-level to avoid log noise in production.

Apply this diff to all these log statements:

```diff
-        tracing::info!(
+        tracing::debug!(
```

Affected lines: 514, 528, 542, 547, 611

Also applies to: 528-533, 542-547, 611-611
1320-1323: Consider using debug! level for offload request logging.

While offload operations are significant, logging every offload request at info level may create excessive log volume. Consider using debug level unless this is intentionally a high-visibility operation.

Apply this diff:

```diff
-        tracing::info!(
+        tracing::debug!(
             "Processing offload request for request_id: {}, operation_id: {}, blocks: {:?}",
             request_id,
             operation_id,
             offload_req.block_ids,
         );
```
📜 Review details
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (2)

- lib/bindings/python/rust/llm/block_manager/vllm/connector/leader.rs (1 hunks)
- lib/bindings/python/rust/llm/block_manager/vllm/connector/leader/slot.rs (7 hunks)
🧰 Additional context used
🧬 Code graph analysis (2)
lib/bindings/python/rust/llm/block_manager/vllm/connector/leader.rs (1)

- lib/bindings/python/rust/llm/block_manager/vllm.rs (1): num_computed_tokens (113-118)

lib/bindings/python/rust/llm/block_manager/vllm/connector/leader/slot.rs (3)

- lib/bindings/python/rust/llm/block_manager/vllm.rs (1): num_computed_tokens (113-118)
- lib/llm/src/mocker/sequence.rs (1): len (103-105)
- lib/llm/src/tokens.rs (1): total_tokens (804-807)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (7)
- GitHub Check: trtllm (arm64)
- GitHub Check: vllm (arm64)
- GitHub Check: sglang
- GitHub Check: clippy (launch/dynamo-run)
- GitHub Check: clippy (lib/bindings/python)
- GitHub Check: clippy (.)
- GitHub Check: Build and Test - dynamo
🔇 Additional comments (1)
lib/bindings/python/rust/llm/block_manager/vllm/connector/leader/slot.rs (1)
101-107: No action required for apply_scheduler_output: the sole Slot implementation (VllmConnectorSlot) in this module has been updated to the new parameter names; there are no other impls to adjust.
Signed-off-by: Ziqi Fan <[email protected]>
/ok to test 63e4bd6
Can you clarify this a bit, please?
Let me explain offline.
Overview:

By advancing current_position and evaluated_blocks to the correct position, this PR avoids offloading redundant prefill blocks.

Before this PR, when sending the same request twice (e.g. ISL 196, OSL 50), the expected behavior is: in the first request we offload 12 blocks (16 x 12 = 192), and no blocks should be offloaded in the second request, because all 12 blocks would be a prefix cache hit and we do not offload decode blocks for vLLM.

However, the actual behavior is that in the second request we would offload 3 blocks, because we do not advance current_position and evaluated_blocks with num_computed_tokens accordingly, so current_position and evaluated_blocks start at 0.

closes DIS-824
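The prefill arithmetic above can be sketched as follows. This is a simplified model, not the actual connector code: `offload_count` is a hypothetical helper, a block size of 16 is assumed, and the model deliberately ignores the decode-side interaction that produces the observed 3-block count before the fix.

```rust
// Seeding evaluated_blocks from num_computed_tokens (the fix) yields zero
// offloads on a full prefix-cache-hit resend; seeding from 0 (the old
// behavior) would re-offload already-offloaded prefill blocks.
fn offload_count(isl: usize, num_computed_tokens: usize, block_size: usize) -> usize {
    let full_blocks = isl / block_size; // e.g. 196 / 16 = 12 full prefill blocks
    let evaluated_blocks = num_computed_tokens / block_size; // the fix: start from cached tokens
    full_blocks.saturating_sub(evaluated_blocks)
}
```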