[Attention] Make split_decodes_and_prefills(..., require_uniform=True) support padding
#29644
+31
−4
With #28579 we pad attention metadata before building it; in the case of a uniform decode batch we want to make sure that `num_decodes` matches the cudagraph size, so that attention schedulers receive the same batch size the graph was captured with. Currently we have to work around this in all the existing attention backends (e.g. vllm-project/FlashMLA#3), since we used to pad for attention after building the attention metadata, so this was always the case. But this change is needed for #27532, since the FlashMLA FP8 sparse kernels do not have this workaround yet.
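As a rough illustration, here is a minimal sketch of a padding-aware split under `require_uniform=True`. This is an assumption-laden sketch, not vLLM's actual implementation: the `num_reqs_padded` parameter and the simplified return value are invented for illustration.

```python
# Minimal sketch (not vLLM's actual implementation) of the padding-aware
# split: when the batch is uniform decode, report the *padded* batch size
# as num_decodes so attention kernels see the size the graph was captured
# with. `num_reqs_padded` is an illustrative parameter, not the real API.
import numpy as np

def split_decodes_and_prefills_sketch(
    query_lens: np.ndarray,   # per-request query lengths
    num_reqs: int,            # number of real requests in the batch
    num_reqs_padded: int,     # batch size after cudagraph padding (>= num_reqs)
    decode_threshold: int = 1,
    require_uniform: bool = True,
) -> tuple[int, int]:
    """Return (num_decodes, num_prefills); padded slots count as decodes."""
    is_decode = query_lens[:num_reqs] <= decode_threshold
    if require_uniform and bool(is_decode.all()):
        # Uniform decode batch: pad num_decodes up to the cudagraph size.
        return num_reqs_padded, 0
    # Otherwise split at the first prefill; everything after it (including
    # any zero-length padded entries) is handled on the prefill side.
    num_decodes = num_reqs if bool(is_decode.all()) else int(np.argmin(is_decode))
    return num_decodes, num_reqs - num_decodes
```

For example, a uniform decode batch of 3 real requests padded to a captured graph size of 4 would report `num_decodes == 4`, matching what the graph was captured with.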
The `| (query_lens == 0)` check is also removed as a small cleanup, since it is not actually required: we do want to treat zero-length entries as decodes in the case of full-decode batches, and if there is a prefill, the split is computed at the first prefill, so these entries are ignored anyway.
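A small illustrative check of this rationale (the decode classification via `decode_threshold` is an assumption for the sketch, not the actual helper):

```python
# Illustrative sketch of why the explicit `| (query_lens == 0)` clause is
# redundant: zero-length entries already satisfy
# `query_lens <= decode_threshold`, so a full-decode batch classifies them
# as decodes, and in a mixed batch the split index is the first prefill,
# so anything after it lands on the prefill side regardless.
import numpy as np

decode_threshold = 1

# Full-decode batch with trailing padded (zero-length) entries.
query_lens = np.array([1, 1, 1, 0, 0])
is_decode = query_lens <= decode_threshold
assert is_decode.all()  # padded entries are already treated as decodes

# Mixed batch: the split happens at the first prefill.
query_lens = np.array([1, 1, 8, 16])
is_decode = query_lens <= decode_threshold
first_prefill = int(np.argmin(is_decode))  # index of first non-decode
assert first_prefill == 2
```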
Test Plan:
CI