
Conversation

Collaborator

@LucasWilkinson LucasWilkinson commented Nov 28, 2025

With #28579 we pad attention metadata before building it; in the case of a uniform-decode batch we want to make sure that num_decodes matches the cudagraph size, so that attention backends receive the same batch size the graph was captured with. Currently we work around this in all the existing attention backends (e.g. vllm-project/FlashMLA#3), since we used to pad for attention after building the attention metadata, so this was always the case. But this is needed for #27532, since the FlashMLA FP8 sparse kernels do not have this workaround yet.
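As a minimal sketch of the padding constraint described above (the capture sizes and function name here are illustrative, not vLLM's actual values or API):

```python
# Hedged sketch: pad a uniform-decode batch up to the nearest captured
# cudagraph size, so num_decodes matches the batch size the graph was
# captured with. The capture sizes below are hypothetical.
CUDAGRAPH_CAPTURE_SIZES = [1, 2, 4, 8, 16, 32]

def padded_batch_size(num_decodes: int) -> int:
    # Pick the smallest captured size that fits the real decode count.
    for size in CUDAGRAPH_CAPTURE_SIZES:
        if size >= num_decodes:
            return size
    # Larger than any captured graph: run unpadded (eager fallback).
    return num_decodes

print(padded_batch_size(3))   # 4
print(padded_batch_size(33))  # 33
```

The extra `padded_batch_size(n) - n` slots are the zero-length padded requests that the splitting logic below must now classify as decodes.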

The | (query_lens == 0) is removed from:

        is_prefill = (query_lens > decode_threshold) | (query_lens == 0)

as a small cleanup, since it is not actually required: in a full-decode batch we do want to treat the zero-length (padded) entries as decodes, and if there is a prefill in the batch, it will be detected as the first prefill, so the trailing zero-length entries will be ignored anyway.
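The behavior described above can be sketched as follows (a simplified, hypothetical stand-in for vLLM's `split_decodes_and_prefills`, not the actual implementation):

```python
# Hedged sketch: split a batch into a leading run of decodes and a trailing
# run of prefills. Padded requests have query_len == 0 and, without the
# removed "(query_lens == 0)" clause, are no longer forced to be prefills.
def split_decodes_and_prefills(query_lens, decode_threshold=1):
    # A request is a prefill only if its query length exceeds the decode
    # threshold; zero-length padded entries therefore count as decodes.
    is_prefill = [q > decode_threshold for q in query_lens]
    # Everything before the first prefill is a decode. In a full-decode
    # batch (no prefills), padding keeps num_decodes at the padded
    # (cudagraph) size; in a mixed batch, padding sits after the first
    # prefill and is absorbed into the prefill tail anyway.
    first_prefill = (
        is_prefill.index(True) if True in is_prefill else len(query_lens)
    )
    num_decodes = first_prefill
    num_prefills = len(query_lens) - first_prefill
    return num_decodes, num_prefills

# Full-decode batch padded to size 4: padding is counted as decodes.
print(split_decodes_and_prefills([1, 1, 0, 0]))  # (2 real + 2 padded, 0)
# Mixed batch: padding after the first prefill lands in the prefill count.
print(split_decodes_and_prefills([1, 1, 8, 0]))
```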

Test Plan:

CI

Signed-off-by: Lucas Wilkinson <[email protected]>
@mergify mergify bot added the v1 label Nov 28, 2025
Contributor

@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request aims to add support for padded requests in split_decodes_and_prefills when require_uniform=True, which is important for cudagraph compatibility. The changes involve updating the splitting logic and adding a corresponding test case. While the intention is correct, I've found a critical issue in the implementation for detecting padded uniform batches. The current logic can fail if the first request is a padding request, leading to incorrect batch splitting. I've provided a detailed comment with a suggested fix to make the logic more robust. The rest of the changes look good.

@LucasWilkinson LucasWilkinson added the ready ONLY add when PR is ready to merge/full CI is needed label Nov 28, 2025