Conversation

@fsx950223 commented Jan 14, 2026

Motivation

KV_BLOCK_SIZE=1024 is supported by the cache layout, but the PS (partitioned-softmax) decode path previously assumed smaller KV block sizes and could:

  • Produce incorrect results / NaNs for block_size=1024
  • Hit GPU memory access faults when sliding_window>0
  • Fail to compile the PS reduce kernel for large context_partition_num due to Triton tensor size limits

This PR makes PS decode robust for KV_BLOCK_SIZE=1024 and fixes PS reduction compilation/resource issues.

Technical Details

1) paged_attention_decode_sliding_window: add KV_BLOCK_SIZE=1024 support

  • Allow KV_BLOCK_SIZE in [16, 64, 1024].
  • For KV_BLOCK_SIZE==1024, treat the KV page as 4 tiles of 256 tokens (see the sketch after this list):
    • KV_COMPUTE_BLOCK_SIZE = CONTEXT_PARTITION_SIZE (256)
    • Compute a per-partition page_offset ∈ {0, 256, 512, 768} and apply it to:
      • key/value loads
      • per-token KV scale loads
  • Use the runtime stride_key_block_elem when stepping through KV elements, so addressing matches the actual key cache layout.
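
The tiling boils down to a per-partition offset into the 1024-token page. A minimal sketch in plain Python, assuming hypothetical names (tile_offsets, the stride value); the real kernel expresses this with Triton index arithmetic:

```python
KV_BLOCK_SIZE = 1024
CONTEXT_PARTITION_SIZE = 256  # == KV_COMPUTE_BLOCK_SIZE for 1024-token pages
TILES_PER_PAGE = KV_BLOCK_SIZE // CONTEXT_PARTITION_SIZE  # 4

def tile_offsets(partition_idx: int, stride_key_block_elem: int):
    """Map a context partition to its 256-token tile inside the KV page."""
    page_offset = (partition_idx % TILES_PER_PAGE) * CONTEXT_PARTITION_SIZE
    # key/value loads and per-token KV scale loads all start page_offset
    # tokens into the page; element addressing uses the runtime stride so
    # it matches the actual key cache layout.
    return page_offset, page_offset * stride_key_block_elem

for p in range(4):
    print(tile_offsets(p, stride_key_block_elem=128))
# (0, 0), (256, 32768), (512, 65536), (768, 98304)
```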

2) PS wrapper fixes

  • Correctly set one-shot mode for PS decode:
    • pass ONE_SHOT=(num_splits <= 1) into paged_attention_decode_sliding_window
    • fixes crashes and incorrect behavior when only one split is used.
  • Tune launch parameters for stability and performance (sketched below):
    • KV_BLOCK_SIZE==1024: waves_per_eu=1
    • otherwise: waves_per_eu=4
    • use num_stages=1
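
A hedged sketch of the wrapper-side selection; the helper and its signature are hypothetical, only the parameter values come from this PR:

```python
def ps_launch_params(num_splits: int, kv_block_size: int) -> dict:
    """Hypothetical helper mirroring the PS wrapper's launch-time choices."""
    return dict(
        ONE_SHOT=(num_splits <= 1),  # a single split runs in one-shot mode
        waves_per_eu=1 if kv_block_size == 1024 else 4,
        num_stages=1,
    )
```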

3) PS reduce kernel: avoid Triton numel limit and shared memory overflow

  • paged_attention_decode_ps_reduce_kernel now reduces partitions in chunks (a two-pass reduction) instead of materializing tensors sized by next_power_of_2(context_partition_num); a NumPy sketch of the chunked merge follows this list.
  • Cap the chunk size at 8 partitions:
    • avoids ValueError('numel (...) exceeds triton maximum tensor numel (1048576)')
    • avoids shared-memory overflow for common configs (e.g. qg=64, head=128).
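
The chunked merge is the standard online-softmax combination applied MAX_CHUNK partitions at a time. An illustrative NumPy sketch, with hypothetical names and assuming unnormalized partials; the real kernel does this in Triton:

```python
import numpy as np

MAX_CHUNK = 8  # chunk-size cap from this PR

def reduce_partitions(m, l, acc):
    """m: (P,) per-partition maxima, l: (P,) sums of exp,
    acc: (P, D) unnormalized partial outputs."""
    g_m, g_l, g_acc = -np.inf, 0.0, np.zeros(acc.shape[1])
    for start in range(0, len(m), MAX_CHUNK):
        cm = m[start:start + MAX_CHUNK]
        cl = l[start:start + MAX_CHUNK]
        ca = acc[start:start + MAX_CHUNK]
        new_m = max(g_m, cm.max())     # running max across chunks
        r = np.exp(cm - new_m)         # rescale this chunk's partials
        g_scale = np.exp(g_m - new_m)  # rescale the running result
        g_acc = g_acc * g_scale + (r[:, None] * ca).sum(axis=0)
        g_l = g_l * g_scale + (r * cl).sum()
        g_m = new_m
    return g_acc / g_l

# Every intermediate tensor is sized by MAX_CHUNK, not by
# next_power_of_2(context_partition_num), regardless of context length.
```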

Test Plan

  • op_tests/triton_tests/test_pa_decode_gluon.py:
    • block_size=1024, context_partition_size=256, kv_varlen=True, trans_v=False
    • verify sliding_window=0 and sliding_window=128
    • verify batch_size=1 and batch_size=128
  • Regression sanity check:
    • spot-check the PS path with block_size=16 using the same harness.

Test Result

  • All of the above tests passed locally.

Submission Checklist

Copilot AI left a comment
Pull request overview

This pull request fixes sliding window attention with Multi-Token Processing (MTP) in the paged attention decode implementation, adding support for KV_BLOCK_SIZE=1024 and improving the sliding window causal masking logic.

Changes:

  • Added support for KV_BLOCK_SIZE=1024 in sliding window kernels with appropriate page offset calculations and windowing masks
  • Fixed causal masking for sliding window to correctly handle per-query-position windows (see the sketch after this list)
  • Reorganized kernel code for better performance by moving initialization earlier and consolidating the PS path
  • Reduced MAX_CONTEXT_PARTITION_NUM from 16 to 8 to avoid exceeding shared memory limits
  • Expanded test coverage for sliding window scenarios
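
A minimal sketch of per-query-position sliding-window masking, in plain Python with illustrative names; the kernel evaluates this vectorized over a tile:

```python
def visible(q_pos: int, kv_pos: int, sliding_window: int) -> bool:
    """A KV token is attended iff it is causal and inside the window of
    this specific query position, not one window shared by the tile."""
    causal = kv_pos <= q_pos
    in_window = sliding_window <= 0 or kv_pos > q_pos - sliding_window
    return causal and in_window
```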

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 1 comment.

  • op_tests/triton_tests/test_pa_decode_gluon.py: tightened the diff tolerance from 8e-2 to 5e-2 and expanded test coverage with additional head dimensions, quantization modes, and configurations.
  • aiter/ops/triton/gluon/pa_decode_gluon.py: added KV_BLOCK_SIZE=1024 support with page offset handling, fixed sliding window causal masking, reorganized initialization code, reduced MAX_CONTEXT_PARTITION_NUM to 8, and moved the PS kernel path to the top of the wrapper.
