mla ps support page64 and add paged 3buffer layout for ds3.2 #1875
base: main
Conversation
Pull request overview
This pull request adds support for a paged 3-buffer layout to the MLA (Multi-head Latent Attention) persistent kernel implementation for DeepSeek v3.2. The 3-buffer layout uses separate buffers for FP8-quantized nope data, FP32 scale factors, and BF16 rope data, all packed together as uint8 bytes.
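As a concrete illustration of that packing, here is a minimal Python sketch; the helper name, per-page granularity, and buffer order are assumptions for illustration and are not taken from the PR:

```python
import torch

def pack_kv_page(nope_fp8: torch.Tensor, scale_fp32: torch.Tensor,
                 rope_bf16: torch.Tensor) -> torch.Tensor:
    """Pack one KV-cache page's three buffers into a flat uint8 tensor.

    Hypothetical sketch: the actual field order and alignment used by the
    kernel may differ.
    """
    parts = [
        nope_fp8.contiguous().view(torch.uint8).flatten(),   # FP8-quantized nope data
        scale_fp32.contiguous().view(torch.uint8).flatten(), # FP32 scale factors
        rope_bf16.contiguous().view(torch.uint8).flatten(),  # BF16 rope data
    ]
    return torch.cat(parts)
```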
Changes:
- Adds a new 3BUFFER paged layout option with scale_dim parameter for managing quantized KV cache data
- Updates copyright years across multiple files to 2024-2026
- Adds kernel selection logic for uint8 dtype handling in the CUDA backend
Reviewed changes
Copilot reviewed 3 out of 4 changed files in this pull request and generated 10 comments.
| File | Description |
|---|---|
| op_tests/test_mla_persistent.py | Adds test infrastructure for 3-buffer layout including data packing/unpacking logic, new test function, and command-line arguments |
| hsa/gfx942/mla/mla.co | New compiled kernel binary for 3-buffer layout support |
| csrc/py_itfs_cu/asm_mla.cu | Updates kernel selection logic to handle uint8 dtype and adds new kernel path for 3-buffer layout |
| aiter/mla.py | Updates shape extraction logic to handle uint8 dtype differently from standard dtypes |
```python
err_ref_fp8 = checkAllclose(
    out_ref,
    out_ref_fp8,
    msg="mla_decode-absorb_fp8 [golden fp8 vs golden]:......",
)
```
Copilot AI (Jan 20, 2026):
The variable name 'err_ref_fp8' is assigned but never used. Consider removing it or using it in the test assertions.
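One way to act on this, assuming checkAllclose returns the measured error (the threshold below is illustrative, not from the PR):

```python
err_ref_fp8 = checkAllclose(
    out_ref,
    out_ref_fp8,
    msg="mla_decode-absorb_fp8 [golden fp8 vs golden]:......",
)
# Fail the test if the fp8 reference diverges too far (threshold is illustrative).
assert err_ref_fp8 < 1e-2, f"fp8 reference error too large: {err_ref_fp8}"
```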
```diff
     type=int,
-    default=128,
+    default=512,
     help="""qk nope head dim.
```
Copilot AI (Jan 20, 2026):
The default value for qk_nope_head_dim changed from 128 to 512, but the help text still shows the old default value (128) in the example. Update the help text to reflect the new default or explain why this change was made.
| help="""qk nope head dim. | |
| help="""qk nope head dim. | |
| Default is 512 (updated from 128 to match the current MLA kernel configuration). |
```cpp
{
    if(persistent)
    {
        // todo:add support condition
```
Copilot AI (Jan 20, 2026):
The TODO comment "todo:add support condition" is vague and does not specify what conditions need to be added or when this should be addressed. Either implement the missing conditions or provide a more specific TODO with details about what needs to be done.
Suggested change:
```diff
-        // todo:add support condition
+        // TODO: define and enforce the exact support conditions for the persistent Byte/Char KV path.
+        // This placeholder kernel selection should only be used once:
+        // - the correct assembly kernel symbol and .co file for Byte/Char KV are available, and
+        // - guards on shapes and configuration (e.g., gqa_ratio, sequence lengths, scaling tensors)
+        //   are added here to match the constraints of that kernel.
```
```diff
-    default=1,
+    default=64,
     help="""Block size.
     e.g.: -blk 1""",
```
Copilot AI (Jan 20, 2026):
The default value for block_size changed from 1 to 64, but the help text still shows the old default value (1) in the example. Update the help text to reflect the new default or explain why this change was made.
Suggested change:
```diff
-    e.g.: -blk 1""",
+    e.g.: -blk 64""",
```
Force-pushed from 236523a to 51d8feb.
Force-pushed from 51d8feb to b568d85.
Force-pushed from ba51370 to b78b537.
```python
        --varlen # True""",
    )

    parser.add_argument(
```
Please add descriptions for the "LEGACY" and "3BUFFER" layout options, for example:
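A sketch of the requested help text; the flag name `--kv_layout` and the exact wording are hypothetical, only the layout names come from the PR:

```python
import argparse

parser = argparse.ArgumentParser()
# Hypothetical flag; wording of the descriptions is illustrative.
parser.add_argument(
    "--kv_layout",
    type=str,
    choices=["LEGACY", "3BUFFER"],
    default="LEGACY",
    help="""Paged KV cache layout.
LEGACY: single buffer holding the standard KV cache tensor.
3BUFFER: FP8-quantized nope data, FP32 scale factors, and BF16 rope data
packed together as uint8 bytes.""",
)
```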
op_tests/test_mla_persistent.py (outdated)
```diff
         reduce_partial_map,
-        kv_granularity=max(page_size, 16),
+        page_size=page_size,
+        kv_granularity=2,
```
Is 2 an optimized setting for the page_size=64 case? I thought kv_granularity should be set to max(1, 16/page_size), but that formula doesn't match the value here. If the formula doesn't hold, I think we need a function that computes the granularity, with max(1, 16/page_size) as the default, as in the sketch below.
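A minimal sketch of the helper proposed in this comment, using max(1, 16 / page_size) as the default; the function name is hypothetical:

```python
def get_kv_granularity(page_size: int) -> int:
    """Return the KV granularity for a given page size.

    Default formula taken from the review comment: max(1, 16 / page_size).
    """
    return max(1, 16 // page_size)

# e.g.: page_size=64 -> 1, page_size=8 -> 2, page_size=1 -> 16
```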
```cpp
// expected payload handled by each cu part.
const int32_t payload = ck_tile::integer_divide_ceil(sum_blocks, params.num_splits) +
                        Traits::kFixedOverheadNumBlocks;
```
I think we need to adjust the fixed coefficient kFixedOverheadNumBlocks based on page size. Since it is not a divisor, we could make it a runtime parameter, as sketched below.
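For illustration, a Python sketch of the payload computation with the overhead passed in at runtime; the page-size scaling rule in the trailing comment is hypothetical:

```python
import math

def expected_payload(sum_blocks: int, num_splits: int,
                     overhead_num_blocks: int) -> int:
    """Expected number of blocks each CU partition handles.

    Mirrors the C++ expression above, but takes the overhead as a runtime
    parameter instead of the compile-time constant kFixedOverheadNumBlocks.
    """
    return math.ceil(sum_blocks / num_splits) + overhead_num_blocks

# Hypothetical page-size-dependent overhead (scaling rule is an assumption):
# overhead_num_blocks = max(1, 16 // page_size) * base_overhead
```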
I'm fine with doing this in a follow-up PR.
Force-pushed from 3c51e2c to ddb11de.
Motivation
Add support for page size 64 and a paged 3-buffer layout to the MLA persistent kernel for DeepSeek v3.2.
Technical Details
Test Plan
Test Result
Submission Checklist