@minmengdie (Contributor)

Motivation

Add a paged 3-buffer layout for DeepSeek v3.2 to the MLA persistent (ps) kernel.

Technical Details

Test Plan

Test Result

Submission Checklist

Copilot AI left a comment

Pull request overview

This pull request adds support for a paged 3-buffer layout to the MLA (Multi-head Latent Attention) persistent kernel implementation for DeepSeek v3.2. The 3-buffer layout uses separate buffers for FP8-quantized nope data, FP32 scale factors, and BF16 rope data, all packed together as uint8 bytes.

Changes:

  • Adds a new 3BUFFER paged layout option with scale_dim parameter for managing quantized KV cache data
  • Updates copyright years across multiple files to 2024-2026
  • Adds kernel selection logic for uint8 dtype handling in the CUDA backend
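
As a rough illustration of the layout described above, the sketch below packs the three per-page buffers into one flat uint8 blob. It is not the PR's actual packing code; the function name and shapes are assumptions.

# Minimal sketch (assumed API, not from the PR): pack FP8 nope data,
# FP32 scale factors, and BF16 rope data for one KV-cache page into a
# single uint8 byte buffer by reinterpreting each tensor as raw bytes.
import torch

def pack_page(nope_fp8: torch.Tensor,    # e.g. [page_size, kv_lora_rank], torch.float8_e4m3fn
              scale_fp32: torch.Tensor,  # e.g. [page_size, scale_dim], torch.float32
              rope_bf16: torch.Tensor,   # e.g. [page_size, qk_rope_head_dim], torch.bfloat16
              ) -> torch.Tensor:
    parts = [
        nope_fp8.contiguous().view(torch.uint8).flatten(),
        scale_fp32.contiguous().view(torch.uint8).flatten(),
        rope_bf16.contiguous().view(torch.uint8).flatten(),
    ]
    return torch.cat(parts)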

Reviewed changes

Copilot reviewed 3 out of 4 changed files in this pull request and generated 10 comments.

  • op_tests/test_mla_persistent.py: Adds test infrastructure for the 3-buffer layout, including data packing/unpacking logic, a new test function, and command-line arguments
  • hsa/gfx942/mla/mla.co: New compiled kernel binary for 3-buffer layout support
  • csrc/py_itfs_cu/asm_mla.cu: Updates kernel selection logic to handle the uint8 dtype and adds a new kernel path for the 3-buffer layout
  • aiter/mla.py: Updates shape extraction logic to handle the uint8 dtype differently from standard dtypes


Comment on lines +513 to +517
    err_ref_fp8 = checkAllclose(
        out_ref,
        out_ref_fp8,
        msg="mla_decode-absorb_fp8 [golden fp8 vs golden]:......",
    )

Copilot AI Jan 20, 2026

The variable name 'err_ref_fp8' is assigned but never used. Consider removing it or using it in the test assertions.

     type=int,
-    default=128,
+    default=512,
     help="""qk nope head dim.

Copilot AI Jan 20, 2026

The default value for qk_nope_head_dim changed from 128 to 512, but the help text still shows the old default value (128) in the example. Update the help text to reflect the new default or explain why this change was made.

Suggested change:
-    help="""qk nope head dim.
+    help="""qk nope head dim.
+    Default is 512 (updated from 128 to match the current MLA kernel configuration).

{
    if(persistent)
    {
        // todo:add support condition

Copilot AI Jan 20, 2026

The TODO comment "todo:add support condition" is vague and does not specify what conditions need to be added or when this should be addressed. Either implement the missing conditions or provide a more specific TODO with details about what needs to be done.

Suggested change:
-        // todo:add support condition
+        // TODO: define and enforce the exact support conditions for the persistent Byte/Char KV path.
+        // This placeholder kernel selection should only be used once:
+        // - the correct assembly kernel symbol and .co file for Byte/Char KV are available, and
+        // - guards on shapes and configuration (e.g., gqa_ratio, sequence lengths, scaling tensors)
+        //   are added here to match the constraints of that kernel.

-    default=1,
+    default=64,
     help="""Block size.
     e.g.: -blk 1""",

Copilot AI Jan 20, 2026

The default value for block_size changed from 1 to 64, but the help text still shows the old default value (1) in the example. Update the help text to reflect the new default or explain why this change was made.

Suggested change
e.g.: -blk 1""",
e.g.: -blk 64""",

@minmengdie force-pushed the mmd/dev/mla_ps_3buffer branch 2 times, most recently from 236523a to 51d8feb (January 21, 2026 09:04)
@minmengdie force-pushed the mmd/dev/mla_ps_3buffer branch from 51d8feb to b568d85 (January 21, 2026 09:05)
@minmengdie force-pushed the mmd/dev/mla_ps_3buffer branch from ba51370 to b78b537 (January 21, 2026 11:12)
--varlen # True""",
)

parser.add_argument(
Contributor

Add descriptions for the "LEGACY" and "3BUFFER" layout options in the argument help text.
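
One way to address this is sketched below; the option name "-kv_layout" and the exact wording are hypothetical, not the PR's actual code.

# Hypothetical argparse sketch addressing the comment above; the option
# name and help wording are assumptions, not taken from the PR.
parser.add_argument(
    "-kv_layout",
    type=str,
    choices=["LEGACY", "3BUFFER"],
    default="LEGACY",
    help="""Paged KV-cache layout.
LEGACY: the original paged KV-cache buffer layout.
3BUFFER: FP8 nope data, FP32 scale factors, and BF16 rope data packed per page as uint8 bytes.
e.g.: -kv_layout 3BUFFER""",
)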

     reduce_partial_map,
-    kv_granularity=max(page_size, 16),
     page_size=page_size,
+    kv_granularity=2,
Contributor

Is 2 an optimized setting for the page_size=64 case? I thought kv_granularity should be set to max(1, 16/page_size), but that formula doesn't match the value here. If the formula doesn't work in general, I think we need a helper function that computes the granularity, with max(1, 16/page_size) as the default.
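
A minimal sketch of the helper suggested here, assuming the max(1, 16/page_size) default rule; the function name is hypothetical.

# Hypothetical helper for the default rule proposed in this comment.
def get_kv_granularity(page_size: int) -> int:
    # max(1, 16 / page_size): e.g. page_size=64 -> 1, page_size=4 -> 4.
    return max(1, 16 // page_size)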


// expected payload handled by each cu part.
const int32_t payload = ck_tile::integer_divide_ceil(sum_blocks, params.num_splits) +
Traits::kFixedOverheadNumBlocks;
Contributor

I think we need to change the fixed coefficient kFixedOverheadNumBlocks based on the page size. Since it is not a divisor, we could make it a runtime parameter.
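
A minimal Python sketch of that suggestion, with hypothetical names: the overhead becomes a runtime argument (which the caller could derive from page_size) instead of the compile-time constant Traits::kFixedOverheadNumBlocks.

# Sketch only: same arithmetic as the C++ snippet above, but with the
# fixed overhead passed in at runtime rather than taken from Traits.
import math

def expected_payload(sum_blocks: int, num_splits: int,
                     fixed_overhead_num_blocks: int) -> int:
    # Ceil-divide the total block count across the splits, then add the
    # per-split fixed overhead (which the caller can scale by page size).
    return math.ceil(sum_blocks / num_splits) + fixed_overhead_num_blocks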

Contributor

It is fine with me to do this in a further PR.

@minmengdie changed the title from "mla ps add paged 3buffer layout for ds3.2" to "mla ps support page64 and add paged 3buffer layout for ds3.2" on Jan 23, 2026
@minmengdie force-pushed the mmd/dev/mla_ps_3buffer branch from 3c51e2c to ddb11de (January 23, 2026 08:17)