XAttention for XE1 platform #33307
base: master
Conversation
src/plugins/intel_gpu/src/graph/impls/cm/include/cm_pa_common.hpp (outdated review threads, resolved)
CacheHint::Cached,
CacheHint::Cached>(q_gather, gather_offsets, gather_pred);
rQ[ri].format<uint>() = gathered;
rQ[ri].format<half>() = cm_mul<half>(rQ[ri].format<half>(), (half)scale_factor);
Why not use gathered directly in cm_mul to avoid one register copy?
Directly multiplying gathered increases load-use stalls (sync.nop +23). Splitting the load and scale phases preserves the load/mul instruction counts and avoids the extra scoreboard waits, so I kept this form. I will update it.
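For readers following the discussion, a minimal sketch of the trade-off, assuming the register shapes from the diff above; the fused form is shown only as the rejected alternative:

```cpp
// Illustrative only. The fused form consumes `gathered` straight from the load,
// so the multiply waits on the gather (extra sync.nop scoreboard stalls); the
// split form kept in the PR copies first, letting the scale be scheduled away
// from the load.
//
// Fused (rejected):
//   rQ[ri].format<half>() = cm_mul<half>(gathered.format<half>(), (half)scale_factor);
//
// Split (as in the PR):
rQ[ri].format<uint>() = gathered;
rQ[ri].format<half>() = cm_mul<half>(rQ[ri].format<half>(), (half)scale_factor);
```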
#endif

// This condition only works for head_size <= 128
Add an assert for this limitation?
src/plugins/intel_gpu/src/graph/impls/cm/include/cm_pa_common.hpp (outdated review thread, resolved)
Force-pushed from 5f84164 to be5632c.
#endif

// This condition only works for head_size <= 128
Please add a guard for the noted limitation (head_size <= 128).
By the way, is this a limitation of the XE1 non-LSC path only? SLM capacity is a critical constraint.
Also consider static_assert(head_size % REG_N == 0) to catch misconfigurations.
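A minimal sketch of the requested guards, assuming head_size and REG_N are compile-time constants visible at this point in the header (exact placement is an assumption):

```cpp
// Sketch only: compile-time guards for the limitations discussed above.
static_assert(head_size <= 128, "this path only supports head_size <= 128");
static_assert(head_size % REG_N == 0, "head_size must be a multiple of REG_N");
```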
src/plugins/intel_gpu/src/graph/impls/cm/include/cm_pa_common.hpp (outdated review threads, resolved)
half* prefetch_k_pos = (half*)k_cache_base + prefetch_block_id * blk_stride + ((prefetch_kv_pos + wg_local_id) % CMPA_BLOCK_SZ) * head_size;
cm_ptr_prefetch<REG_K/2, DataSize::U32, CacheHint::Cached, CacheHint::Cached>((const unsigned int *const)prefetch_k_pos, 0);
Although the stateless prefetch/load works, I would also recommend the stateful API for the K cache: 1. the stateful API is more robust against out-of-bounds accesses; 2. it keeps the memory-access APIs consistent across the Q and K/V caches, since this kernel currently mixes several kinds of access APIs.
It is not mandatory, of course.
src/plugins/intel_gpu/src/graph/impls/cm/include/cm_pa_common.hpp (outdated review threads, resolved)
}
}
}
if (q_tokens_left == 0) return;
Why did you remove this early-exit line?
The early-exit was moved earlier: we now return right after clamping q_tokens_left (if (q_tokens_left == 0) return;).
This makes the later guard redundant, so it was removed. The writeback path is never reached when q_tokens_left == 0.
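For context, a sketch of the relocated check, matching the clamp quoted later in this thread; the surrounding kernel code is assumed:

```cpp
// Clamp the remaining Q tokens for this thread, then leave immediately if there
// is nothing to do -- the old guard before the writeback becomes redundant.
if (q_tokens_left < 0) q_tokens_left = 0;
if (q_tokens_left > q_step) q_tokens_left = q_step;
if (q_tokens_left == 0) return;
```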
auto P2 = P.format<half, num_P_tiles, REG_M * REG_K>();
matrix<half, REG_K/2, REG_N*2*VALUE_TILE_NUM> Vmat;
#pragma unroll
for(int k = 0, ri=0; k < head_size; k += REG_N * VALUE_TILE_NUM, ri += num_P_tiles * VALUE_TILE_NUM) {
Is there any performance impact when USE_LSC==1? We probably need a further check here.
Based on the existing tests there should be no performance degradation; we can run further tests to confirm.
#endif
matrix<half, head_size/REG_K, REG_K*REG_N> rQ;
matrix <float, head_size/REG_N*num_P_tiles, REG_M*REG_N> rO;
matrix <float, head_size/REG_M, REG_M*REG_N> rO;
For ARL-H the shape of rO is [head_size/REG_N, REG_M*REG_N]; this declaration uses head_size/REG_M, so it requires REG_M == REG_N. Is there any check/assert to guarantee this?
Add some assert.
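A possible shape check along the lines requested, assuming REG_M and REG_N are compile-time constants (exact placement is a guess):

```cpp
// Sketch only: make the implicit ARL-H assumption explicit.
static_assert(REG_M == REG_N,
              "rO is declared with head_size/REG_M rows; this layout assumes REG_M == REG_N");
```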
if (q_tokens_left < 0) q_tokens_left = 0;
if (q_tokens_left > q_step) q_tokens_left = q_step;
if (q_tokens_left == 0) return;
Can we move this check earlier? Probably not: this thread still needs to contribute to the work group, e.g. K/V cache prefetch and dequantization (for i8).
#ifdef CM_HAS_LSC_UNTYPED_2D
#define USE_LSC 1
#else
#define USE_LSC 0
#endif
Sounds like we can use CM_HAS_LSC_UNTYPED_2D directly?
Yes, I will update it.
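A minimal sketch of the suggested simplification; the comments inside the branches are placeholders, not the kernel's actual paths:

```cpp
// Sketch only: drop the USE_LSC alias and branch on the compiler macro directly.
#ifdef CM_HAS_LSC_UNTYPED_2D
    // path using LSC untyped 2D block load/store messages
#else
    // fallback for platforms without LSC untyped 2D support (the XE1 target of this PR)
#endif
```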
CM_INLINE constexpr auto reduce2d(matrix_ref<T, N, M> src) {
constexpr int group_size = M / group_count;
if constexpr (N > stop) {
if constexpr (N > stop && group_size > 1) {
Would you please explain why we need the additional "&& group_size > 1" check? @fish-jiang
Please check line 977: reduce2d only needs to be called once, and there is no need for the hard-coded values. It also lets us use a different BLOCK_SG_M size without other kernel code changes.
// constexpr int BLOCK_WG_K = 64; // same in sg // because unroll 4 times along K ??
constexpr int SUM_N = BLOCK_SG_N / (BLOCK_SIZE/STRIDE);

// // #ifndef BLOCK_SG_M
// #define BLOCK_SG_M 32
// #define BLOCK_SG_N 16
// #define SG_M 4
// #define SG_N 8
// #define HEAD_SIZE 128
// #define KV_BLOCK_SIZE 256
// #define STRIDE 16
// // #endif
Please remove these commented-out lines.
#endif
// 0~2 M[:]xK[0:16] 2~4 K[16:32] --> 32 * 2 regs
matrix<half, 2, BLOCK_REG_B> b0, b1;
matrix<half, REG_N, BLOCK_REG_B> b0, b1; // ping-pong B
Is there any problem with allocating REG_N blocks of BLOCK_REG_B registers here? @fish-jiang
There is no hard-coded size here; it lets us adjust the BLOCK_SG_N size.
#define CUR_TYPE CUR_TYPE_(SOFTMAX_TYPE)

template <int M, int N>
CM_INLINE void cm_load_2d(matrix_ref<SOFTMAX_TYPE, M, N> out,
There is a similar cm_load_2d function in estimate.hpp too. We need a refactor, maybe in a separate PR, to unify them.
} else {
scale_val = 255.0 / (max_val - min_val);
zp_val = (0.0 - min_val) * scale_val;
scale_val = half(255.0) / (max_val - min_val);
Please keep the computation of scale_val in float precision. There is a known issue caused by calculating it in half precision; please check #33485.
BTW, why does this PR change this file for ARL-H?
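A sketch of the requested fix, assuming max_val/min_val are half values as in the diff; the temporary names are illustrative:

```cpp
// Sketch only: do the scale/zero-point math in float and narrow at the end,
// to avoid the half-precision issue referenced in #33485.
float scale_f = 255.0f / ((float)max_val - (float)min_val);
float zp_f    = (0.0f - (float)min_val) * scale_f;
scale_val = scale_f;  // narrow to the destination type only after the float math
zp_val    = zp_f;
```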
ceciliapeng2011 left a comment
Some kernels are becoming increasingly hard to read and maintain as more options extend their functionality (KV-cache compression types, XE1/XE2 architectures, etc.). We definitely need a refactor to improve this, probably in a separate PR:
- pa_single_token.cm
- estimate.hpp
- find_blocks.hpp
- cm_pa_common.hpp
template <typename T1, typename T2>
CM_INLINE void Transpose_8x8(matrix_ref<T1, 8, 8> in, matrix_ref<T2, 8, 8> out) {
    matrix<T2, 8, 8> temp;
    temp.row(0) = in.template select<2, 1, 4, 2>(0, 0);
    temp.row(1) = in.template select<2, 1, 4, 2>(2, 0);
    temp.row(2) = in.template select<2, 1, 4, 2>(4, 0);
    temp.row(3) = in.template select<2, 1, 4, 2>(6, 0);
    temp.row(4) = in.template select<2, 1, 4, 2>(0, 1);
    temp.row(5) = in.template select<2, 1, 4, 2>(2, 1);
    temp.row(6) = in.template select<2, 1, 4, 2>(4, 1);
    temp.row(7) = in.template select<2, 1, 4, 2>(6, 1);

    out.row(0) = temp.template select<4, 1, 2, 4>(0, 0);
    out.row(2) = temp.template select<4, 1, 2, 4>(0, 1);
    out.row(4) = temp.template select<4, 1, 2, 4>(0, 2);
    out.row(6) = temp.template select<4, 1, 2, 4>(0, 3);
    out.row(1) = temp.template select<4, 1, 2, 4>(4, 0);
    out.row(3) = temp.template select<4, 1, 2, 4>(4, 1);
    out.row(5) = temp.template select<4, 1, 2, 4>(4, 2);
    out.row(7) = temp.template select<4, 1, 2, 4>(4, 3);
}
We can simply include <cm_attention_common.hpp> to get Transpose_8x8.
BTW, there was once an optimization to Transpose_16x16 that improved performance a lot; a similar approach may be applicable to Transpose_8x8 too.
const uint seq_idx = get_cm_global_id_2nd(0);
const uint kv_head_num_idx = get_cm_global_id_2nd(1) / Q_head_chunks_per_kv_head;
const uint head_num_idx = get_cm_global_id_2nd(1) * Q_head_chunk_size;
//# KV_PARTITION_SIZE --> EU thread
const uint wg_thread_id = cm_global_id(2);
const uint wg_thread_id = get_cm_global_id_2nd(2);
Why change this part to use an invented function?
On A770, cm_global_id() is not accessible/unsupported in our CM environment, so using it breaks compilation/runtime.
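For readers unfamiliar with the helper, a hedged guess at what a fallback like get_cm_global_id_2nd computes on hardware where cm_global_id() is unavailable; the actual implementation in this PR may differ:

```cpp
// Sketch only: reconstruct the global id from group/local CM intrinsics.
CM_INLINE uint get_cm_global_id_2nd(uint dim) {
    return cm_group_id(dim) * cm_local_size(dim) + cm_local_id(dim);
}
```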
auto batch = cm_get_global_id_2nd(0);
auto head = cm_get_global_id_2nd(1);
auto offset = cm_group_id(2) * REDUCE_SPLIT_SIZE;
Again, why invent this?