
Conversation

@djeong20
Contributor

This PR updates the INT4 GEMV OpenCL kernel to ensure compatibility with the Adreno GPU while maintaining the Intel intrinsics.

Self-evaluation:

  1. Build test: [X] Passed [ ] Failed [ ] Skipped
  2. Run test: [X] Passed [ ] Failed [ ] Skipped


Signed-off-by: Donghyeon Jeong <[email protected]>
ret.s0 = ptr[idx]; \
idx += get_max_sub_group_size(); \
ret.s1 = ptr[idx]; \
idx += get_max_sub_group_size();
Member


Be careful with such a #define.

Someone may write:

if (a == b)
  BLOCK_READ_IMPL_2;
else
  BLOCK_READ_IMPL_4;

which will break everything: only the first statement of the macro expansion is guarded by the if, and the following else no longer attaches to it (a syntax error).

https://stackoverflow.com/questions/923822/whats-the-use-of-do-while0-when-we-define-a-macro

barrier(CLK_LOCAL_MEM_FENCE);

uint sg_size = SUBGROUP_SIZE; // expected 16
for (uint stride = sg_size >> 1; stride > 0; stride >>= 1) {


I don't get it: shouldn't it be

if (get_sub_group_local_id() == 0)
{
   for (uint i = 1; i < SUBGROUP_SIZE; ++i)
   {
       buf_line[0] = buf_line[0] + buf_line[i];
   }
}

since we want to reduce all the elements, not just some?
I simply don't get what's going on with this stride.


Oh, I think I get it: to make it work the way you want, the accumulation into buf_line[0] from the strided element would need to be atomic (a local atomic fetch-add), otherwise it races between subgroup threads (the race is between the read of buf_line[0] and the modification of buf_line[0]).

Have you tested correctness of this reduction?

@piotrrak

piotrrak commented Nov 20, 2025

According to the document below, Qualcomm GPUs support subgroup operations, both as cl_khr_subgroup_* (part of OpenCL 3.0) and as cl_qcom_subgroups.
The only part that should need reimplementing is cl_intel_subgroup_local_block_io.

https://docs.qualcomm.com/bundle/publicresource/80-NB295-11_REV_C_Qualcomm_Snapdragon_Mobile_Platform_Opencl_General_Programming_and_Optimization.pdf

Please refer to chapters 8.9 and 9.2 of the above document. I don't see the point in reimplementing those, as they are supported by non-Intel extensions; reusing them should simplify debugging the code.

In any case, sub_group_reduce_add could be implemented correctly as a shuffle_down operation.

@djeong20
Contributor Author

Hello @piotrrak, I agree that only cl_intel_subgroup_local_block_io needs to be reimplemented.

#define SLM_BLOCK_READ_FLOAT(ptr_)                                             \
  as_float(intel_sub_group_block_read((const __local uint *)(ptr_)))

Do you have any suggestions for implementing this in non-Intel extensions?

@piotrrak

https://github.com/pocl/pocl/blob/6842d4a2de5f568d17cbfc30657016662ab7d6c9/lib/kernel/subgroups.cl#L443 might be relevant; it is the other open-source implementation of block_read/block_write and might be a good starting point. As for supporting very old Adreno targets: subgroup reduction and broadcast are basically shuffle_down and shuffle_up respectively, and those primitives were available via the cl_qcom_subgroups extension even on targets five years old.
