
Conversation

@djeong20
Contributor

This PR updates the INT4 GEMV OpenCL kernel to ensure compatibility with the Adreno GPU while maintaining the Intel intrinsics.

Self-evaluation:

  1. Build test: [X] Passed [ ] Failed [ ] Skipped
  2. Run test: [X] Passed [ ] Failed [ ] Skipped


Signed-off-by: Donghyeon Jeong <[email protected]>
ret.s0 = ptr[idx]; \
idx += get_max_sub_group_size(); \
ret.s1 = ptr[idx]; \
idx += get_max_sub_group_size();
Member


Be careful with such a #define.

Someone may write:

if (a == b)
  BLOCK_READ_IMPL_2;
else
  BLOCK_READ_IMPL_4;

which will break everything: only the first statement of the macro expansion is guarded by the if, and the following else no longer attaches to it (a syntax error).

https://stackoverflow.com/questions/923822/whats-the-use-of-do-while0-when-we-define-a-macro

barrier(CLK_LOCAL_MEM_FENCE);

uint sg_size = SUBGROUP_SIZE; // expected 16
for (uint stride = sg_size >> 1; stride > 0; stride >>= 1) {


I don't get it: shouldn't it be

if (get_sub_group_local_id() == 0)
{
   for (uint i = 1; i < SUBGROUP_SIZE; ++i)
   {
       buf_line[0] = buf_line[0] + buf_line[i];
   }
}

since we want to reduce all the elements, not just some?
I simply don't get what's going on with this stride.


Oh, I think I get it: to make it work the way you want, the accumulation into buf_line[0] from the strided element would need to be atomic (a local atomic fetch-add), otherwise it races between subgroup threads (the race is between the read of buf_line[0] and the modification of buf_line[0]).

Have you tested correctness of this reduction?

@piotrrak

piotrrak commented Nov 20, 2025

According to the document below, Qualcomm GPUs support subgroup operations, both as cl_khr_subgroup_* (part of OpenCL 3.0) and as cl_qcom_subgroups.
The only part that should need reimplementing is cl_intel_subgroup_local_block_io.

https://docs.qualcomm.com/bundle/publicresource/80-NB295-11_REV_C_Qualcomm_Snapdragon_Mobile_Platform_Opencl_General_Programming_and_Optimization.pdf

Please refer to chapters 8.9 and 9.2 of the above document. I don't see the point in reimplementing those, as they are supported by non-Intel extensions; reusing them should simplify debugging the code.

In any case, sub_group_reduce_add could be implemented correctly as a shuffle_down operation.

@djeong20
Contributor Author

Hello @piotrrak, I agree that only cl_intel_subgroup_local_block_io needs to be reimplemented.

#define SLM_BLOCK_READ_FLOAT(ptr_)                                             \
  as_float(intel_sub_group_block_read((const __local uint *)(ptr_)))

Do you have any suggestions for implementing this in non-Intel extensions?

@piotrrak

https://github.com/pocl/pocl/blob/6842d4a2de5f568d17cbfc30657016662ab7d6c9/lib/kernel/subgroups.cl#L443 might be relevant; it is the other open-source implementation of block_read/block_write and might be a good starting point. As for supporting very old Adreno targets: subgroup reduction and broadcast are basically shuffle_down and shuffle_up respectively, and those primitives were available via the cl_qcom_subgroups extension even on targets five years old.
