INT4 GEMV OpenCL Kernel for Adreno GPU Compatibility #3569
This PR updates the INT4 GEMV OpenCL kernel to ensure compatibility with the Adreno GPU while maintaining the Intel intrinsics.

**Self-evaluation:**
1. Build test: [X]Passed [ ]Failed [ ]Skipped
2. Run test: [X]Passed [ ]Failed [ ]Skipped

Signed-off-by: Donghyeon Jeong <[email protected]>
    ret.s0 = ptr[idx]; \
    idx += get_max_sub_group_size(); \
    ret.s1 = ptr[idx]; \
    idx += get_max_sub_group_size();
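Be careful with multi-statement #defines like this one.
Someone may write:
    if (a == b)
        BLOCK_READ_IMPL_2;
    else
        BLOCK_READ_IMPL_4;
which will break everything: only the first statement of the macro ends up as the body of the if, so the else no longer has an if to attach to. Wrapping the macro body in do { ... } while (0) avoids this:
https://stackoverflow.com/questions/923822/whats-the-use-of-do-while0-when-we-define-a-macro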
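The idiom from the linked answer can be sketched in plain C as follows. This is only an illustration, not the PR's macro: `pair_u32`, `BLOCK_READ_2`, and `read_pair` are hypothetical names, and the `stride` argument stands in for `get_max_sub_group_size()`.

```c
#include <stddef.h>

/* Hypothetical two-element result type standing in for OpenCL's uint2. */
typedef struct { unsigned s0, s1; } pair_u32;

/* Multi-statement macro wrapped in do { ... } while (0).
 * It expands to exactly one statement, so it can be used as the body
 * of an if/else without braces. 'stride' stands in for
 * get_max_sub_group_size() (an assumption for this host-side sketch). */
#define BLOCK_READ_2(ret, ptr, idx, stride) \
  do {                                      \
    size_t i_ = (idx);                      \
    (ret).s0 = (ptr)[i_];                   \
    i_ += (stride);                         \
    (ret).s1 = (ptr)[i_];                   \
  } while (0)

/* Uses the macro in exactly the unbraced if/else shape from the comment. */
static pair_u32 read_pair(const unsigned *ptr, size_t lane, size_t stride,
                          int second_half) {
  pair_u32 ret = {0, 0};
  if (second_half)
    BLOCK_READ_2(ret, ptr, lane + 2 * stride, stride);
  else
    BLOCK_READ_2(ret, ptr, lane, stride);
  return ret;
}
```

Without the `do { ... } while (0)` wrapper, the same `if`/`else` would not compile, because the statements after the first one separate the `if` from its `else`.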
    barrier(CLK_LOCAL_MEM_FENCE);

    uint sg_size = SUBGROUP_SIZE; // expected 16
    for (uint stride = sg_size >> 1; stride > 0; stride >>= 1) {
I don't get it. Shouldn't it be:
    if (get_sub_group_local_id() != 0)
    {
        for (uint i = 1; i < SUBGROUP_SIZE; ++i)
        {
            buf_line[0] = buf_line[0] + buf_line[i];
        }
    }
since we would want to reduce all elements, not just some?
I simply don't get what's going on with this stride.
Oh, I think I get it. To make it work the way you want, the accumulation of the strided element into local buf_line[0] should be atomic (a local atomic fetch-add); otherwise it races between subgroup threads (the race is between the read of buf_line[0] and the modification of buf_line[0]).
Have you tested the correctness of this reduction?
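For reference, here is a host-side C sketch of how a strided (tree) reduction is commonly written. The loop body in the PR is not visible in this excerpt, so this assumes the usual pattern: at each step, every active work-item adds the element `stride` slots away into its own slot, with a barrier between steps. Under that assumption no atomics are needed, because no two work-items write the same slot within a step. If every work-item instead accumulated into `buf_line[0]`, the race described above would be real. The function names are hypothetical.

```c
#define SUBGROUP_SIZE 16 /* matches the "expected 16" comment in the diff */

/* Host-side model of a tree reduction over local memory.
 * The inner loop over 'lane' plays the role of the work-items; finishing
 * it before moving to the next stride plays the role of
 * barrier(CLK_LOCAL_MEM_FENCE) between steps. */
static float tree_reduce(float buf_line[SUBGROUP_SIZE]) {
  for (unsigned stride = SUBGROUP_SIZE >> 1; stride > 0; stride >>= 1) {
    for (unsigned lane = 0; lane < stride; ++lane) {
      /* Each active lane writes only buf_line[lane] and reads
       * buf_line[lane + stride], which no active lane writes in this step. */
      buf_line[lane] += buf_line[lane + stride];
    }
  }
  return buf_line[0]; /* the total ends up in slot 0 after log2(16) = 4 steps */
}

/* Serial alternative: a single lane sums every element. */
static float serial_reduce(const float buf_line[SUBGROUP_SIZE]) {
  float acc = buf_line[0];
  for (unsigned i = 1; i < SUBGROUP_SIZE; ++i) {
    acc += buf_line[i];
  }
  return acc;
}
```

Both functions visit every element. The tree version takes four steps for 16 elements, while the serial version takes 15 additions on one work-item.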
Hello @piotrrak, I agree that only
    #define SLM_BLOCK_READ_FLOAT(ptr_) \
        as_float(intel_sub_group_block_read((const __local uint *)(ptr_)))
Do you have any suggestions for implementing this in non-Intel extensions?
https://github.com/pocl/pocl/blob/6842d4a2de5f568d17cbfc30657016662ab7d6c9/lib/kernel/subgroups.cl#L443 might be worth referring to: it's the other open-source implementation of block_read and block_write and might be a good starting point. As for supporting very old Adreno targets, subgroup reduction and broadcast are basically correspondingly
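As a rough illustration of what a portable fallback has to reproduce, the sketch below models the block-read access pattern in host-side C: each work-item reads the element at its own sub-group local ID, and for the two-element variant the second element sits one maximum sub-group size further on, which mirrors the `idx += get_max_sub_group_size()` lines in the diff. All names here (`u32x2`, `emulated_block_read`, `emulated_block_read2`, `as_float_bits`) are hypothetical, and `lane` and `max_sg_size` are plain parameters standing in for `get_sub_group_local_id()` and `get_max_sub_group_size()`.

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical stand-in for OpenCL's uint2. */
typedef struct { uint32_t s0, s1; } u32x2;

/* One-element block read: each lane picks the element at its own lane index. */
static uint32_t emulated_block_read(const uint32_t *ptr, unsigned lane) {
  return ptr[lane];
}

/* Two-element block read: the second element is max_sg_size further on. */
static u32x2 emulated_block_read2(const uint32_t *ptr, unsigned lane,
                                  unsigned max_sg_size) {
  u32x2 ret;
  unsigned idx = lane;
  ret.s0 = ptr[idx];
  idx += max_sg_size;
  ret.s1 = ptr[idx];
  return ret;
}

/* Host-side equivalent of as_float(): reinterpret the bits, no conversion. */
static float as_float_bits(uint32_t bits) {
  float f;
  memcpy(&f, &bits, sizeof f);
  return f;
}
```

In a kernel, the same indexing would be done on a `__local` or `__global` pointer, and the float variant would wrap the result in `as_float` as in the macro quoted above.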
