Correct reduce-then-scan -O0 workaround behavior with different host and device optimization levels #2133

mmichel11 · 2025-03-17T17:47:44Z

#2046 introduced a workaround for a hardware bug that prevents a sub-group size of 32 from being used with -O0 compilation on certain devices. The workaround detects whether optimization is enabled via the __OPTIMIZE__ macro. However, this approach does not work when the host and device compilers use different optimization levels, specifically when only one of the two phases uses -O0. Because the integral sub-group size constant is embedded in the kernel name, the host then requests a kernel that the device compiler never generated, which results in "Kernel not found" errors.
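The mismatch can be modeled in plain C++ without a SYCL toolchain. The sketch below is illustrative only: `scan_kernel_name`, `kernel_name_seen_by`, and `host_finds_kernel` are hypothetical names, not oneDPL code, and the "device image" is just a set of strings standing in for the kernels the device pass emitted.

```cpp
#include <set>
#include <string>

// Hypothetical stand-in for a kernel whose name encodes the sub-group size
// as an integral template parameter.
template <int SubGroupSize>
struct scan_kernel_name
{
    static std::string str() { return "scan_kernel<" + std::to_string(SubGroupSize) + ">"; }
};

// The name one compilation pass derives from its own view of __OPTIMIZE__:
// 32 when that pass is optimized, 16 (the workaround) when it is not.
inline std::string kernel_name_seen_by(bool pass_is_optimized)
{
    return pass_is_optimized ? scan_kernel_name<32>::str() : scan_kernel_name<16>::str();
}

// The device pass emits exactly one kernel; the host pass looks one up by name.
// Returns true when the lookup would succeed.
inline bool host_finds_kernel(bool host_optimized, bool device_optimized)
{
    const std::set<std::string> device_image{kernel_name_seen_by(device_optimized)};
    return device_image.count(kernel_name_seen_by(host_optimized)) == 1;
}
```

In this model the lookup succeeds only when both passes agree on the optimization level; as soon as exactly one of them uses -O0, the requested name is absent from the device image.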

To support these cases where the host and device compiler optimization levels may differ, we need to make the following changes:

Remove the sub-group size from the kernel name. This is done by no longer passing it as an integral template parameter to the submitters, since their template parameters are reflected in the kernel naming scheme.
Allocate sufficient temporary storage on the host to handle sub-group sizes of both 16 and 32. We store one partial result per sub-group, so a sub-group size of 16 requires more temporary storage.
Compute all sub-group-specific fields (e.g. __num_sub_groups_local, __num_sub_groups_global) on the device itself, as the true sub-group size can only be determined on the device.
Conditionally enable a sub-group size of 16 only when optimization is not enabled AND we are compiling for a SPIR-V (Intel) GPU target. This lets other targets use a sub-group size of 32 in all cases.
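The storage and work-distribution changes can be sketched as follows. This is a simplified model rather than the code in the PR: `temp_elements`, `sub_group_params`, and `make_params` are hypothetical names, and the sketch assumes the work-group size is a multiple of both candidate sub-group sizes.

```cpp
#include <cstddef>
#include <cstdint>

// Candidate sub-group sizes named in the PR description.
constexpr std::uint32_t kMinSubGroupSize = 16;
constexpr std::uint32_t kMaxSubGroupSize = 32;

// Host side: one partial result is stored per sub-group, so the buffer is
// sized for the smallest candidate sub-group size (the most sub-groups).
constexpr std::size_t temp_elements(std::uint32_t work_group_size, std::uint32_t num_work_groups)
{
    return static_cast<std::size_t>(work_group_size / kMinSubGroupSize) * num_work_groups;
}

// Hypothetical bundle of the fields that depend on the sub-group size.
struct sub_group_params
{
    std::uint32_t num_sub_groups_local;  // sub-groups per work-group
    std::uint32_t num_sub_groups_global; // sub-groups across all work-groups
};

// Device side: derive the fields from the sub-group size actually in effect
// inside the kernel, not from a value the host guessed.
constexpr sub_group_params make_params(std::uint32_t work_group_size, std::uint32_t num_work_groups,
                                       std::uint32_t actual_sub_group_size)
{
    const std::uint32_t local = work_group_size / actual_sub_group_size;
    return {local, local * num_work_groups};
}
```

Whichever of the two sizes the device ends up using, the number of partial results it writes never exceeds what the host allocated, so the host no longer needs to know the sub-group size in advance.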

include/oneapi/dpl/pstl/hetero/dpcpp/parallel_backend_sycl_reduce_then_scan.h

dmitriy-sobolev

I've left a small suggestion. Other than that, the PR looks good to me.

include/oneapi/dpl/pstl/hetero/dpcpp/parallel_backend_sycl_reduce_then_scan.h

dmitriy-sobolev

LGTM

danhoeflinger

Other than a couple relatively minor maintenance things I noticed, this LGTM

include/oneapi/dpl/pstl/hetero/dpcpp/parallel_backend_sycl_reduce_then_scan.h


- Adds a function to check device vendor IDs and return candidate sub-group sizes to the host compiler.
- Uses _ONEDPL_DETECT_SPIRV_COMPILATION along with _ONEDPL_DETECT_COMPILER_OPTIMIZATIONS_ENABLED to choose the sub-group size during device compilation.

Signed-off-by: Matthew Michel <[email protected]>
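The device-side choice described in this commit message can be approximated as below. This is a sketch under stated assumptions: `DEMO_SPIRV_TARGET` and `DEMO_OPTIMIZATIONS_ENABLED` are hypothetical stand-ins for the oneDPL detection macros, whose actual definitions are not shown in this conversation, and `select_sub_group_size` is an illustrative name.

```cpp
#include <cstdint>

// Hypothetical stand-in for _ONEDPL_DETECT_SPIRV_COMPILATION; defaults to 0
// unless the build defines it.
#ifndef DEMO_SPIRV_TARGET
#    define DEMO_SPIRV_TARGET 0
#endif

// Hypothetical stand-in for _ONEDPL_DETECT_COMPILER_OPTIMIZATIONS_ENABLED,
// derived here from the GCC/Clang __OPTIMIZE__ macro.
#if defined(__OPTIMIZE__)
#    define DEMO_OPTIMIZATIONS_ENABLED 1
#else
#    define DEMO_OPTIMIZATIONS_ENABLED 0
#endif

// Use the workaround size of 16 only for unoptimized SPIR-V builds;
// every other combination keeps a sub-group size of 32.
constexpr std::uint8_t select_sub_group_size(bool spirv_target, bool optimizations_enabled)
{
    return (spirv_target && !optimizations_enabled) ? 16 : 32;
}

// Value this translation unit would use, based on its own compilation flags.
constexpr std::uint8_t demo_sub_group_size =
    select_sub_group_size(DEMO_SPIRV_TARGET != 0, DEMO_OPTIMIZATIONS_ENABLED != 0);
```

Because the constant is evaluated separately in each compilation pass, it is only safe to use where a host/device disagreement cannot leak into the kernel name or the host-side launch configuration.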


danhoeflinger

LGTM assuming green CI follows

mmichel11 · 2025-04-04T14:33:06Z

Will run through internal CI once more to confirm there are no issues and will then merge.

…and device optimization levels (#2133)

The host and device compilers may use different optimization levels, causing the value of _ONEDPL_DETECT_COMPILER_OPTIMIZATIONS_ENABLED to differ between host and device code. This patch corrects this issue in the reduce-then-scan sub-group size workaround to ensure correctness and avoid missing-kernel runtime exceptions.

---------

Signed-off-by: Matthew Michel <[email protected]>

mmichel11 added this to the 2022.9.0 milestone Mar 17, 2025

mmichel11 mentioned this pull request Mar 17, 2025

Investigate removing the sycl::reqd_sub_group_size kernel attribute in reduce-then-scan kernels #2134

Open

mmichel11 marked this pull request as ready for review March 17, 2025 19:55

mmichel11 requested review from akukanov, dmitriy-sobolev, SergeyKopienko, danhoeflinger and adamfidel March 17, 2025 19:56

dmitriy-sobolev reviewed Mar 27, 2025

include/oneapi/dpl/pstl/hetero/dpcpp/parallel_backend_sycl_reduce_then_scan.h Outdated

dmitriy-sobolev reviewed Apr 1, 2025

include/oneapi/dpl/pstl/hetero/dpcpp/parallel_backend_sycl_reduce_then_scan.h Outdated

dmitriy-sobolev reviewed Apr 2, 2025

include/oneapi/dpl/pstl/hetero/dpcpp/parallel_backend_sycl_reduce_then_scan.h Outdated

dmitriy-sobolev previously approved these changes Apr 2, 2025

danhoeflinger reviewed Apr 3, 2025

include/oneapi/dpl/pstl/hetero/dpcpp/parallel_backend_sycl_reduce_then_scan.h Outdated

include/oneapi/dpl/pstl/hetero/dpcpp/parallel_backend_sycl_reduce_then_scan.h Outdated

mmichel11 added 17 commits April 3, 2025 14:51

Make __sub_group_size a static constexpr field to avoid kernel name mismatch

297a4de

Signed-off-by: Matthew Michel <[email protected]>

Allocate sufficient memory for all cases and move work distribution within kernel

c0fef1d

Signed-off-by: Matthew Michel <[email protected]>

Cleanup

dfa5b5a

Signed-off-by: Matthew Michel <[email protected]>

Bugfix

107ae10

Signed-off-by: Matthew Michel <[email protected]>

Only update __inputs_per_item on last iter and clang-format

fcc13a5

Signed-off-by: Matthew Michel <[email protected]>

Make __work_group_size a member of submitter classes

bef5aa2

The query of the nd-range in the function call operator caused a perf regression on NVGPUs. Signed-off-by: Matthew Michel <[email protected]>

Add differentiation for non-SPIRV targets

82fd79a

Signed-off-by: Matthew Michel <[email protected]>

Remove vendor id check

df9cec1

Signed-off-by: Matthew Michel <[email protected]>

Refactor reduce-then-scan sub-group size query

5c8b895

Signed-off-by: Matthew Michel <[email protected]>

Remove vendor ID field

383b383

Signed-off-by: Matthew Michel <[email protected]>

Remove unused variables and consolidate duplicated code in utility

5134168

Signed-off-by: Matthew Michel <[email protected]>

Correct update of __inputs_remaining and add additional comment

dcaff88

Signed-off-by: Matthew Michel <[email protected]>

clang-format

91ad0e5

Signed-off-by: Matthew Michel <[email protected]>

Split __get_reduce_then_scan_sg_sz into a separate host and device call

18a72c5

Signed-off-by: Matthew Michel <[email protected]>

Add missing <tuple> header

e62f6c1

Signed-off-by: Matthew Michel <[email protected]>

Add __get_reduce_then_scan_default_sg_sz and __get_reduce_then_scan_workaround_sg_sz

1453202

Signed-off-by: Matthew Michel <[email protected]>

Introduce a __reduce_then_scan_sub_group_params struct

b0a7eaf

Signed-off-by: Matthew Michel <[email protected]>

mmichel11 dismissed dmitriy-sobolev’s stale review via b0a7eaf April 4, 2025 14:04

mmichel11 force-pushed the dev/mmichel11/remove_sg_sz_template_rts branch from 3ccd466 to b0a7eaf Compare April 4, 2025 14:04

danhoeflinger approved these changes Apr 4, 2025

mmichel11 merged commit b34453d into main Apr 6, 2025
18 of 19 checks passed

mmichel11 deleted the dev/mmichel11/remove_sg_sz_template_rts branch April 6, 2025 15:28
