Correct reduce-then-scan -O0 workaround behavior with different host and device optimization levels #2133

Merged: mmichel11 merged 18 commits into main from dev/mmichel11/remove_sg_sz_template_rts on Apr 6, 2025

mmichel11
Contributor

#2046 introduced a workaround for a hardware bug that prevents sub-group sizes of 32 from being used with -O0 compilation on certain devices. The workaround detects whether optimization is enabled via the __OPTIMIZE__ macro. However, this approach does not work when the host and device compilers use different optimization levels, specifically when only one of the two phases uses -O0 compilation. The integral sub-group size constant is embedded in the kernel name, so the host requests a kernel that the device compiler never produced, which results in "Kernel not found" errors.
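
To illustrate the failure mode, here is a minimal host-only C++ sketch. It is a model, not oneDPL code: the names `pick_sub_group_size`, `kernel_name_with_size`, `kernel_name_without_size`, and `kernel_found` are hypothetical, and the kernel names are built as plain strings, whereas the real names are derived by the compiler from template parameters.

```cpp
#include <cassert>
#include <string>

// Hypothetical stand-in for the original workaround: the sub-group size is
// chosen at compile time from whether optimizations are enabled.
constexpr int pick_sub_group_size(bool optimizations_enabled)
{
    return optimizations_enabled ? 32 : 16;
}

// Hypothetical model of a kernel name that embeds the sub-group size, as
// happens when the size is an integral template parameter.
std::string kernel_name_with_size(bool optimizations_enabled)
{
    return "scan_kernel<" + std::to_string(pick_sub_group_size(optimizations_enabled)) + ">";
}

// Model of a naming scheme where the size is not part of the name.
std::string kernel_name_without_size(bool /*optimizations_enabled*/)
{
    return "scan_kernel";
}

// The host looks a kernel up by the name it computed; the lookup succeeds
// only if the device pass produced a kernel with the same name.
bool kernel_found(const std::string& host_name, const std::string& device_name)
{
    return host_name == device_name;
}

// Usage sketch: host compiled with optimizations, device compiled with -O0.
inline void demonstrate_mismatch()
{
    // Size in the name: the two passes disagree, so the lookup fails.
    assert(!kernel_found(kernel_name_with_size(true), kernel_name_with_size(false)));
    // Size removed from the name: the lookup no longer depends on the flags.
    assert(kernel_found(kernel_name_without_size(true), kernel_name_without_size(false)));
}
```

In this model the mismatch appears only when the two passes see different optimization settings, which matches the case described above where exactly one phase uses -O0.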

To support these cases where the host and device compiler optimization levels may differ, we need to make the following changes:

  1. Remove the sub-group size from the kernel name. This can be done by removing it as an integral template parameter of the submitters, since their template parameters are reflected in the kernel naming scheme.
  2. Allocate sufficient temporary storage on the host to handle both sub-group sizes of 16 and 32. We store one partial result per sub-group, so a sub-group size of 16 requires more temporary storage.
  3. Compute all sub-group specific fields (e.g. __num_sub_groups_local, __num_sub_groups_global, etc.) on the device itself, as the true sub-group size can only be determined on the device.
  4. Use a sub-group size of 16 only when optimization is not enabled AND we are compiling to a SPIR-V (Intel) target GPU. Other targets can then use a sub-group size of 32 in all cases.
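
Steps 2 to 4 can be sketched in plain C++ as follows. This is a simplified model under stated assumptions, not the code in this PR: `launch_config`, `temp_storage_elements`, `sub_group_fields`, `compute_on_device`, and `select_sub_group_size` are hypothetical names, the "device" function is an ordinary host function, and the sub-group counts are assumed to be per work-group (local) and across all work-groups (global).

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <initializer_list>

// Hypothetical launch configuration; field names are illustrative only.
struct launch_config
{
    std::size_t work_group_size;
    std::size_t num_work_groups;
};

// Step 2 (host): the true sub-group size is unknown here, so size the
// temporary buffer for the smallest candidate, which yields the most
// sub-groups and therefore the most partial results.
std::size_t temp_storage_elements(const launch_config& cfg,
                                  std::initializer_list<std::size_t> candidate_sizes)
{
    assert(candidate_sizes.size() > 0);
    const std::size_t smallest = std::min(candidate_sizes);
    assert(smallest > 0 && cfg.work_group_size % smallest == 0);
    return (cfg.work_group_size / smallest) * cfg.num_work_groups; // one partial per sub-group
}

// Assumed meaning of the sub-group specific fields in this model.
struct sub_group_fields
{
    std::size_t num_sub_groups_local;  // sub-groups per work-group
    std::size_t num_sub_groups_global; // sub-groups across all work-groups
};

// Step 3 (device, modeled as a plain function): derive the counts from the
// sub-group size the kernel was actually compiled with.
sub_group_fields compute_on_device(const launch_config& cfg, std::size_t actual_sub_group_size)
{
    assert(actual_sub_group_size > 0 && cfg.work_group_size % actual_sub_group_size == 0);
    const std::size_t local = cfg.work_group_size / actual_sub_group_size;
    return {local, local * cfg.num_work_groups};
}

// Step 4: 16 only for an unoptimized SPIR-V target, 32 everywhere else.
constexpr std::size_t select_sub_group_size(bool optimizations_enabled, bool spirv_target)
{
    return (!optimizations_enabled && spirv_target) ? 16 : 32;
}
```

For example, with a work-group size of 1024 and 8 work-groups, the host would reserve 512 partial results in this model. That covers a device pass that ends up with a sub-group size of 16 (512 sub-groups) as well as one that uses 32 (256 sub-groups), so the buffer is sufficient whichever size the device compiler selects.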

Contributor

@dmitriy-sobolev dmitriy-sobolev left a comment

I've left a small suggestion. Other than that, the PR looks good to me.

dmitriy-sobolev previously approved these changes Apr 2, 2025
Contributor

@dmitriy-sobolev dmitriy-sobolev left a comment

LGTM

Contributor

@danhoeflinger danhoeflinger left a comment

Other than a couple relatively minor maintenance things I noticed, this LGTM

mmichel11 added 17 commits April 3, 2025 14:51
Signed-off-by: Matthew Michel <[email protected]>
Signed-off-by: Matthew Michel <[email protected]>
The query of the nd-range in the function call operator caused a perf
regression on NVGPUs.

Signed-off-by: Matthew Michel <[email protected]>
- Add a function to check device vendor IDs and return candidate sub-group
sizes to the host compiler.
- Use _ONEDPL_DETECT_SPIRV_COMPILATION along with _ONEDPL_DETECT_COMPILER_OPTIMIZATIONS_ENABLED
to choose the sub-group size during device compilation.

Signed-off-by: Matthew Michel <[email protected]>
@mmichel11 mmichel11 force-pushed the dev/mmichel11/remove_sg_sz_template_rts branch from 3ccd466 to b0a7eaf Compare April 4, 2025 14:04
Contributor

@danhoeflinger danhoeflinger left a comment

LGTM assuming green CI follows

@mmichel11
Contributor Author

Will run through internal CI once more to confirm there are no issues and will then merge.

@mmichel11 mmichel11 merged commit b34453d into main Apr 6, 2025
18 of 19 checks passed
@mmichel11 mmichel11 deleted the dev/mmichel11/remove_sg_sz_template_rts branch April 6, 2025 15:28
timmiesmith pushed a commit that referenced this pull request Jun 9, 2025
…and device optimization levels (#2133)

The host and device compilers may use different optimization levels, causing the value of _ONEDPL_DETECT_COMPILER_OPTIMIZATIONS_ENABLED to differ between host and device code. This patch corrects the reduce-then-scan sub-group size workaround for that case to ensure correctness and avoid runtime exceptions from missing kernels.

---------

Signed-off-by: Matthew Michel <[email protected]>