Correct reduce-then-scan -O0 workaround behavior with different host and device optimization levels #2133
include/oneapi/dpl/pstl/hetero/dpcpp/parallel_backend_sycl_reduce_then_scan.h
I've left a small suggestion. Other than that, the PR looks good to me.
LGTM
Other than a couple relatively minor maintenance things I noticed, this LGTM
…ismatch Signed-off-by: Matthew Michel <[email protected]>
…ithin kernel Signed-off-by: Matthew Michel <[email protected]>
Signed-off-by: Matthew Michel <[email protected]>
The query of the nd-range in the function call operator caused a perf regression on NVGPUs. Signed-off-by: Matthew Michel <[email protected]>
- Adds a function that checks device vendor IDs and returns candidate sub-group sizes to the host compiler.
- Uses _ONEDPL_DETECT_SPIRV_COMPILATION along with _ONEDPL_DETECT_COMPILER_OPTIMIZATIONS_ENABLED to choose the sub-group size during device compilation.
Signed-off-by: Matthew Michel <[email protected]>
…orkaround_sg_sz Signed-off-by: Matthew Michel <[email protected]>
3ccd466 to b0a7eaf
LGTM assuming green CI follows
Will run through internal CI once more to confirm there are no issues and will then merge.
…and device optimization levels (#2133)
The host and device compilers may use different optimization levels, causing the value of _ONEDPL_DETECT_COMPILER_OPTIMIZATIONS_ENABLED to differ between host and device code. This patch corrects the issue in the reduce-then-scan sub-group size workaround to ensure correctness and avoid missing-kernel runtime exceptions.
Signed-off-by: Matthew Michel <[email protected]>
#2046 introduced a workaround for a hardware bug that prevents sub-group sizes of 32 from being used with -O0 compilation on certain devices, based on detecting optimization via the __OPTIMIZE__ macro. However, this approach does not work when the host and device compilers use different optimization levels, specifically when only one of the two phases uses -O0 compilation. The integral sub-group size constant is embedded in the kernel name, so the host calls a kernel that the device compiler never produced, which results in "Kernel not found" errors.

To support these cases where the host and device compiler optimization levels may differ, we need to make the following changes:
… __num_sub_groups_local, __num_sub_groups_global, etc.) on the device itself, as the true sub-group size can only be determined on the device.
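To make the failure mode concrete, here is a minimal host-only C++ sketch, not the oneDPL implementation. It models a kernel name that embeds the sub-group size, and then a helper that derives the sub-group counts from the size actually observed on the device. Every name in it (pick_sub_group_size, kernel_name, compile_device_image, host_finds_kernel, counts_on_device) is hypothetical, and the fallback size of 16 under -O0 is an assumption made for illustration.

```cpp
#include <cassert>
#include <cstdint>
#include <set>
#include <string>

// All names below are hypothetical stand-ins, not oneDPL or SYCL APIs.

// Assumed rule: sub-group size 32 normally, 16 when the -O0 workaround applies.
constexpr std::uint32_t pick_sub_group_size(bool optimizations_enabled)
{
    return optimizations_enabled ? 32u : 16u;
}

// The sub-group size is part of the kernel name, as described in the PR text.
inline std::string kernel_name(std::uint32_t sub_group_size)
{
    return "scan_kernel_sg" + std::to_string(sub_group_size);
}

// Models the device compilation pass: it emits exactly one kernel, named
// after the sub-group size chosen from the DEVICE optimization level.
inline std::set<std::string> compile_device_image(bool device_optimized)
{
    return {kernel_name(pick_sub_group_size(device_optimized))};
}

// Models the old host-side lookup: the host derives the kernel name from its
// OWN optimization level, so a mismatch means the kernel is not found.
inline bool host_finds_kernel(const std::set<std::string>& image, bool host_optimized)
{
    return image.count(kernel_name(pick_sub_group_size(host_optimized))) > 0;
}

struct sub_group_counts
{
    std::uint32_t local;  // sub-groups per work-group
    std::uint32_t global; // sub-groups across all work-groups
};

// Models the device-side computation: values that depend on the sub-group
// size are derived from the size observed on the device, not a host guess.
inline sub_group_counts counts_on_device(std::uint32_t work_group_size,
                                         std::uint32_t num_work_groups,
                                         std::uint32_t actual_sub_group_size)
{
    assert(actual_sub_group_size != 0);
    const std::uint32_t local = work_group_size / actual_sub_group_size;
    return {local, local * num_work_groups};
}
```

In this model, host_finds_kernel(compile_device_image(false), true) returns false, which corresponds to the "Kernel not found" case where only the device pass uses -O0. counts_on_device(256, 8, 16) yields 16 local and 128 global sub-groups regardless of what the host assumed.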