[SYCL][CUDA][HIP] Throw a runtime error with invalid sub-group size to kernel #6103
Comments
CUDA only currently supports one subgroup (warp) size : 32 for all devices.
This statement is unclear for me. Should a SYCL implementation be able to compile device code using A problem with such code being able to compile even when a reqd subgroup size that is specified with I think that #6183 introduces a sensible solution for backends which only support a single subgroup size: if
The other alternative would be to allow
With some other invalid subgroup sizes the code compiles and executes with no errors. I am not sure if the implementation is complete (i.e. For the HIP AMD case, older devices only support a subgroup (wavefront) size of 64, but newer "RDNA" cards also support 32. @npmiller do you know if DPC++ HIP AMD supports both subgroup sizes for the "RDNA" cards? |
We do support both sizes, in fact RDNA cards will default to the 32 thread wavefront size, and I believe that's what the CI hardware does. |
So the motivation for the spec to require that all SYCL applications compile, even if a device may not support a feature at runtime, is for portability, so an application will always compile as long as it's valid SYCL code. I agree it does seem a little counter intuitive for It might be worth emitting a compiler warning for this, to help catch the case where an invalid sub-group size is being used by mistake rather than intentionally? |
Yep Compiler warning added here: #6183 |
Update:
In the meantime, a warning has been added for the CUDA case in this PR: #6183 |
I think I came across a runtime error for an invalid sub-group size set in the kernel attributes when running on Intel Gen9, ATS, and PVC devices, specifically with the L0 backend. It might be worth taking a look at L0? |
@gmlueck Do you know if this is the case? Could you clarify the current support for |
This is a sample output from SYCL on Intel Gen9 (support SG size: 8,16,32) using L0 backend.
abagusetty@iris09:~/soft/sycl_req_sub_group_size$ dpcpp sycl_free_mem.cpp
abagusetty@iris09:~/soft/sycl_req_sub_group_size$ ./a.out
Number of Root GPUs: 1
Found Root GPU-ID: Intel(R) Iris(R) Pro Graphics P580 [0x193a]
Supported sub-group sizes: 81632
terminate called after throwing an instance of 'cl::sycl::compile_program_error'
what(): The program was built for 1 devices
Build program log for 'Intel(R) Iris(R) Pro Graphics P580 [0x193a]':
error: Unsupported required sub group size
in kernel: 'typeinfo name for main::'lambda'(cl::sycl::item<1, true>)'
error: backend compiler failed build.
-11 (CL_BUILD_PROGRAM_FAILURE)
Aborted |
That's at compile time right? Same error that I observed for the OpenCL backend (incidentally using the same sub_group size, 7). I also observed that if you try very large subgroup sizes (> 64 I think) or very small subgroup sizes (e.g. size 1) then OpenCL backend will compile and run without any errors, presumably with a default subgroup size. |
The error was at Runtime, the compilation didn't really result in any warning or error. Also I check with using |
Forgive me if I'm just saying something you already know. The backend compiler runs when the app runs. The app loads up the kernel, sees what device it is being asked to run the kernel on, and then compiles it for that device. If you want to run the backend compiler in advance, use the |
I think the current behavior is that you get a JIT-time compilation error if a kernel is decorated with We expect to fix this problem and change the runtime to throw the |
This is a solution to #6103 for the CUDA case only. HIP AMD case still needs to be considered as discussed here: #6103 (comment).
CUDA only currently supports one subgroup (warp) size : 32 for all devices. This PR introduces a solution to #6103 appropriate for backends which only support a single subgroup size: if the optional kernel attribute reqd_sub_group_size() is used with the supported subgroup size then it will compile and behave as the programmer intends. If reqd_sub_group_size() is used with another incompatible subgroup size a warning is returned when compiling, such as:
reqd-sub-group-size-cuda.cpp:12:73: warning: attribute argument 8 is invalid and will be ignored; CUDA requires sub_group size 32 [-Wcuda-compat]
h.single_task<class invalid_kernel>([=] [[sycl::reqd_sub_group_size(8)]] {});
^
Signed-off-by: JackAKirk [email protected]
I've removed the CUDA label although we can keep this issue open: for HIP and other backends we should wait for the more general fix that's coming using the OptionalDeviceFeatures. A simple warning has been added for CUDA since it's a simple case where only a single subgroup size exists. |
@JackAKirk A simple warning for now is plenty. Thanks! |
Are there any updates on fixing this?
I believe this should be fixed now in the intel/llvm repo, but only for JIT (SPIR-V) compilations. Are you still seeing an error without Note that we do not yet have the implementation for AOT compilation ( |
$ clang++ --version
clang version 17.0.0 (https://github.com/intel/llvm 9cab102bf9c6f2f0dfb0113ab1091a3982e746e6)
Target: x86_64-unknown-linux-gnu
Thread model: posix
InstalledDir: /home/aland/intel-sycl/llvm/build/install//bin
$ clang++ -fsycl sg.cpp -o sg && ONEAPI_DEVICE_SELECTOR=opencl:gpu ./sg
Calling 16
terminate called after throwing an instance of 'sycl::_V1::compile_program_error'
what(): The program was built for 1 devices
Build program log for 'Intel(R) Arc(TM) A770 Graphics':
error: Unsupported required sub group size
in kernel: 'typeinfo name for sycl::_V1::detail::__pf_kernel_wrapper<Kernel<64> >'
error: backend compiler failed build.
-11 (PI_ERROR_BUILD_PROGRAM_FAILURE)
Aborted (core dumped)
#include <sycl/sycl.hpp>
#include <algorithm>
#include <cassert>
#include <iostream>
#include <vector>

template <int subGroupSize> class Kernel;

// Launches a trivial kernel that requires the given sub-group size.
template <int subGroupSize> void run_kernel(const sycl::device &syclDevice) {
  static const int numThreads = 64;
  std::cout << "Calling " << subGroupSize << std::endl;
  sycl::queue queue = sycl::queue(syclDevice);
  auto buf = sycl::malloc_device<float>(1, queue);
  queue
      .submit([&](sycl::handler &cgh) {
        cgh.parallel_for<Kernel<subGroupSize>>(
            sycl::range<1>{numThreads},
            [=](sycl::id<1> threadId)
                [[sycl::reqd_sub_group_size(subGroupSize)]] { buf[0] = 1; });
      })
      .wait_and_throw();
  sycl::free(buf, queue); // release the device allocation
  std::cout << " Done!" << std::endl;
}

int main() {
  std::vector<sycl::device> devices = sycl::device::get_devices();
  for (const auto &dev : devices) {
    std::vector<size_t> subGroupSizes;
    subGroupSizes = dev.get_info<sycl::info::device::sub_group_sizes>();
    // Only one flavor is called per device, but all three are instantiated.
    if (subGroupSizes[0] == 32) {
      run_kernel<32>(dev);
    } else if (subGroupSizes[0] == 64) {
      run_kernel<64>(dev);
    } else {
      assert(std::find(subGroupSizes.begin(), subGroupSizes.end(), 16) !=
             subGroupSizes.end());
      run_kernel<16>(dev);
    }
  }
}
@al42and This is interesting. I believe DG1 cards mostly support 8, 16, and 32. Can you please verify that this RT error is triggered only for 16 and not for 8 and 32? It could also be a limitation of the OpenCL plugin. Would you be able to verify this for the L0 plugin as well? |
I believe it's not related to PI but to how JIT is called. Same behavior with L0, I'm invoking 16-wide flavor, but get an error related to (instantiated, but not called) 64-wide flavor:
$ clang++ -fsycl sg.cpp -o sg && ONEAPI_DEVICE_SELECTOR=level_zero:gpu ./sg
Calling 16
terminate called after throwing an instance of 'sycl::_V1::compile_program_error'
what(): The program was built for 1 devices
Build program log for 'Intel(R) Arc(TM) A770 Graphics':
error: Unsupported required sub group size
in kernel: 'typeinfo name for sycl::_V1::detail::__pf_kernel_wrapper<Kernel<64> >'
error: backend compiler failed build.
-11 (PI_ERROR_BUILD_PROGRAM_FAILURE)
Aborted (core dumped)
If I only have 8, 16, and 32-wide kernels, everything works fine since the device supports all of them. |
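The reproducer above only looks at the first entry of the reported sub-group sizes. A slightly more defensive host-side selection can be sketched as below. This is a minimal sketch in plain C++, not part of DPC++: `pick_sub_group_size` is a hypothetical helper, and `supported` is assumed to hold whatever `dev.get_info<sycl::info::device::sub_group_sizes>()` returned. It only decides which flavor to call; as described in this thread, it would not by itself avoid the JIT failure caused by an instantiated but unsupported flavor.

```cpp
#include <algorithm>
#include <cstddef>
#include <optional>
#include <vector>

// Hypothetical helper (not a SYCL API): returns the first entry of
// `preferred` that also appears in `supported`, or std::nullopt if none does.
// `supported` is assumed to come from
// dev.get_info<sycl::info::device::sub_group_sizes>().
inline std::optional<std::size_t>
pick_sub_group_size(const std::vector<std::size_t> &supported,
                    const std::vector<std::size_t> &preferred) {
  for (std::size_t candidate : preferred) {
    if (std::find(supported.begin(), supported.end(), candidate) !=
        supported.end()) {
      return candidate;
    }
  }
  return std::nullopt;
}
```

With a device reporting 8, 16, and 32 and a preference order of 64, 32, 16, the helper selects 32; if nothing matches, the caller can skip the device instead of asserting.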
It is WIP - module splitting based on aspects (e.g. |
Describe the bug
Passing an invalid sub-group size to a kernel doesn't result in a runtime error. The runtime should possibly throw an error, as seen on Intel devices with invalid sub-group sizes. This behavior was observed for both the CUDA and HIP plugins.
The valid size is possibly 32 for CUDA and possibly 64 for HIP.
From the SYCL spec:
Each device supports only certain sub-group sizes as defined by info::device::sub_group_sizes. In addition, some device features may be incompatible with certain sub-group sizes. If a kernel is decorated with this attribute and then submitted to a device that does not support the sub-group size or if the kernel uses a feature that the device does not support with this sub-group size, the implementation must throw a synchronous exception with the errc::kernel_not_supported error code.
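The check the spec describes can be modeled as follows. This is only a sketch in plain C++ under stated assumptions, not the DPC++ runtime code: `kernel_not_supported_error` is a hypothetical stand-in for a `sycl::exception` carrying `errc::kernel_not_supported`, `check_reqd_sub_group_size` is a hypothetical name, and `supported` stands in for the value of `info::device::sub_group_sizes`.

```cpp
#include <algorithm>
#include <cstddef>
#include <stdexcept>
#include <string>
#include <vector>

// Hypothetical stand-in for a sycl::exception with
// errc::kernel_not_supported; not a real SYCL type.
struct kernel_not_supported_error : std::runtime_error {
  using std::runtime_error::runtime_error;
};

// Throws synchronously if `requested` is not among the sizes the device
// reports. `supported` models info::device::sub_group_sizes.
inline void
check_reqd_sub_group_size(const std::vector<std::size_t> &supported,
                          std::size_t requested) {
  const bool ok = std::find(supported.begin(), supported.end(), requested) !=
                  supported.end();
  if (!ok) {
    throw kernel_not_supported_error(
        "required sub-group size " + std::to_string(requested) +
        " is not supported by this device");
  }
}
```

For a CUDA-like device that reports only 32, a request for 32 passes and a request for 8 throws, which is the behavior this issue asks for at submission time.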
To Reproduce
Environment (please complete the following information):
CUDA PI built with CUDA-11.6.2, HIP PI built with rocm-5.1.0
llvm compiler: ac6a4f5