[SYCL][CUDA][HIP] Throw a runtime error with invalid sub-group size to kernel #6103

Open
abagusetty opened this issue May 5, 2022 · 21 comments
Labels
bug Something isn't working hip Issues related to execution on HIP backend. runtime Runtime library related issue

Comments

@abagusetty
Contributor

Describe the bug
Using an invalid sub-group size to the kernel doesn't result in a runtime error. Possibly throw a RT error as seen on Intel devices with invalid sub-group sizes. This behavior was observed for both the CUDA and HIP Plugins.

For CUDA possibly 32, for HIP possibly 64.

From the SYCL spec:
Each device supports only certain sub-group sizes as defined by info::device::sub_group_sizes. In addition, some device features may be incompatible with certain sub-group sizes. If a kernel is decorated with this attribute and then submitted to a device that does not support the sub-group size or if the kernel uses a feature that the device does not support with this sub-group size, the implementation must throw a synchronous exception with the errc::kernel_not_supported error code.

To Reproduce

#include <sycl/sycl.hpp>

int main() {
  auto const& gpu_devices = sycl::device::get_devices(sycl::info::device_type::gpu);
  std::cout << "Number of Root GPUs: " << gpu_devices.size() << std::endl;

  for(const auto& d : gpu_devices) {
    std::cout << "Found Root GPU-ID: " << d.get_info<sycl::info::device::name>() << std::endl;
    std::vector<size_t> sg_sizes = d.get_info<sycl::info::device::sub_group_sizes>();
    std::cout << "Supported sub-group sizes: ";
    for (size_t s : sg_sizes) {
      std::cout << s << " ";
    }
    std::cout << std::endl;
  }

  const int N{1024};
  sycl::queue Q{sycl::gpu_selector{}};
  int* ptr = sycl::malloc_device<int>(N, Q);
  Q.parallel_for(N, [=](sycl::item<1> id) [[sycl::reqd_sub_group_size(8)]] { ptr[id] = id; }).wait();
  sycl::free(ptr, Q);
  return 0;
}
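
(For reference, a minimal sketch, not part of the original report, of how the submission above could be wrapped to surface the spec-mandated error, assuming an implementation that actually throws it; Q, N, and ptr are the names from the reproducer:)

  try {
    Q.parallel_for(N, [=](sycl::item<1> id) [[sycl::reqd_sub_group_size(8)]] { ptr[id] = id; }).wait();
  } catch (const sycl::exception &e) {
    // The spec requires errc::kernel_not_supported for an unsupported sub-group size.
    if (e.code() == sycl::errc::kernel_not_supported) {
      std::cerr << "Unsupported sub-group size: " << e.what() << std::endl; // needs <iostream>
    } else {
      throw; // unrelated error
    }
  }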

Environment (please complete the following information):
CUDA PI built with CUDA 11.6.2, HIP PI built with ROCm 5.1.0
llvm compiler: ac6a4f5

@abagusetty abagusetty added the bug Something isn't working label May 5, 2022
@bader bader added cuda CUDA back-end hip Issues related to execution on HIP backend. labels May 7, 2022
@AerialMantis AerialMantis added the runtime Runtime library related issue label May 8, 2022
@JackAKirk
Contributor

JackAKirk commented May 23, 2022

CUDA currently supports only one subgroup (warp) size: 32, for all devices.
As detailed in section 5.7 of the SYCL 2020 Revision 5 spec, reqd_sub_group_size() is an optional kernel feature, which means it does not have to be supported on all devices, but the spec states:

In order to guarantee source code portability of SYCL applications that use optional kernel features, all SYCL implementations must be able to compile device code that uses these optional features regardless of whether the implementation supports the features on any of its devices.

This statement is unclear to me. Should a SYCL implementation be able to compile device code using reqd_sub_group_size for all backends, or only for those in which reqd_sub_group_size can be sensibly implemented and has a use?

A problem with such code compiling even when the size specified with reqd_sub_group_size() is invalid for the given device (i.e. a device that doesn't support that instance of the optional kernel feature) is that the code might then run but give unexpected results, if the user assumes a different subgroup size from the one actually used on the device. An example is #5971.

I think that #6183 introduces a sensible solution for backends which only support a single subgroup size: if reqd_sub_group_size() is used with the supported subgroup size, it will compile and behave as the programmer intends. If reqd_sub_group_size() is used with an incompatible subgroup size, the following error is emitted when compiling:

invalid_sg_orig.cpp:20:21: error: CUDA backend requires subgroup size 32
  Q.parallel_for(N, [=](sycl::item<1> id) [[sycl::reqd_sub_group_size(8)]] { ptr[id] = id; }).wait();
                    ^
1 error generated.

The other alternative would be to allow reqd_sub_group_size with an invalid subgroup size to compile but throw errc::kernel_not_supported at runtime. A potential technical difficulty with this is that, since CUDA only supports a single subgroup size and doesn't throw any corresponding device errors for an incorrect subgroup size, it isn't possible to propagate an error from the device to the runtime via the PI plugin (which may be how it would be dealt with in other backends). The only other solution I can imagine would be to attach a subgroup attribute to the kernel that can be checked prior to kernel submission. This brings me to another point: I observe that, using the OpenCL CPU backend with the sycl::reqd_sub_group_size kernel attribute and some invalid subgroup sizes, I get the following:

Failed to build: : -11 (CL_BUILD_PROGRAM_FAILURE)

llvm-foreach: 
clang-15: error: x86_64 compiler command failed with exit code 245 (use -v to see invocation)

With some other invalid subgroup sizes the code compiles and executes with no errors. I am not sure whether the implementation is complete (i.e. whether reqd_sub_group_size(val) leads to errc::kernel_not_supported if val is an unsupported size) on the backends that I don't have access to.
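
(As an aside, until a runtime mechanism exists, a user-side guard is possible. A minimal sketch, with a hypothetical helper name, using only the standard info::device::sub_group_sizes query:)

#include <algorithm>
#include <sycl/sycl.hpp>

// Check a requested sub-group size against the sizes the device reports,
// before submitting a kernel decorated with reqd_sub_group_size.
bool supports_sub_group_size(const sycl::device &dev, size_t requested) {
  auto sizes = dev.get_info<sycl::info::device::sub_group_sizes>();
  return std::find(sizes.begin(), sizes.end(), requested) != sizes.end();
}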

For the HIP AMD case, older devices only support a subgroup (wavefront) size of 64, but newer "RDNA" cards also support 32. @npmiller do you know if DPC++ HIP AMD supports both subgroup sizes for the "RDNA" cards?
If DPC++ currently supports only the older size of 64 on all devices, it may be sensible to apply a solution similar to #6183 for HIP AMD, until such time as both subgroup sizes are supported and reqd_sub_group_size is implemented for the HIP AMD backend. At the moment the metadata attached via reqd_sub_group_size is apparently ignored in the HIP AMD backend and only used in the OpenCL backend. Presumably the Level Zero backend also does not support reqd_sub_group_size at this time, and would run into issues similar to those raised here?

@npmiller
Contributor

npmiller commented May 23, 2022

We do support both sizes; in fact, RDNA cards default to the 32-thread wavefront size, and I believe that's what the CI hardware uses.

@AerialMantis
Contributor

So the motivation for the spec requiring that all SYCL applications compile, even if a device may not support a feature at runtime, is portability: an application will always compile as long as it's valid SYCL code. I agree it does seem a little counterintuitive for sycl::reqd_sub_group_size to compile for sub-group sizes which are known to be invalid, since this will fail at runtime and the compiler may know that it cannot support that size. However, you may want to compile for multiple backends, in which case the compiler can't know which binary will be used.

It might be worth emitting a compiler warning for this, to help catch the case where an invalid sub-group size is being used by mistake rather than intentionally.

@JackAKirk
Contributor

So the motivation for the spec requiring that all SYCL applications compile, even if a device may not support a feature at runtime, is portability: an application will always compile as long as it's valid SYCL code. I agree it does seem a little counterintuitive for sycl::reqd_sub_group_size to compile for sub-group sizes which are known to be invalid, since this will fail at runtime and the compiler may know that it cannot support that size. However, you may want to compile for multiple backends, in which case the compiler can't know which binary will be used.

It might be worth emitting a compiler warning for this, to help catch the case where an invalid sub-group size is being used by mistake rather than intentionally.

Yep, a compiler warning was added here: #6183

@JackAKirk
Contributor

JackAKirk commented May 24, 2022

Update:

The reqd_sub_group_size() kernel attribute does not have a complete implementation in any backend: no backend throws errc::kernel_not_supported for the invalid cases described in the SYCL 2020 spec. It sounds like a plan to implement a general solution for this is in the works.

In the meantime reqd_sub_group_size() appears to have a partially working implementation for OpenCL (I did not test it fully: I described my experiments in an earlier comment), but for other backends the kernel metadata is unused.

A warning has been added for the CUDA case in this PR: #6183
Hopefully this will warn users of potentially unexpected behaviour when attempting to port OpenCL code that uses the reqd_sub_group_size kernel attribute to CUDA.

@abagusetty
Contributor Author

I think I came across a runtime error for an invalid sub-group size set via the kernel attribute when running on Intel Gen9, ATS, and PVC devices, specifically with the L0 backend. Might be worth taking a peek at L0?

@JackAKirk
Contributor

I think I came across a runtime error for an invalid sub-group size set via the kernel attribute when running on Intel Gen9, ATS, and PVC devices, specifically with the L0 backend. Might be worth taking a peek at L0?

@gmlueck Do you know if this is the case? Could you clarify the current support for reqd_sub_group_size() in L0?
Thanks

@abagusetty
Contributor Author

abagusetty commented May 24, 2022

This is sample output from SYCL on an Intel Gen9 device (supported SG sizes: 8, 16, 32) using the L0 backend.
The error was generated with: Q.parallel_for(N, [=](sycl::item<1> id) [[sycl::reqd_sub_group_size(7)]] { ptr[id] = id; }).wait();

abagusetty@iris09:~/soft/sycl_req_sub_group_size$ dpcpp sycl_free_mem.cpp
abagusetty@iris09:~/soft/sycl_req_sub_group_size$ ./a.out 
Number of Root GPUs: 1
Found Root GPU-ID: Intel(R) Iris(R) Pro Graphics P580 [0x193a]
Supported sub-group sizes: 8 16 32
terminate called after throwing an instance of 'cl::sycl::compile_program_error'
  what():  The program was built for 1 devices
Build program log for 'Intel(R) Iris(R) Pro Graphics P580 [0x193a]':

error: Unsupported required sub group size
in kernel: 'typeinfo name for main::'lambda'(cl::sycl::item<1, true>)'
error: backend compiler failed build.
 -11 (CL_BUILD_PROGRAM_FAILURE)
Aborted

@JackAKirk
Contributor

JackAKirk commented May 24, 2022

This is sample output from SYCL on an Intel Gen9 device (supported SG sizes: 8, 16, 32) using the L0 backend. The error was generated with: Q.parallel_for(N, [=](sycl::item<1> id) [[sycl::reqd_sub_group_size(7)]] { ptr[id] = id; }).wait();

abagusetty@iris09:~/soft/sycl_req_sub_group_size$ ./a.out 
Number of Root GPUs: 1
Found Root GPU-ID: Intel(R) Iris(R) Pro Graphics P580 [0x193a]
Supported sub-group sizes: 8 16 32
terminate called after throwing an instance of 'cl::sycl::compile_program_error'
  what():  The program was built for 1 devices
Build program log for 'Intel(R) Iris(R) Pro Graphics P580 [0x193a]':

error: Unsupported required sub group size
in kernel: 'typeinfo name for main::'lambda'(cl::sycl::item<1, true>)'
error: backend compiler failed build.
 -11 (CL_BUILD_PROGRAM_FAILURE)
Aborted

That's at compile time, right? It's the same error that I observed for the OpenCL backend (incidentally using the same sub-group size, 7). I also observed that if you try very large subgroup sizes (> 64, I think) or very small subgroup sizes (e.g. size 1), the OpenCL backend will compile and run without any errors, presumably with a default subgroup size.

@abagusetty
Contributor Author

That's at compile time, right? It's the same error that I observed for the OpenCL backend. I also observed that if you try very large subgroup sizes (> 64, I think) or very small subgroup sizes (e.g. size 1), the OpenCL backend will compile and run without any errors, presumably with a default subgroup size.

The error was at runtime; the compilation didn't result in any warning or error. I also checked whether the -Wsycl-strict option has any guards for this, but to no avail.

@cperkinsintel
Contributor

Forgive me if I'm just saying something you already know. The backend compiler runs when the app runs: the app loads the kernel, sees which device it is being asked to run the kernel on, and then compiles it for that device. If you want to run the backend compiler in advance, use the -fsycl-targets flag to specify the hardware you want it compiled for, for example -fsycl-targets=spir64_gen. Then you'll get that error at your traditional "compile time", if that's what you want.
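
For instance, a minimal sketch of such an AOT invocation (the "-device skl" value and the file name app.cpp are placeholders for your actual target and source):

clang++ -fsycl -fsycl-targets=spir64_gen -Xsycl-target-backend=spir64_gen "-device skl" app.cpp -o app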

@gmlueck
Contributor

gmlueck commented May 24, 2022

Do you know if this is the case? Could you clarify the current support for reqd_sub_group_size() in L0?

I think the current behavior is that you get a JIT-time compilation error if a kernel is decorated with a reqd_sub_group_size that the device does not support. You can even get that error if some kernel in the program is defined with that attribute, even if the host code never submits that kernel to any device.

We expect to fix this problem and change the runtime to throw the feature_not_supported exception when we implement this design: https://github.com/intel/llvm/blob/sycl/sycl/doc/design/OptionalDeviceFeatures.md

pvchupin pushed a commit that referenced this issue Jun 4, 2022
This is a solution to #6103 for the CUDA case only. The HIP AMD case still needs to be considered, as discussed here: #6103 (comment).

CUDA currently supports only one subgroup (warp) size, 32, for all devices.
This PR introduces a solution to #6103 appropriate for backends which support only a single subgroup size: if the optional kernel attribute reqd_sub_group_size() is used with the supported subgroup size, it will compile and behave as the programmer intends. If reqd_sub_group_size() is used with an incompatible subgroup size, a warning such as the following is emitted when compiling:

reqd-sub-group-size-cuda.cpp:12:73: warning: attribute argument 8 is invalid and will be ignored; CUDA requires sub_group size 32 [-Wcuda-compat]
    h.single_task<class invalid_kernel>([=] [[sycl::reqd_sub_group_size(8)]] {});
                                                                        ^
Signed-off-by: JackAKirk [email protected]
@JackAKirk JackAKirk added cuda CUDA back-end and removed cuda CUDA back-end labels Jun 14, 2022
@JackAKirk
Contributor

JackAKirk commented Jun 14, 2022

I've removed the CUDA label, although we can keep this issue open: for HIP and other backends we should wait for the more general fix that's coming via the OptionalDeviceFeatures design. A simple warning has been added for CUDA since it's a simple case where only a single subgroup size exists.

@abagusetty
Contributor Author

@JackAKirk A simple warning for now is plenty. Thanks!

@al42and
Contributor

al42and commented May 8, 2023

You can even get that error if some kernel in the program is defined with that attribute, even if the host code does not submit the kernel to any device.

Are there any updates on fixing this?

-fsycl-device-code-split=per_kernel is a perfectly OK workaround, but the default behavior without it is quite counterintuitive: e.g., I am trying to launch a valid 16-wide kernel on an A770, but get a JIT error about the device not supporting 64-wide kernels.
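
(For context, the workaround amounts to one extra flag at build time; sg.cpp refers to the reproducer posted below:)

clang++ -fsycl -fsycl-device-code-split=per_kernel sg.cpp -o sg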

@gmlueck
Contributor

gmlueck commented May 15, 2023

Are there any updates on fixing this?

I believe this should be fixed now in the intel/llvm repo, but only for JIT (SPIR-V) compilations. Are you still seeing an error without -fsycl-device-code-split=per_kernel?

Note that we do not yet have the implementation for AOT compilation (-fsycl-targets=spir64_gen); however, you mention "JIT" above, so I think you are not using AOT.

@al42and
Contributor

al42and commented May 17, 2023

$ clang++ --version
clang version 17.0.0 (https://github.com/intel/llvm 9cab102bf9c6f2f0dfb0113ab1091a3982e746e6)
Target: x86_64-unknown-linux-gnu
Thread model: posix
InstalledDir: /home/aland/intel-sycl/llvm/build/install//bin

$ clang++ -fsycl sg.cpp -o sg && ONEAPI_DEVICE_SELECTOR=opencl:gpu ./sg
Calling 16
terminate called after throwing an instance of 'sycl::_V1::compile_program_error'
  what():  The program was built for 1 devices
Build program log for 'Intel(R) Arc(TM) A770 Graphics':

error: Unsupported required sub group size
in kernel: 'typeinfo name for sycl::_V1::detail::__pf_kernel_wrapper<Kernel<64> >'
error: backend compiler failed build.
 -11 (PI_ERROR_BUILD_PROGRAM_FAILURE)
Aborted (core dumped)
#include <sycl/sycl.hpp>

#include <algorithm> // std::find
#include <cassert>   // assert
#include <iostream>  // std::cout
#include <vector>

template <int subGroupSize> class Kernel;

template <int subGroupSize> void run_kernel(const sycl::device &syclDevice) {
  static const int numThreads = 64;
  std::cout << "Calling " << subGroupSize << std::endl;
  sycl::queue queue = sycl::queue(syclDevice);
  auto buf = sycl::malloc_device<float>(1, queue);
  queue
      .submit([&](sycl::handler &cgh) {
        cgh.parallel_for<Kernel<subGroupSize>>(
            sycl::range<1>{numThreads},
            [=](sycl::id<1> threadId) [[sycl::reqd_sub_group_size(subGroupSize)]] {
              buf[0] = 1;
            });
      })
      .wait_and_throw();
  sycl::free(buf, queue);
  std::cout << "   Done!" << std::endl;
}

int main() {
  std::vector<sycl::device> devices = sycl::device::get_devices();
  for (const auto &dev : devices) {
    std::vector<size_t> subGroupSizes =
        dev.get_info<sycl::info::device::sub_group_sizes>();
    if (subGroupSizes[0] == 32) {
      run_kernel<32>(dev);
    } else if (subGroupSizes[0] == 64) {
      run_kernel<64>(dev);
    } else {
      assert(std::find(subGroupSizes.begin(), subGroupSizes.end(), 16) !=
             subGroupSizes.end());
      run_kernel<16>(dev);
    }
  }
}

@abagusetty
Contributor Author

@al42and This is interesting. I believe DG1 cards mostly support 8, 16, and 32. Can you please verify that this RT error only gets triggered for 16 and not for 8 and 32? It could also be a limitation of the OpenCL plugin. Would you be able to verify this with the L0 plugin as well?

@al42and
Contributor

al42and commented May 17, 2023

I believe it's not related to PI but to how the JIT is invoked. Same behavior with L0: I'm invoking the 16-wide flavor but get an error related to the (instantiated, but not called) 64-wide flavor:

$ clang++ -fsycl sg.cpp -o sg && ONEAPI_DEVICE_SELECTOR=level_zero:gpu ./sg
Calling 16
terminate called after throwing an instance of 'sycl::_V1::compile_program_error'
  what():  The program was built for 1 devices
Build program log for 'Intel(R) Arc(TM) A770 Graphics':

error: Unsupported required sub group size
in kernel: 'typeinfo name for sycl::_V1::detail::__pf_kernel_wrapper<Kernel<64> >'
error: backend compiler failed build.
 -11 (PI_ERROR_BUILD_PROGRAM_FAILURE)
Aborted (core dumped)

Can you please verify that this RT error only gets triggered for 16 and not for 8 and 32?

If I only have 8, 16, and 32-wide kernels, everything works fine since the device supports all of them.

@jzc
Contributor

jzc commented May 18, 2023

Are there any updates on fixing this?

I believe this should be fixed now in the intel/llvm repo, but only for JIT (SPIR-V) compilations. Are you still seeing an error without -fsycl-device-code-split=per_kernel?

Note that we do not yet have the implementation for AOT compilation (-fsycl-targets=spir64_gen); however, you mention "JIT" above, so I think you are not using AOT.

It is WIP. Module splitting based on aspects (e.g. fp16, fp64, atomic64) and on reqd_work_group_size should be implemented in intel/llvm by now, but not splitting based on reqd_sub_group_size, which is why your program fails, @al42and: the 16-, 32-, and 64-wide kernels still get bundled together and compiled for an unsupported platform. There was #8167 by @dm-vodopyanov to implement the split based on sub-group size, but it ran into some issues with ESIMD. Now that PR needs to be adapted to work with the changes made in #8833.

@fwyzard
Contributor

fwyzard commented Jul 6, 2023

It looks like #8167 was superseded by #9928, which was merged last week 👍🏻

Does that mean that it should now be possible to compile kernels with optional features AOT (as long as they are actually called only when those features are available)?

Does it need some extra compilation flags?
