Skip to content

[SYCL] [AMDGPU] Ignore incorrect sub-group size #11687

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 7 commits into from
Nov 21, 2023

Conversation

jchlanda
Copy link
Contributor

CDNA supports only 64 wave front size, for those GPUs allow subgroup size of 64. Some GPUs support both 32 and 64, for those (and the rest) only allow 32.

@jchlanda jchlanda requested a review from a team as a code owner October 27, 2023 09:20
@jchlanda jchlanda temporarily deployed to WindowsCILock October 27, 2023 09:23 — with GitHub Actions Inactive
@jchlanda jchlanda temporarily deployed to WindowsCILock October 27, 2023 10:18 — with GitHub Actions Inactive
CDNA supports only 64 wave front size, for those GPUs set subgroup size
to 64. Some GPUS support both 32 and 64, for those (and the rest) only
allow 32.
@jchlanda jchlanda force-pushed the jakub/amd_sub_group_warning branch from 8c22cd8 to ac489a5 Compare October 30, 2023 17:10
@jchlanda jchlanda temporarily deployed to WindowsCILock October 30, 2023 17:11 — with GitHub Actions Inactive
@jchlanda jchlanda temporarily deployed to WindowsCILock October 30, 2023 17:48 — with GitHub Actions Inactive
@jchlanda
Copy link
Contributor Author

jchlanda commented Nov 1, 2023

Friendly ping @intel/dpcpp-cfe-reviewers

@jchlanda jchlanda temporarily deployed to WindowsCILock November 1, 2023 13:53 — with GitHub Actions Inactive
@jchlanda jchlanda temporarily deployed to WindowsCILock November 1, 2023 14:29 — with GitHub Actions Inactive
@jchlanda
Copy link
Contributor Author

jchlanda commented Nov 8, 2023

Added a test to make sure that the values are indeed ignored: 0a72b7b

Copy link
Contributor

@smanna12 smanna12 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@againull
Copy link
Contributor

againull commented Nov 8, 2023

Test failure:
SYCL :: OptionalKernelFeatures/is_compatible/is_compatible_with_aspects.cpp
seems to be related to the patch.

@jchlanda
Copy link
Contributor Author

jchlanda commented Nov 9, 2023

Test failure: SYCL :: OptionalKernelFeatures/is_compatible/is_compatible_with_aspects.cpp seems to be related to the patch.

That's right, this patch changes the handling of incorrect sub-group attribute, such that it doesn't end up in the resulting binary image.
The way this test works is that it creates a kernel with a known, incorrect required sub group size (INT_MAX). Then expects that kernel to be incompatible with the device, by failing a check against supported sub-group sizes.

There are two ways of thinking about that issue:

I'm not sure which is the correct way of handling those values.
Would you be able to advise @againull @elizabethandrews @smanna12 ?

@elizabethandrews
Copy link
Contributor

elizabethandrews commented Nov 9, 2023

Test failure: SYCL :: OptionalKernelFeatures/is_compatible/is_compatible_with_aspects.cpp seems to be related to the patch.

That's right, this patch changes the handling of incorrect sub-group attribute, such that it doesn't end up in the resulting binary image. The way this test works is that it creates a kernel with a known, incorrect required sub group size (INT_MAX). Then expects that kernel to be incompatible with the device, by failing a check against supported sub-group sizes.

There are two ways of thinking about that issue:

I'm not sure which is the correct way of handling those values. Would you be able to advise @againull @elizabethandrews @smanna12 ?

I don't think it makes sense allowing an invalid subgroup size to pass through if we're emitting the warning. IMO we should either just allow the incorrect value to pass through without a warning and then have the backend deal with it however it is doing so currently, or we do what this PR does - i.e. emit warning and drop the attribute. @AlexeySachkov can you weigh in here? Is there a reason we are passing though invalid subgroup size?

@jchlanda
Copy link
Contributor Author

Any preference on this @AlexeySachkov ?

@jchlanda
Copy link
Contributor Author

Any preference on this @AlexeySachkov ?

@AlexeySachkov friendly ping on this one please.

@AlexeySachkov
Copy link
Contributor

Test failure: SYCL :: OptionalKernelFeatures/is_compatible/is_compatible_with_aspects.cpp seems to be related to the patch.

That's right, this patch changes the handling of incorrect sub-group attribute, such that it doesn't end up in the resulting binary image. The way this test works is that it creates a kernel with a known, incorrect required sub group size (INT_MAX). Then expects that kernel to be incompatible with the device, by failing a check against supported sub-group sizes.
There are two ways of thinking about that issue:

I'm not sure which is the correct way of handling those values. Would you be able to advise @againull @elizabethandrews @smanna12 ?

I don't think it makes sense allowing an invalid subgroup size to pass through if we're emitting the warning. IMO we should either just allow the incorrect value to pass through without a warning and then have the backend deal with it however it is doing so currently, or we do what this PR does - i.e. emit warning and drop the attribute. @AlexeySachkov can you weigh in here? Is there a reason we are passing though invalid subgroup size?

From SYCL spec point of view, there are no compile-time known incorrect sub-group sizes - what we have for AMD and CUDA are implementation details of those backends.

Explicitly requesting sub-group size of a kernel has limited portability by design, but considering how fundamental and narrow the selection of possible sub-group sizes for AMD/CUDA, it might make sense to promote that knowledge into the compiler in form of the suggested warning to improve user experience.

I think that it may actually make sense to pass the invalid sizes through even if we diagnosed them with a warning: this way runtime handling of the attribute for AMD/CUDA backend would be uniform with other backends and users will get more consistent experience. At the same time the warning will let users know in advance that their application won't work and it requires some code changes.

The only concern about the warning I have is that it may produce false alarms: what if there is a kernel which is never submitted to a AMD/CUDA device? The warning will still be there and it may be annoying for someone who uses -Werror. Once we get proper support for optional kernel features for AOT targets it may be even easier to discover that

@jchlanda
Copy link
Contributor Author

Test failure: SYCL :: OptionalKernelFeatures/is_compatible/is_compatible_with_aspects.cpp seems to be related to the patch.

That's right, this patch changes the handling of incorrect sub-group attribute, such that it doesn't end up in the resulting binary image. The way this test works is that it creates a kernel with a known, incorrect required sub group size (INT_MAX). Then expects that kernel to be incompatible with the device, by failing a check against supported sub-group sizes.
There are two ways of thinking about that issue:

I'm not sure which is the correct way of handling those values. Would you be able to advise @againull @elizabethandrews @smanna12 ?

I don't think it makes sense allowing an invalid subgroup size to pass through if we're emitting the warning. IMO we should either just allow the incorrect value to pass through without a warning and then have the backend deal with it however it is doing so currently, or we do what this PR does - i.e. emit warning and drop the attribute. @AlexeySachkov can you weigh in here? Is there a reason we are passing though invalid subgroup size?

From SYCL spec point of view, there are no compile-time known incorrect sub-group sizes - what we have for AMD and CUDA are implementation details of those backends.

Explicitly requesting sub-group size of a kernel has limited portability by design, but considering how fundamental and narrow the selection of possible sub-group sizes for AMD/CUDA, it might make sense to promote that knowledge into the compiler in form of the suggested warning to improve user experience.

I think that it may actually make sense to pass the invalid sizes through even if we diagnosed them with a warning: this way runtime handling of the attribute for AMD/CUDA backend would be uniform with other backends and users will get more consistent experience. At the same time the warning will let users know in advance that their application won't work and it requires some code changes.

Following you suggestion, I've let the invalid value pass through. I've removed a test, since no values are being ignored.

The only concern about the warning I have is that it may produce false alarms: what if there is a kernel which is never submitted to a AMD/CUDA device? The warning will still be there and it may be annoying for someone who uses -Werror. Once we get proper support for optional kernel features for AOT targets it may be even easier to discover that

I feel that it's an edge case, that is outweighed by the benefit of informing the users about the incorrect size, and perhaps we could ignore it for now. If that ever becomes an issue, it would be simple enough to provide a mechanism to silence this particular warning?

@AlexeySachkov
Copy link
Contributor

AlexeySachkov commented Nov 21, 2023

The only concern about the warning I have is that it may produce false alarms: what if there is a kernel which is never submitted to a AMD/CUDA device? The warning will still be there and it may be annoying for someone who uses -Werror. Once we get proper support for optional kernel features for AOT targets it may be even easier to discover that

I feel that it's an edge case, that is outweighed by the benefit of informing the users about the incorrect size, and perhaps we could ignore it for now. If that ever becomes an issue, it would be simple enough to provide a mechanism to silence this particular warning?

I think that we may encounter that sooner than later, but I do agree that benefit of the warning probably outweighs the -Werror concern. I'm fine with the direction of this PR, but we should keep in mind that at some point we will likely need a knob to disable that particular warning.

@al42and
Copy link
Contributor

al42and commented Nov 21, 2023

I feel that it's an edge case, that is outweighed by the benefit of informing the users about the incorrect size, and perhaps we could ignore it for now. If that ever becomes an issue, it would be simple enough to provide a mechanism to silence this particular warning?

As a person compiling with -Werror and for multiple sub-group sizes (toy example), I mostly agree there; being able to suppress the warning is enough.

It looks like currently the required error suppression would be too broad. Originally (#6183), -Wno-cuda-compat was enough, which does not suppress too much relevant warnings. Now, -Wno-unknown-attributes would be needed, which can hide true positives.

@againull againull merged commit 6bce7f6 into intel:sycl Nov 21, 2023
againull pushed a commit that referenced this pull request Nov 29, 2023
#11991)

As a follow up to #11687, this PR adds
a mechanism to silence the warning using dedicated switch
`-Wno-incorrect-sub-group-size` that is wrapped in the `-Wno-attribute`
group.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants