pjaaskel
Contributor

This extension adds a new built-in function to perform barrier synchronization across the work-group even if some of the work-items are not "alive" anymore due to having returned from the kernel.
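The difference between the two barrier semantics can be sketched as a release condition. This is only an illustrative host-side model in C, not the extension's API: the struct and both function names are hypothetical and do not appear in the PR.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical model of one work-group's barrier state (names are
   illustrative assumptions, not taken from the extension). */
typedef struct {
    size_t group_size; /* work-items launched in the work-group */
    size_t exited;     /* work-items that already returned from the kernel */
    size_t arrived;    /* work-items currently waiting at the barrier */
} wg_barrier_state;

/* Conventional semantics: every work-item of the group must arrive. */
static bool classic_barrier_released(const wg_barrier_state *s) {
    return s->arrived == s->group_size;
}

/* "Alive only" semantics: work-items that returned are not waited for. */
static bool alive_barrier_released(const wg_barrier_state *s) {
    assert(s->exited <= s->group_size);
    return s->arrived == s->group_size - s->exited;
}
```

In this model, a work-group of 8 where 3 work-items have returned and the other 5 wait at the barrier is released only under the "alive only" condition; the conventional condition can never become true because the returned work-items never arrive.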

@pjaaskel
Contributor Author

Ping @kpet @bashbaug @Kerilk @karolherbst.

@karolherbst
Contributor

karolherbst commented May 19, 2025

I wonder if this is necessary and if the OpenCL spec could be relaxed instead here. The OpControlBarrier SPIR-V instruction already has this definition that it only waits on active invocations (see khronos internal SPIR-V MRs 280 and 329) and I'm not aware of any hardware that behaves any differently.

@pjaaskel
Contributor Author

@karolherbst: interesting. I didn't know SPIR-V changed the barrier semantics in v1.7. I don't see the wording spelled out explicitly in the spec.

This is a pretty drastic change, which basically makes v1.7 backwards incompatible with v1.6 for targets that do not implement the "active/alive only" semantics. There could be devices we don't know of where it's (significantly) more expensive to implement. Also, vectorizing the work-groups of kernels with such barriers on CPU/SIMD targets, especially on non-predicated vector ISAs, induces overhead. The cases should be compile-time analyzable, though.

@karolherbst
Contributor

    @karolherbst: interesting. I didn't know SPIR-V changed the barrier semantics in v1.7. I don't see the wording spelled out explicitly in the spec.

    This is a pretty drastic change, which basically makes v1.7 backwards incompatible with v1.6 for targets that do not implement the "active/alive only" semantics. There could be devices we don't know of where it's (significantly) more expensive to implement. Also, vectorizing the work-groups of kernels with such barriers on CPU/SIMD targets, especially on non-predicated vector ISAs, induces overhead. The cases should be compile-time analyzable, though.

I doubt it's problematic for anything that is not a CPU, because the threading model there is entirely different and is closer to masked/predicated SIMD instructions. But maybe it's best to discuss this at the WG meeting and ask everyone to check whether they see any problems with it from a hardware perspective.

It would be a bit problematic for CPU implementations, though, so for those it might make sense to keep it explicit.

@bashbaug
Contributor

Couple of thoughts and corrections:

  1. There is no SPIR-V 1.7 (at least not yet!), and these updates were made in a revision to SPIR-V 1.6. I believe the change to the SPIR-V spec was necessary to allow for some types of Subgroup barriers, which may be placed in non-uniform control flow on some Vulkan devices.
  2. Reading through internal SPIR-V MR 280, it sounds like the SPIR-V spec update was supposed to be paired with an update to the client API environment specifications to keep the current behavior, but it doesn't look like that happened. There is a validation rule for Subgroup scope barriers, but not one for Workgroup scope:

    In all OpenCL environments, for the Barrier Instruction OpControlBarrier, when the Scope for Execution is Subgroup, behavior is undefined unless all invocations in the sub-group execute the same dynamic instance of the instruction.

  3. At least some of our GPUs would not be able to support this "alive only barrier", so we would not be able to simply relax the specification and allow this behavior unconditionally.

If this is correct, we should file an issue to add the right validation rule for Workgroup scope barriers in the OpenCL SPIR-V environment spec as well.

@karolherbst
Contributor

    3. At least some of our GPUs would not be able to support this "alive only barrier", so we would not be able to simply relax the specification and allow this behavior unconditionally.

I assume it's a problem on older ones?

@pjaaskel
Contributor Author

@bashbaug thanks for the clarifications. I suggest we start with a new built-in and consider converting it to a main spec requirement in the future when there are no more relevant devices where requiring the semantics is a problem.

If these semantics became the default barrier semantics, CPU/SIMD vectorization would need a bit of control-flow analysis to detect the cases where predication is not needed. I don't think that is much to worry about in most cases.
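To illustrate the predication cost, here is a minimal sketch of how a CPU implementation might run one work-group serially, with the kernel split into loops at the barrier. It is an assumption-laden toy, not how any particular implementation works: `WG_SIZE`, `run_work_group`, and the toy kernel itself are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define WG_SIZE 8 /* hypothetical fixed work-group size */

/* Runs one work-group of a toy kernel on a single CPU thread.
   Toy kernel: work-items with lid >= n return early; the rest store
   lid * 10 to local memory, cross the barrier, then read the value
   written by the next alive work-item (wrapping around). */
static void run_work_group(size_t n, int out[WG_SIZE]) {
    assert(n <= WG_SIZE);
    int local_mem[WG_SIZE] = {0};
    bool alive[WG_SIZE];

    /* Region before the barrier: record which work-items returned. */
    for (size_t lid = 0; lid < WG_SIZE; ++lid) {
        alive[lid] = lid < n;
        if (!alive[lid])
            continue; /* early return from the kernel */
        local_mem[lid] = (int)lid * 10;
    }

    /* The barrier is implicit: the first loop finished for all
       work-items before the second loop starts. */

    /* Region after the barrier: predicated on the alive mask. */
    for (size_t lid = 0; lid < WG_SIZE; ++lid) {
        if (!alive[lid])
            continue;
        out[lid] = local_mem[(lid + 1) % n];
    }
}
```

The `alive` mask and the check in the second loop are the overhead in question. If control-flow analysis shows that no work-item can return before the barrier, both can be dropped and the second loop becomes a plain loop that is easy to vectorize.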

I see this being used only when generating code from inputs whose source language might have these semantics (HIP/CUDA). Even in those cases it makes sense to analyze the kernel's control flow first to find out whether it really needs the semantics.
