an extension that adds a device side abort function #808

pjaaskel · 2022-06-21T12:11:31Z

With the device side abort extension, a work-item (WI) can return from the kernel execution at any point and cause abnormal unrecoverable termination of the host process.

This extension differs from the cl_arm_controlled_kernel_termination extension in that the abort on the device side is expected to behave like a call to the POSIX abort() on the host side, also terminating the host process immediately.

The extension is meant to easily support POSIX abort()-like functionality on the device side, as well as serve as a basis for CUDA/HIP-style assertions.

An example CPU implementation in PoCL: parmance/pocl@d5e88c7

State explicitly that a compliant CPU device implementation can call abort() directly. Clarified the control flow behavior.

Kerilk · 2022-07-12T17:12:02Z

Here is the accompanying SPIRV extension:
KhronosGroup/SPIRV-Registry#149

bashbaug · 2022-09-22T00:41:02Z

Discussed in the September 20th teleconference:

Use-case: similar to __trap() in CUDA.
There are several similar proposals and it would be nice to consolidate into an EXT extension (this PR).
There is a related SPIR-V extension (linked above) but it has not been implemented (yet).
As-written, the extension will terminate the entire process when a device kernel calls the abort function.
There could be use-cases that aren't as catastrophic. Do we want to support recovery, and if so, how recoverable can we be?
For example, how does calling abort affect other work-items or work-groups that may be executing? How does calling abort affect other commands in the command-queue that may be dependent on the aborted command?

pjaaskel · 2022-09-22T07:33:54Z

> There could be use-cases that aren't as catastrophic. Do we want to support recovery, and if so, how recoverable can we be?

As I suggested in the call, I think these are two different use cases which could call for separate extensions, so that it is not too difficult to support only one of them. The primary use case for this simple one is to be part of an assert() implementation: to allow easier porting (even automated migration) of host functions that contain assert() or abort() calls to device-side executed code.
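
A minimal host-side sketch of the assert() use case described above, assuming the device-side abort is reachable through a plain function. The names `abort_hook`, `DEVICE_ASSERT`, `assert_fail` and `count_abort` are hypothetical and do not come from the extension text; the hook defaults to the standard abort(), which is what a CPU device could plausibly call.

```c
#include <stdio.h>
#include <stdlib.h>

/* Hypothetical stand-in for the device-side abort entry point.
 * By default it is the standard abort(). */
typedef void (*abort_hook_t)(void);
static abort_hook_t abort_hook = abort;

/* Optional counting hook for dry runs: records the request instead of
 * terminating the process. */
static int abort_requests = 0;
static void count_abort(void)
{
    abort_requests++;
}

/* Report the failed expression, then hand control to the abort hook. */
static void assert_fail(const char *expr, const char *file, int line)
{
    fprintf(stderr, "%s:%d: assertion `%s' failed\n", file, line, expr);
    abort_hook();
}

/* assert()-like macro built on top of the abort hook. */
#define DEVICE_ASSERT(expr) \
    ((expr) ? (void)0 : assert_fail(#expr, __FILE__, __LINE__))
```

With the default hook, a failing `DEVICE_ASSERT(x > 0)` prints the message and terminates the process; setting `abort_hook = count_abort` only counts the request, which is useful when experimenting.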

> For example, how does calling abort affect other work-items or work-groups that may be executing? How does calling abort affect other commands in the command-queue that may be dependent on the aborted command?

In this extension, the expected behavior for commands is the same as with any multithreaded program where one thread calls the standard abort(). Other parallel threads (commands) might have proceeded further or not, but the end result is either catching SIGABRT or brutally killing the process along with its threads (in this case also device/GPU threads).

The WI semantics I tried to describe in the last paragraph: https://github.com/KhronosGroup/OpenCL-Docs/pull/808/files#diff-149e893d23663ca01188af6d03a0ebd77bae5776abefe7ec6b063e4bbd88212fR103
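
The "catching SIGABRT" side of the host behavior mentioned above can be sketched in standard C as follows. This is only an illustration under the assumption that the device abort ends up as a regular abort() in the host process; the names `on_sigabrt`, `sigabrt_seen` and `install_sigabrt_handler` are made up for the example.

```c
#include <signal.h>

/* Set from the signal handler; sig_atomic_t is safe to write there. */
static volatile sig_atomic_t sigabrt_seen = 0;

/* Handler: only records that SIGABRT arrived. If the signal came from
 * abort(), the process is still terminated after the handler returns. */
static void on_sigabrt(int signo)
{
    (void)signo;
    sigabrt_seen = 1;
}

/* Install the handler; returns 0 on success, -1 on failure. */
static int install_sigabrt_handler(void)
{
    return signal(SIGABRT, on_sigabrt) == SIG_ERR ? -1 : 0;
}
```

A handler like this gives the application a last chance to note the abort, but it cannot turn abort() into a recoverable event, which matches the non-recoverable semantics discussed here.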

bashbaug · 2024-12-17T19:23:15Z

Discussed in the December 17th teleconference. Some of this functionality appears to be implemented in Mesa:

https://www.phoronix.com/news/OpenCL-C-Std-Lib-Mesa-25.0
https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/32529

We should see what it would take to get this across the finish line.

pjaaskel · 2025-07-24T14:51:55Z

@karolherbst do you have feedback on this from the perspective of Mesa/Rusticl?

karolherbst · 2025-07-24T17:28:56Z

extensions/cl_ext_device_side_abort.asciidoc

+
+The results of an aborted kernel command are undefined. Device side abort is
+guaranteed to stop the kernel command as soon as possible, even in the presence
+of deadlocks or partially reached barriers in the kernel command.


Unsure, to be honest. I think it's a good idea in principle, but it does change certain implementation details. For example, if a thread calling abort_platform causes a deadlock, it might lead the kernel driver to time out the GPU job, and that alone is undefined behavior, not just "undefined results".

Some implementations might just not be able to terminate the running kernel without undefined behavior. But now this extension actually defines it: it needs to terminate at some point.

I think calling abort_platform in itself causes undefined behavior and runtimes can only guarantee anything that happens afterwards on a best effort basis.

The basis for this extension's semantics is that of libc's abort() and CUDA's __trap(). It would be useful to be able to implement both of those with this extension.

I'm not sure what CUDA's "generate an interrupt to the host CPU" should imply, but to me the natural semantics here is to abort the whole (host) process and bring down the GPU resources, similar to killing a threaded CPU-only host application with multiple threads running and one of them calling abort().

Right, I'm not concerned about Nvidia being able to reliably implement such functionality; I'm more concerned about giving too strong guarantees for all the other GPU hardware out there. Nvidia has a BPT instruction which can create a trap signal that is either handled by a trap handler configured on the GPU queue (used to implement a debugger, e.g. single-stepping) or raised to the OS. So the OS can just handle that signal and deal with it by e.g. forcing a termination of the running workload.

I'm not sure every other GPU vendor has something similar. Maybe it's good enough to make the behavior after abort_platform implementation defined, but urge every implementation to at least try to guarantee the things mandated by this extension?

I'd rather see well-defined behavior and if the device cannot support it, then it just doesn't advertise support for the extension. My guess is that both AMD and Intel have some sort of features that could be used to implement the "canceling" of the GPU workload when signaled by the device side program (e.g. via a system shared global variable), even in presence of kernel deadlocks.

AMD has a pretty aggressive way of dealing with GPU side timeouts, which often lead to full GPU resets impacting other parts of the system, e.g. your desktop session could also just die.

I'm sure Intel will manage, I have my doubts about AMD guaranteeing no abnormal behavior of the system after abort_platform causes dead-locks, invalid memory accesses, etc...

chipStar implements (or works around the lack of) this extension by means of a host-shared variable which a host thread polls; if the device sets it to 1, the host thread calls abort(). Do you suggest that even if the process is aborted this way (thus invoking the OS resource cleanup) there is a risk of instability in AMD's case? HIP provides a device reset API. Something like that could be invoked from the host side when a device abort is detected, for a cleaner ramp-down.

From my experience with AMD, hipDeviceReset sounds like it's indeed a full device reset, resetting it for every application running on the system :)

> chipStar implements (or works around the lack of) this extension by means of a host-shared variable which a host thread polls; if the device sets it to 1, the host thread calls abort(). Do you suggest that even if the process is aborted this way (thus invoking the OS resource cleanup) there is a risk of instability in AMD's case?

Only if the kernel itself is misbehaving as a result. If the kernel just terminates normally, even after one thread quits, there is no danger. The issue only starts once the kernel times out, causes invalid memory accesses, etc.
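
The host-polled flag workaround described above for chipStar can be sketched roughly like this. It is not chipStar's actual code: the flag is an ordinary C11 atomic standing in for host/device shared memory, and `shared_abort_flag`, `device_request_abort` and `host_wait_for_abort` are hypothetical names.

```c
#include <stdatomic.h>

/* Stand-in for a variable visible to both host and device. */
typedef struct {
    atomic_int abort_requested;
} shared_abort_flag;

/* What a work-item would do instead of calling a native abort:
 * set the shared flag to 1. */
static void device_request_abort(shared_abort_flag *f)
{
    atomic_store_explicit(&f->abort_requested, 1, memory_order_release);
}

/* Host-side polling loop. Returns 1 as soon as the flag is seen set,
 * or 0 after max_polls unsuccessful polls. A negative max_polls means
 * "poll until the flag is set". */
static int host_wait_for_abort(shared_abort_flag *f, long max_polls)
{
    for (long i = 0; max_polls < 0 || i < max_polls; ++i) {
        if (atomic_load_explicit(&f->abort_requested,
                                 memory_order_acquire) == 1)
            return 1;
    }
    return 0;
}
```

A host watchdog thread would then call abort() when `host_wait_for_abort(&flag, -1)` returns 1. Note that the kernel keeps running until the host notices the flag, so a kernel that deadlocks or misbehaves in the meantime is still a problem, as discussed above.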

karolherbst · 2025-07-24T17:34:42Z

extensions/cl_ext_device_side_abort.asciidoc

+For possible WIs that do not call abort, the return time is undefined and can
+adhere to the original control flow of the kernel as long as all the WIs
+eventually finish the execution and allow the host runtime to abort the
+process also in a single-threaded/non-interrupting implementation.


The phrasing here makes it a bit hard to grasp what is actually meant.

Assume we have control flow that is well defined with all threads participating, but now a single thread calls abort_platform and quits. If that causes an infinite loop, an implementation might be able to guarantee that "the return time is undefined and can adhere to the original control flow of the kernel". However, this extension specifies that an implementation is only allowed to do so "as long as all the WIs eventually finish the execution and allow the host runtime to abort the process". What if they can't, because the one thread is causing issues?

Is it legal for an implementation to time out the kernel invocation, which might even cause system level problems, like a full GPU reset or even worse, a system reboot, in which case you won't get to see any printf buffer's contents?

You are right. This paragraph is in conflict with the ability to abort kernel commands that have deadlocked (because of a partially reached barrier or for some other reason). I think it could simply be removed while still fulfilling the goal of being able to implement libc abort and CUDA trap.

pjaaskel · 2025-07-28T11:36:05Z

extensions/cl_ext_device_side_abort.asciidoc

+of a device-side abort with a CPU device can simply call the standard abort()
+directly to implement compliant behavior.
+
+The results of an aborted kernel command are undefined. Device side abort is


Suggested change

-The results of an aborted kernel command are undefined. Device side abort is
+The results produced by an aborted kernel command are undefined. Device side abort is

pjaaskel · 2025-07-28T11:41:24Z

extensions/cl_ext_device_side_abort.asciidoc

+For possible WIs that do not call abort, the return time is undefined and can
+adhere to the original control flow of the kernel as long as all the WIs
+eventually finish the execution and allow the host runtime to abort the
+process also in a single-threaded/non-interrupting implementation.


Suggested change

-For possible WIs that do not call abort, the return time is undefined and can
-adhere to the original control flow of the kernel as long as all the WIs
-eventually finish the execution and allow the host runtime to abort the
-process also in a single-threaded/non-interrupting implementation.
+For possible other WIs that do not call abort can run for an undefined and
+adhere to the original control flow of the kernel until the host detects the abort
+request and ramps down the execution. Work-items misbehaving (for example
+by means of illegal memory accessess) after the abort has been called result in undefined behavior, as normally.

@karolherbst could something along these lines cover your concerns?

small typo: "Possible other WIs that do not call abort can run for an undefined time ..."

But yeah, I think this covers it pretty well.

pjaaskel added 3 commits April 8, 2022 13:53

cl_pocl_device_side_abort rev 1

68406ef

Device side abort updates and clarifications

a05784e

State explicitly that a compliant CPU device implementation can call abort() directly. Clarified the control flow behavior.

Change the device side abort to 'ext' extension

152333e

pjaaskel mentioned this pull request Jun 21, 2022

an extension that adds a device side abort function KhronosGroup/SPIRV-Registry#149

Draft

pjaaskel mentioned this pull request Jul 24, 2025

Unreachable control flow: PoCL executes unreachable instructions pocl/pocl#1971

Open

