@pjaaskel pjaaskel commented Jun 21, 2022

With the device side abort extension, a work-item (WI) can return from the kernel execution at any point and cause abnormal unrecoverable termination of the host process.

This extension differs from the cl_arm_controlled_kernel_termination extension in that the abort on the device side is expected to behave like a call to the POSIX abort() on the host side, also terminating the host process immediately.

The extension is meant to easily support POSIX abort()-like functionality on the device side, as well as serve as a basis for CUDA/HIP-style assertions.

An example CPU implementation in PoCL: parmance/pocl@d5e88c7

pjaaskel added 3 commits April 8, 2022 13:53
State explicitly that a compliant CPU device implementation can call
abort() directly. Clarified the control flow behavior.
@Kerilk
Contributor

Kerilk commented Jul 12, 2022

Here is the accompanying SPIR-V extension:
KhronosGroup/SPIRV-Registry#149

@bashbaug
Contributor

Discussed in the September 20th teleconference:

  • Use-case: similar to __trap() in CUDA.
  • There are several similar proposals and it would be nice to consolidate into an EXT extension (this PR).
  • There is a related SPIR-V extension (linked above) but it has not been implemented (yet).
  • As-written, the extension will terminate the entire process when a device kernel calls the abort function.
  • There could be use-cases that aren't as catastrophic. Do we want to support recovery, and if so, how recoverable can we be?
  • For example, how does calling abort affect other work-items or work-groups that may be executing? How does calling abort affect other commands in the command-queue that may be dependent on the aborted command?

@pjaaskel
Contributor Author

> There could be use-cases that aren't as catastrophic. Do we want to support recovery, and if so, how recoverable can we be?

Like I suggested in the call, I think these are two different use cases, which could call for separate extensions so that supporting only one of them does not become too difficult. The primary use case for this simple one is to be part of an assert() implementation: to allow easier porting (even automated migration) of host functions that contain assert() or abort() calls to device-side code.
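As a rough illustration of the assert() use case, here is a minimal sketch in plain C. The names `device_abort_hook`, `DEVICE_ASSERT`, `count_abort`, and `checked_div` are hypothetical and not part of the proposed extension; the hook only stands in for whatever device-side abort built-in the extension ends up defining, and it defaults to the standard abort(), which is what a CPU device could call directly.

```c
#include <stdio.h>
#include <stdlib.h>

/* Hypothetical hook standing in for the device-side abort built-in.
   Assumption: on a CPU device it can simply be the standard abort(). */
static void (*device_abort_hook)(void) = abort;

/* Counting replacement for the hook, only so the sketch can be exercised
   without killing the process. */
static int abort_calls = 0;
static void count_abort(void) { ++abort_calls; }

/* assert()-style macro: report the failed condition, then request an abort. */
#define DEVICE_ASSERT(cond)                                        \
    do {                                                           \
        if (!(cond)) {                                             \
            fprintf(stderr, "%s:%d: assertion failed: %s\n",       \
                    __FILE__, __LINE__, #cond);                    \
            device_abort_hook();                                   \
        }                                                          \
    } while (0)

/* A host function with an assert, "ported" without source changes.
   The fallback value 0 is only reachable when the hook returns,
   which the default abort() never does. */
static int checked_div(int a, int b) {
    DEVICE_ASSERT(b != 0);
    return b != 0 ? a / b : 0;
}
```

With the default hook, `checked_div(1, 0)` prints the failed condition and terminates the process; assigning `count_abort` to `device_abort_hook` lets the call return so the behavior can be checked in a test.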

> For example, how does calling abort affect other work-items or work-groups that may be executing? How does calling abort affect other commands in the command-queue that may be dependent on the aborted command?

In this extension, the expected behavior for commands is the same as with any multithreaded program where one thread calls the standard abort(). Other parallel threads (commands) might have proceeded further or not, but the end result is either catching SIGABRT or brutally killing the process along with its threads (in this case also device/GPU threads).
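The multithreaded analogy can be sketched on a POSIX host as follows. This is only an illustration of the host-side behavior being referred to, not of any OpenCL implementation: the function and worker names are made up, and the child process stands in for the application whose "commands" are plain threads.

```c
#include <pthread.h>
#include <signal.h>
#include <stdlib.h>
#include <sys/resource.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Stand-in for a command that keeps running and never aborts. */
static void *busy_worker(void *arg) {
    (void)arg;
    for (;;)
        pause();
}

/* Stand-in for the work-item that calls abort(). */
static void *aborting_worker(void *arg) {
    (void)arg;
    abort();
    return NULL; /* not reached */
}

/* Returns 1 if a child process with several running threads was terminated
   by SIGABRT after one thread called abort(), 0 if it ended some other way,
   and -1 on error. */
int child_dies_from_sigabrt(void) {
    pid_t pid = fork();
    if (pid < 0)
        return -1;
    if (pid == 0) {
        struct rlimit no_core = {0, 0};
        pthread_t busy[3], aborter;
        setrlimit(RLIMIT_CORE, &no_core); /* avoid leaving a core dump */
        for (int i = 0; i < 3; ++i)
            if (pthread_create(&busy[i], NULL, busy_worker, NULL) != 0)
                _exit(2);
        if (pthread_create(&aborter, NULL, aborting_worker, NULL) != 0)
            _exit(2);
        pthread_join(aborter, NULL); /* the process dies before this returns */
        _exit(3);
    }
    int status = 0;
    if (waitpid(pid, &status, 0) != pid)
        return -1;
    return WIFSIGNALED(status) && WTERMSIG(status) == SIGABRT;
}
```

The busy threads never get a chance to finish their work: the whole child process goes down with SIGABRT, which is the outcome the extension expects for the host process and its device threads.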

The WI semantics I tried to describe in the last paragraph: https://github.com/KhronosGroup/OpenCL-Docs/pull/808/files#diff-149e893d23663ca01188af6d03a0ebd77bae5776abefe7ec6b063e4bbd88212fR103

@bashbaug
Contributor

Discussed in the December 17th teleconference. Some of this functionality appears to be implemented in Mesa:

https://www.phoronix.com/news/OpenCL-C-Std-Lib-Mesa-25.0
https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/32529

We should see what it would take to get this across the finish line.

@pjaaskel
Contributor Author

@karolherbst do you have feedback on this from the perspective of Mesa/Rusticl?


The results of an aborted kernel command are undefined. Device side abort is
guaranteed to stop the kernel command as soon as possible, even in the presence
of deadlocks or partially reached barriers in the kernel command.
Contributor

Unsure, to be honest. I think it's a good idea in principle, but it does change certain implementation details. For example, if a thread calling abort_platform causes a deadlock, it might lead a kernel driver to time out the GPU job, and that alone is undefined behavior, not just "undefined results".

Some implementations might just not be able to terminate the running kernel without undefined behavior. But now this extension actually defines it: it needs to terminate at some point.

I think calling abort_platform in itself causes undefined behavior and runtimes can only guarantee anything that happens afterwards on a best effort basis.

Contributor Author

This extension's semantics are based on those of libc's abort() and CUDA's __trap(). It would be useful to be able to implement both of those with this extension.

I'm not sure what CUDA's "generate an interrupt to the host CPU" should imply, but to me the natural semantics here is to abort the whole (host) process and bring down the GPU resources, similar to killing a threaded CPU-only host application with multiple threads running and one of them calling abort().

Contributor

@karolherbst karolherbst Jul 25, 2025

Right, I'm not concerned about Nvidia being able to reliably implement such a functionality, I'm more concerned about giving too strong guarantees for all the other GPU hardware out there. Nvidia has a BPT instruction which can create a trap signal that is either handled by a GPU queue configured trap handler (used to implement a debugger, e.g. single-stepping) or raised to the OS. So the OS can just handle that signal and deal with it by e.g. forcing a termination of the running workload.

I'm not sure every other GPU vendor has something similar. Maybe it's good enough to make the behavior after abort_platform implementation defined, but urge every implementation to at least try to guarantee the things mandated by this extension?

Contributor Author

I'd rather see well-defined behavior, and if the device cannot support it, then it just doesn't advertise support for the extension. My guess is that both AMD and Intel have some sort of features that could be used to implement the "canceling" of the GPU workload when signaled by the device-side program (e.g. via a system-shared global variable), even in the presence of kernel deadlocks.

Contributor

AMD has a pretty aggressive way of dealing with GPU-side timeouts, which often leads to full GPU resets impacting other parts of the system, e.g. your desktop session could also just die.

I'm sure Intel will manage, but I have my doubts about AMD guaranteeing no abnormal behavior of the system after abort_platform causes deadlocks, invalid memory accesses, etc.

Contributor Author

chipStar implements (or works around the lack of) this extension by means of a host-shared variable which a host thread polls and, if the device has set it to 1, calls abort(). Do you suggest that even if the process is aborted this way (thus invoking the OS resource cleanup) there is a risk of instability in AMD's case? HIP provides a device reset API. Something like that could be invoked from the host side when a device abort is detected, for a cleaner ramp-down.
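The polling scheme described above could look roughly like the following sketch. It is not chipStar's actual code: all names are hypothetical, an atomic variable stands in for the host-shared memory, and an ordinary function call on the host stands in for the device-side work-item. The action taken on detection is a function pointer that defaults to abort(), so the sketch can be exercised without killing the process.

```c
#include <pthread.h>
#include <sched.h>
#include <stdatomic.h>
#include <stdlib.h>

/* Hypothetical flag standing in for a host-shared variable that device
   code sets to 1 to request an abort. */
static atomic_int abort_requested;
/* Set by the host once the command has finished without an abort. */
static atomic_int stop_watcher;
/* Action taken when the request is seen; defaults to the standard abort(). */
static void (*on_abort)(void) = abort;

/* Host thread that polls the shared flag. */
static void *watch_abort_flag(void *arg) {
    (void)arg;
    for (;;) {
        if (atomic_load(&abort_requested) == 1) {
            on_abort();
            break;
        }
        if (atomic_load(&stop_watcher))
            break;
        sched_yield();
    }
    return NULL;
}

/* Stand-in for a work-item: negative input triggers the abort request. */
static void fake_work_item(int input) {
    if (input < 0)
        atomic_store(&abort_requested, 1);
}

static atomic_int handler_fired;
static void record_abort(void) { atomic_store(&handler_fired, 1); }

/* Runs one fake work-item with a recording handler instead of abort().
   Returns 1 if the watcher saw the request, 0 if not, -1 on error. */
int run_demo(int input) {
    pthread_t watcher;
    atomic_store(&abort_requested, 0);
    atomic_store(&stop_watcher, 0);
    atomic_store(&handler_fired, 0);
    on_abort = record_abort;
    if (pthread_create(&watcher, NULL, watch_abort_flag, NULL) != 0)
        return -1;
    fake_work_item(input);
    /* No abort requested: the command completed, so stop polling. */
    if (atomic_load(&abort_requested) == 0)
        atomic_store(&stop_watcher, 1);
    pthread_join(watcher, NULL);
    return atomic_load(&handler_fired);
}
```

Leaving `on_abort` at its default turns a detected request into a real abort() of the host process, which then relies on OS resource cleanup; a device reset or similar ramp-down step would go in the handler before that call.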

Contributor

From my experience with AMD, hipDeviceReset sounds like it's indeed a full device reset, resetting it for every application running on the system :)

Contributor

> chipStar implements (or workarounds the lack of) this extension by means of a host-shared variable which a host thread polls and if set to 1 by the device, calls abort(). Do you suggest that even if the process is aborted this way (thus invoking the OS resource cleanup) there is a risk of instability in AMD's case?

Only if the kernel itself is misbehaving as a result. If the kernel just terminates normally, even after one thread quits, there is no danger. The issue only starts once it times out, causes invalid memory accesses, etc.

For possible WIs that do not call abort, the return time is undefined and can
adhere to the original control flow of the kernel as long as all the WIs
eventually finish the execution and allow the host runtime to abort the
process also in a single-threaded/non-interrupting implementation.
Contributor

The phrasing here makes it a bit hard to grasp what's actually meant.

Assume we have control flow that is well defined with all threads participating, but now a single thread calls abort_platform and quits. If that causes an infinite loop, an implementation might be able to guarantee that "the return time is undefined and can adhere to the original control flow of the kernel". However, this extension specifies that an implementation is only allowed to do so "as long as all the WIs eventually finish the execution and allow the host runtime to abort the process". What if they can't, because the one thread is causing issues?

Is it legal for an implementation to time out the kernel invocation, which might even cause system level problems, like a full GPU reset or even worse, a system reboot, in which case you won't get to see any printf buffer's contents?

Contributor Author

You are right. This paragraph conflicts with the ability to abort kernel commands that have deadlocked (due to a partially reached barrier or some other reason). I think it could simply be removed while still fulfilling the goal of being able to implement libc abort() and CUDA __trap().

of a device-side abort with a CPU device can simply call the standard abort()
directly to implement compliant behavior.

The results of an aborted kernel command are undefined. Device side abort is
Contributor Author

Suggested change
- The results of an aborted kernel command are undefined. Device side abort is
+ The results produced by an aborted kernel command are undefined. Device side abort is

Comment on lines +106 to +109
For possible WIs that do not call abort, the return time is undefined and can
adhere to the original control flow of the kernel as long as all the WIs
eventually finish the execution and allow the host runtime to abort the
process also in a single-threaded/non-interrupting implementation.
Contributor Author

Suggested change
- For possible WIs that do not call abort, the return time is undefined and can
- adhere to the original control flow of the kernel as long as all the WIs
- eventually finish the execution and allow the host runtime to abort the
- process also in a single-threaded/non-interrupting implementation.
+ For possible other WIs that do not call abort can run for an undefined and
+ adhere to the original control flow of the kernel until the host detects the abort
+ request and ramps down the execution. Work-items misbehaving (for example
+ by means of illegal memory accesses) after the abort has been called result in undefined behavior, as normally.

@karolherbst could something along these lines cover your concerns?

Contributor

small typo: "Possible other WIs that do not call abort can run for an undefined time ..."

But yeah, I think this covers it pretty well.
