an extension that adds a device side abort function #808
State explicitly that a compliant CPU device implementation can call abort() directly. Clarified the control flow behavior.
Here is the accompanying SPIR-V extension:
Discussed in the September 20th teleconference:
As I suggested in the call, I think these are two different use cases that could call for separate extensions, so that supporting only one of them does not become too difficult. The primary use case for this simple one is to be part of an assert() implementation: to allow easier porting (even automated migration) of host functions that contain assert() or abort() calls to device-side executed code.
In this extension, the expected behavior for commands is the same as in any multithreaded program where one thread calls the standard abort(). Other parallel threads (commands) might or might not have proceeded further, but the end result is either catching SIGABRT or brutally killing the process along with its threads (in this case also device/GPU threads). I tried to describe the WI semantics in the last paragraph: https://github.com/KhronosGroup/OpenCL-Docs/pull/808/files#diff-149e893d23663ca01188af6d03a0ebd77bae5776abefe7ec6b063e4bbd88212fR103
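The host-side behavior referred to above can be observed with a small POSIX program. This is only a sketch of the CPU analogy, not of any OpenCL implementation; the function and worker names are hypothetical. A child process starts one thread that never finishes and one that calls abort(), and the parent reports which signal ended the child.

```c
#include <pthread.h>
#include <signal.h>
#include <stdlib.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Stands in for a work-item that never calls abort and never finishes. */
static void *spinning_worker(void *arg) {
    (void)arg;
    for (;;)
        pause(); /* sleep until a signal arrives */
    return NULL;
}

/* Stands in for the work-item that calls abort. */
static void *aborting_worker(void *arg) {
    (void)arg;
    abort(); /* raises SIGABRT for the whole process */
    return NULL;
}

/* Forks a child that runs both workers. Returns the number of the signal
   that terminated the child, 0 if it exited normally, or -1 on error. */
int run_child_and_report_signal(void) {
    pid_t pid = fork();
    if (pid < 0)
        return -1;
    if (pid == 0) {
        pthread_t spin, ab;
        if (pthread_create(&spin, NULL, spinning_worker, NULL) != 0)
            _exit(2);
        if (pthread_create(&ab, NULL, aborting_worker, NULL) != 0)
            _exit(2);
        /* Never returns: abort() ends the process, spinning thread included. */
        pthread_join(spin, NULL);
        _exit(0);
    }
    int status = 0;
    if (waitpid(pid, &status, 0) < 0)
        return -1;
    return WIFSIGNALED(status) ? WTERMSIG(status) : 0;
}
```

Calling run_child_and_report_signal() from a host program is expected to return SIGABRT even though one thread in the child would otherwise run forever.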
Discussed in the December 17th teleconference. Some of this functionality appears to be implemented in Mesa: https://www.phoronix.com/news/OpenCL-C-Std-Lib-Mesa-25.0 We should see what it would take to get this across the finish line.
@karolherbst do you have feedback on this from the perspective of Mesa/Rusticl?
The results of an aborted kernel command are undefined. Device side abort is
guaranteed to stop the kernel command as soon as possible, even in the presence
of deadlocks or partially reached barriers in the kernel command.
Unsure to be honest. I think it's a good idea in principle, but it does change certain implementation details. For example, if a thread calling abort_platform causes a dead-lock, it might lead a kernel driver to time out the GPU job, and that alone is undefined behavior, not just "undefined results".
Some implementations might just not be able to terminate the running kernel without undefined behavior. But now this extension actually defines it: it needs to terminate at some point.
I think calling abort_platform in itself causes undefined behavior and runtimes can only guarantee anything that happens afterwards on a best effort basis.
This extension's semantics are based on those of libc's abort() and CUDA's __trap(). It would be useful to be able to implement both of those with this extension.
I'm not sure what CUDA's "generate an interrupt to the host CPU" should imply, but to me the natural semantics here is to abort the whole (host) process and bring down the GPU resources, similar to killing a threaded CPU-only host application with multiple threads running and one of them calling abort().
Right, I'm not concerned about Nvidia being able to reliably implement such functionality, I'm more concerned about giving too strong guarantees for all the other GPU hardware out there. Nvidia has a BPT instruction which can create a trap signal that is either handled by a GPU queue configured trap handler (used to implement a debugger, e.g. single-stepping) or raised to the OS. So the OS can just handle that signal and deal with it by e.g. forcing a termination of the running workload.
I'm not sure every other GPU vendor has something similar. Maybe it's good enough to make the behavior after abort_platform implementation defined, but urge every implementation to at least try to guarantee the things mandated by this extension?
I'd rather see well-defined behavior, and if the device cannot support it, then it just doesn't advertise support for the extension. My guess is that both AMD and Intel have some sort of feature that could be used to implement the "canceling" of the GPU workload when signaled by the device side program (e.g. via a system shared global variable), even in the presence of kernel deadlocks.
AMD has a pretty aggressive way of dealing with GPU side timeouts, which often leads to full GPU resets impacting other parts of the system, e.g. your desktop session could also just die.
I'm sure Intel will manage, but I have my doubts about AMD guaranteeing no abnormal behavior of the system after abort_platform causes dead-locks, invalid memory accesses, etc.
chipStar implements (or works around the lack of) this extension by means of a host-shared variable which a host thread polls and, if it has been set to 1 by the device, calls abort(). Do you suggest that even if the process is aborted this way (thus invoking the OS resource cleanup) there is a risk of instability in AMD's case? HIP provides a device reset API. Something like that could be invoked from the host side when device abort is detected for a cleaner ramp-down.
From my experience with AMD, hipDeviceReset sounds like it's indeed a full device reset, resetting it for every application running on the system :)
> chipStar implements (or works around the lack of) this extension by means of a host-shared variable which a host thread polls and, if it has been set to 1 by the device, calls abort(). Do you suggest that even if the process is aborted this way (thus invoking the OS resource cleanup) there is a risk of instability in AMD's case?
Only if the kernel itself is misbehaving as a result. If the kernel just terminates normally, even after one thread quits, there is no danger. The problems only start once it times out, causes invalid memory accesses, etc.
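The polling workaround described in this thread can be sketched on the host side as follows. This is not chipStar's code: every name is hypothetical, a plain atomic stands in for the host-shared variable (which in a real runtime would live in memory the device can write), and a counting handler is included only so the sketch can be exercised without killing the process.

```c
#include <pthread.h>
#include <sched.h>
#include <stdatomic.h>
#include <stdlib.h>

/* Stand-in for the host-shared variable the device sets to 1 (assumption). */
static atomic_int device_abort_flag;
/* Counts abort requests seen when the dry-run handler is installed. */
static atomic_int dry_run_hits;

typedef void (*abort_handler_fn)(void);

struct abort_watcher {
    pthread_t thread;
    atomic_int stop;
    abort_handler_fn handler;
};

/* Dry-run handler: records the request instead of aborting. */
void count_instead_of_abort(void) { atomic_fetch_add(&dry_run_hits, 1); }

static void *watcher_main(void *arg) {
    struct abort_watcher *w = arg;
    for (;;) {
        /* Read stop first so a request made before stopping is never missed. */
        int stopping = atomic_load(&w->stop);
        if (atomic_load(&device_abort_flag) == 1) {
            w->handler(); /* abort() by default: does not return */
            break;
        }
        if (stopping)
            break;
        sched_yield(); /* let other threads run between polls */
    }
    return NULL;
}

/* Starts the polling thread; a NULL handler selects the standard abort(). */
int abort_watcher_start(struct abort_watcher *w, abort_handler_fn handler) {
    atomic_init(&w->stop, 0);
    w->handler = handler ? handler : abort;
    return pthread_create(&w->thread, NULL, watcher_main, w);
}

/* Normal shutdown path when no abort was requested. */
void abort_watcher_stop(struct abort_watcher *w) {
    atomic_store(&w->stop, 1);
    pthread_join(w->thread, NULL);
}

/* Stand-in for the device-side write that requests the abort. */
void simulate_device_abort(void) { atomic_store(&device_abort_flag, 1); }
```

Passing NULL as the handler gives the behavior described in the thread, where the host process calls abort() once the flag is observed. Note that this only covers detection on the host; it does nothing about a kernel that has already deadlocked or timed out, which is the concern raised above.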
For possible WIs that do not call abort, the return time is undefined and can
adhere to the original control flow of the kernel as long as all the WIs
eventually finish the execution and allow the host runtime to abort the
process also in a single-threaded/non-interrupting implementation.
The phrasing here makes it a bit hard to grasp what's actually meant.
Assume we have control flow that is well defined with all threads participating, but now a single thread calls abort_platform and quits. If that causes an infinite loop, an implementation might be able to guarantee that "the return time is undefined and can adhere to the original control flow of the kernel". However, this extension specifies that an implementation is only allowed to do so "as long as all the WIs eventually finish the execution and allow the host runtime to abort the process". What if they can't, because the one thread is causing issues?
Is it legal for an implementation to time out the kernel invocation, which might even cause system level problems, like a full GPU reset or even worse, a system reboot, in which case you won't get to see any printf buffer's contents?
You are right. This paragraph conflicts with the ability to abort kernel commands that have deadlocked (because of a partially reached barrier or for some other reason). I think it could simply be removed while still fulfilling the goal of being able to implement libc abort and CUDA trap.
of a device-side abort with a CPU device can simply call the standard abort()
directly to implement compliant behavior.

The results of an aborted kernel command are undefined. Device side abort is
Suggested change:
- The results of an aborted kernel command are undefined. Device side abort is
+ The results produced by an aborted kernel command are undefined. Device side abort is
For possible WIs that do not call abort, the return time is undefined and can
adhere to the original control flow of the kernel as long as all the WIs
eventually finish the execution and allow the host runtime to abort the
process also in a single-threaded/non-interrupting implementation.
Suggested change:
- For possible WIs that do not call abort, the return time is undefined and can
- adhere to the original control flow of the kernel as long as all the WIs
- eventually finish the execution and allow the host runtime to abort the
- process also in a single-threaded/non-interrupting implementation.
+ For possible other WIs that do not call abort can run for an undefined and
+ adhere to the original control flow of the kernel until the host detects the abort
+ request and ramps down the execution. Work-items misbehaving (for example
+ by means of illegal memory accesses) after the abort has been called result in undefined behavior, as normally.
@karolherbst could something along these lines cover your concerns?
small typo: "Possible other WIs that do not call abort can run for an undefined time ..."
But yeah, I think this covers it pretty well.
With the device side abort extension, a work-item (WI) can return from the kernel execution at any point and cause abnormal unrecoverable termination of the host process.
This extension differs from the cl_arm_controlled_kernel_termination extension in that the abort on the device side is expected to behave like a call to the POSIX abort() on the host side, also terminating the host process immediately.
The extension is meant to easily support POSIX abort()-like functionality on the device side, as well as serve as a basis for CUDA/HIP-style assertions.
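As a rough illustration of the assertion use case, the following host-side C sketch builds an assert-like macro on top of an abort call. The names device_abort, DEVICE_ASSERT and checked_div are hypothetical and not part of the extension; device_abort simply forwards to the standard abort(), which is what the PR text says a CPU device implementation may do.

```c
#include <stdio.h>
#include <stdlib.h>

/* Hypothetical stand-in for the device-side abort built-in. Here it forwards
   to the standard abort(), as a CPU device implementation could. */
static void device_abort(void) { abort(); }

/* Assert-like macro: report the failed condition, then abort. */
#define DEVICE_ASSERT(cond)                                             \
    do {                                                                \
        if (!(cond)) {                                                  \
            fprintf(stderr, "%s:%d: assertion failed: %s\n", __FILE__,  \
                    __LINE__, #cond);                                   \
            device_abort();                                             \
        }                                                               \
    } while (0)

/* Example of ported host code that guards a precondition. */
int checked_div(int numerator, int denominator) {
    DEVICE_ASSERT(denominator != 0);
    return numerator / denominator;
}
```

With a zero denominator the process is terminated via abort() after the message is printed; with valid input the macro has no effect on the result.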
An example CPU implementation in PoCL: parmance/pocl@d5e88c7