RFC: Enable per-thread default stream in free-threading builds #133
Comments
From the perspective of numba-cuda I think this is a positive change and a good chance to make the change to the default. Numba already supports PTDS when the user explicitly sets it: https://numba.readthedocs.io/en/stable/reference/envvars.html#envvar-NUMBA_CUDA_PER_THREAD_DEFAULT_STREAM

The user can also explicitly force which kind of default stream they want (https://numba.readthedocs.io/en/stable/cuda-reference/host.html#numba.cuda.per_thread_default_stream / https://numba.readthedocs.io/en/stable/cuda-reference/host.html#numba.cuda.legacy_default_stream), but I think this is orthogonal to what the default is.
cc @kmaehashi for vis
My worry about PTDS is that it is hard to sync across different threads’ default streams when the default value has a different meaning on every thread. Thread 2 cannot get a handle to thread 1’s default stream, if I understand correctly. Do we have a way for threads to give a handle or view on their own default stream to another thread? It seems like creating a pool of nonblocking streams is still necessary if synchronization across threads is needed. There is also a correctness concern if thread 2 wants to sync with work happening on thread 1’s default stream, because it has no way to represent that except for the magic default stream value, which is a different “physical” stream for each thread. Example:
If I am understanding correctly, a library developer has no way to enforce stream safety across threads under PTDS if the user-provided stream value is the legacy stream, while disallowing legacy stream use entirely seems like overstepping. I may be misrepresenting the PTDS model, and would love to improve my understanding if this is not a concern. In practice this could cause bugs if the user passes legacy stream
I had some discussions related to this topic offline. I am updating my position here a bit.
This is not so much of a concern. If threads wanted to synchronize across their (different) default streams, a shared event could be used to enforce synchronization.

The main concern remains that "A PTDS build can lead to race conditions in code that is otherwise safe that users were already relying on" (from @vyasr), which can be created by a situation like what I proposed in the example above. If the user's code was implicitly relying on legacy stream

At present, free-threaded Python users are likely to be early adopters, who might be more open to experimental/unstable behaviors. However, if free-threaded Python becomes the default in the future, I don't want to worry about whether enabling PTDS inadvertently added race conditions to existing code.

Moreover, I think the usability of PTDS is going to be limited unless we build every library with PTDS enabled. RAPIDS hopes to ship exactly one C++ wheel per library (for reasons like package size on PyPI and CI matrix growth), and that package is independent of the Python minor version, so shipping an additional PTDS-enabled package for free-threaded Python users is not something RAPIDS would be excited about adopting.
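The shared-event idea mentioned above can be illustrated with a CPU-only toy model. This is a sketch, not CUDA code: `ToyStream`, `default_stream`, and `run_demo` are hypothetical names invented here, a worker thread stands in for a stream, and a `threading.Event` stands in for a CUDA event. The point it tries to show is that under PTDS the same "default stream" call resolves to a different stream on each host thread, yet an event object shared between the threads can still order work across them.

```python
import queue
import threading
import time


class ToyStream:
    """Hypothetical stand-in for a CUDA stream: an in-order queue with one worker."""

    def __init__(self):
        self._tasks = queue.Queue()
        threading.Thread(target=self._run, daemon=True).start()

    def _run(self):
        while True:
            task = self._tasks.get()
            task()                        # tasks execute strictly in submission order
            self._tasks.task_done()

    def launch(self, fn):
        self._tasks.put(fn)               # returns immediately, like a kernel launch

    def record(self, event):
        self._tasks.put(event.set)        # event fires once all earlier work is done

    def wait(self, event):
        self._tasks.put(event.wait)       # later work in this stream waits for the event

    def synchronize(self):
        self._tasks.join()                # block the host thread until the queue drains


_tls = threading.local()


def default_stream():
    """Model of PTDS: the same call yields a distinct stream on each host thread."""
    if not hasattr(_tls, "stream"):
        _tls.stream = ToyStream()
    return _tls.stream


def run_demo():
    log = []
    produced = threading.Event()          # the shared event, visible to both threads

    def thread_1():
        s = default_stream()
        s.launch(lambda: (time.sleep(0.05), log.append("produce")))
        s.record(produced)
        s.synchronize()

    def thread_2():
        s = default_stream()              # a different stream than thread 1's
        s.wait(produced)                  # order this stream after thread 1's work
        s.launch(lambda: log.append("consume"))
        s.synchronize()

    threads = [threading.Thread(target=thread_2), threading.Thread(target=thread_1)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return log


if __name__ == "__main__":
    print(run_demo())
```

In real CUDA the analogous calls would be `cudaEventRecord` and `cudaStreamWaitEvent`. One difference from this model: a real `cudaStreamWaitEvent` issued before the event has ever been recorded does not wait, so the host threads would also need to make sure the record call is issued before the wait call.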
tl;dr: For the Python 3.13 free-threading build (`cp313t`), the per-thread default stream is enabled and used by default. Users need to set `CUDA_PYTHON_CUDA_PER_THREAD_DEFAULT_STREAM=0` to explicitly opt out and restore the old behavior.

In CUDA, there are two kinds of default streams:
Today, CUDA Python offers a way to switch between the legacy and per-thread default streams (at the time of loading driver symbols) via the environment variable `CUDA_PYTHON_CUDA_PER_THREAD_DEFAULT_STREAM`, and the default has been as if it is set to 0 (so using the legacy default stream).

However, the legacy default stream is a very common pitfall for performance-seeking applications and users, who find themselves needing to create nonblocking streams explicitly to avoid implicit synchronization. This change would remove the need to create nonblocking streams. It would also give GPU workloads launched from different host threads, without an explicit stream in use, an opportunity to overlap and execute in parallel, instead of being serialized on the same (legacy default) stream.
The free-threading build offers a natural opportunity and good timing for us to change the default so that it behaves as if the env var were set to 1, that is, using the per-thread default stream. This also gives NVIDIA a path forward to assess the feasibility of deprecating (and eventually removing!) the legacy default stream, which has been a long-standing goal.
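A minimal sketch of how the proposed default could be resolved, assuming the decision depends only on the environment variable and on whether the interpreter is a free-threading build. The helper names `is_free_threaded_build` and `use_per_thread_default_stream` are hypothetical, and the integer parsing of the variable is an assumption rather than CUDA Python's actual implementation. `sysconfig.get_config_var("Py_GIL_DISABLED")` is the standard-library way to detect a free-threading build.

```python
import os
import sysconfig

ENV_VAR = "CUDA_PYTHON_CUDA_PER_THREAD_DEFAULT_STREAM"


def is_free_threaded_build() -> bool:
    # Py_GIL_DISABLED is 1 on free-threading builds (e.g. cp313t) and
    # 0 or None on regular builds.
    return bool(sysconfig.get_config_var("Py_GIL_DISABLED"))


def use_per_thread_default_stream(environ=None, free_threaded=None) -> bool:
    """Hypothetical helper: decide whether PTDS should be in effect."""
    environ = os.environ if environ is None else environ
    if free_threaded is None:
        free_threaded = is_free_threaded_build()

    raw = environ.get(ENV_VAR)
    if raw is None:
        # Variable unset: legacy stream on regular builds,
        # per-thread default stream on free-threading builds.
        return free_threaded

    # Assumption: the value is parsed as an integer, nonzero meaning PTDS.
    # An explicit setting always wins, so "0" opts out on cp313t.
    return int(raw) != 0
```

Because the switch takes effect when driver symbols are loaded, a user opting out would need to set the variable before that point, for example in the shell that launches Python.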
Users who use the regular build will not be affected, only those testing the experimental `cp313t` free-threading build will.