[DONT MERGE] Make stream creation faster #677
Open
This is for discussion only.
Currently, creation of streams and other objects is very slow compared to CuPy.
The cause of the performance difference is that in CuPy the `Stream` constructor is entirely Cythonized, and all the calls it makes to the runtime/driver APIs go directly to the C functions by `cimport`ing the corresponding modules.
The two major overheads that we see in cuda-python are `weakref.finalize` for destruction and `_MembersNeededForFinalize`, which together take around 20% of the `__init__` time. If we naively rename `_stream.py` to `_stream.pyx` and Cythonize it, we don't get any benefit: the Cythonized code still deals with several Python objects, and every call to the bindings needs to go through the interpreter and convert the arguments/returns to Python objects. Moreover, the error check of a binding call is also done in Python, and it becomes one of the most noticeable bottlenecks. Even in the plain Python flamegraph you can see how `handle_return` appears several times with a noticeable impact. After testing the changes done here we get the following time for cuda-python:
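The `weakref.finalize` overhead mentioned above can be reproduced in isolation with a small microbenchmark. This is just an illustrative sketch (the class names and the no-op callbacks are stand-ins, not the actual cuda.core code); it compares per-construction cost of registering a finalizer against a plain `__del__`:

```python
import timeit
import weakref


class WithFinalize:
    # Registers a weakref.finalize callback on construction, mirroring
    # the cleanup pattern whose cost shows up in the __init__ profile.
    def __init__(self):
        self.handle = object()  # stand-in for a driver handle
        self._finalizer = weakref.finalize(self, lambda: None)


class WithDel:
    # Relies on __del__ instead of a registered finalizer.
    def __init__(self):
        self.handle = object()

    def __del__(self):
        pass  # stand-in for releasing the handle


t_finalize = timeit.timeit(WithFinalize, number=100_000)
t_del = timeit.timeit(WithDel, number=100_000)
print(f"weakref.finalize: {t_finalize:.3f}s  __del__: {t_del:.3f}s")
```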
I don't want to merge this PR as it is; the current approach is not good and needs a lot of discussion. The aim of this PR is to show where the current bottlenecks are and some ideas to work around them. For example:

- I think we should delegate error checking to the `cuda.bindings` package, similarly to what CuPy does. Going in/out of the Python interpreter for a binding call, and then again for the error check, is not cheap when dealing with times of ~1 µs.
- There should be an easy way to use the `cdef` functions of the bindings directly. However, we don't want to `cimport` bindings stuff in `cuda.core`, because that requires compiling with the CUDA headers and we would need to maintain various packages. Maybe it is better to add some base `Stream` and `Event` classes (and other performance-sensitive objects) in `cuda.bindings` and build over them in `cuda.core`.
- The destruction of some objects with `weakref` is expensive. We may want to switch to `__del__` with interpreter-shutdown detection like CuPy does, or Cythonize these objects and use `__dealloc__`.
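To make the error-check cost concrete, here is a minimal sketch of the kind of Python-level check that `handle_return` performs today. The `Result` enum and the tuple shape are illustrative assumptions, not the actual `cuda.bindings` API:

```python
from enum import IntEnum


class Result(IntEnum):
    # Hypothetical stand-in for the driver/runtime status codes.
    SUCCESS = 0
    ERROR_INVALID_VALUE = 1


class CUDAError(RuntimeError):
    pass


def handle_return(ret):
    # Split the (status, *values) tuple that a binding call returns and
    # raise on error. Every call here unpacks and allocates Python
    # objects, which is why doing this per binding call is expensive
    # when the call itself only takes ~1 us.
    status, *values = ret
    if status != Result.SUCCESS:
        raise CUDAError(Result(status).name)
    if len(values) == 1:
        return values[0]
    return tuple(values)


stream_handle = handle_return((Result.SUCCESS, 0xDEADBEEF))
```

Doing this check inside `cuda.bindings` (in Cython, before the result is ever boxed into Python objects) would remove one interpreter round-trip per call.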
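The `__del__`-with-shutdown-detection idea can be sketched as follows, in the spirit of what CuPy does. `cuda_stream_destroy` is a hypothetical placeholder for the real binding call, and `sys.is_finalizing()` is the stdlib way to detect interpreter shutdown:

```python
import sys


def cuda_stream_destroy(handle):
    # Placeholder for the real driver/binding call.
    pass


class Stream:
    # Sketch only: cleanup via __del__ instead of weakref.finalize,
    # skipping the driver call when the interpreter is shutting down.
    def __init__(self):
        self.handle = 42  # stand-in for a real CUstream handle

    def close(self):
        if self.handle is not None and not sys.is_finalizing():
            # At interpreter shutdown, module globals and the driver may
            # already be torn down, so skip the call rather than crash.
            cuda_stream_destroy(self.handle)
        self.handle = None

    def __del__(self):
        self.close()


s = Stream()
s.close()  # idempotent explicit release; __del__ covers the implicit case
```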
cuda-python flamegraph with main branch

cuda-python flamegraph with this PR

CuPy latest release flamegraph
