Skip to content

[DONT MERGE] Make stream creation faster #677

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
wants to merge 1 commit into
base: main
Choose a base branch
from

Conversation

emcastillo
Copy link

@emcastillo emcastillo commented Jun 4, 2025

This is for discussion only.

Currently stream and other objects creation is very slow when compared to CuPy

>>> import timeit
>>> timeit.timeit('device.create_stream()', setup='import cuda.core.experimental as ccx; device = ccx.Device(); device.set_current()')
5.1670654399786144
>>> timeit.timeit('cp.cuda.Stream()', setup='import cupy as cp')
2.4941531461663544

The cause between the performance difference is that in CuPy, the Stream constructor is entirely Cythonized and all that calls that it does to the runtime/driver APIs are direct calls to the c functions by cimporting the corresponding modules.

The two major overheads that we see in cuda-python are:

  • The weakref.finalize for destruction and _MembersNeededForFinalize taking around 20% of the _init time
  • Options object manipulation, an additional 20~30% of the time

If we naively rename the _stream.py to _stream.pyx and Cythonize it, we don't get any benefit.
The cythonized code is dealing with several python objects and every call to the bindings needs to go to through the interpreter and convert the argument/returns to python objects. Moreover, the error check of a binding call is also done in Python and it becomes one of the most noticeable bottlenecks. Even in the plain python flamegraph you can see how handle_return appear several times having a noticeable impact.

After testing the changes done here we get the following time for cuda-python

>>> timeit.timeit('device.create_stream()', setup='import cuda.core.experimental as ccx; device = ccx.Device(); device.set_current()')
2.2055921980645508

I don't want to merge this PR as it is, the current approach is not good and needs a lot of discussion. The aim of this PR is to show where current bottlenecks are and some ideas to work around them. For example:

  • I think we should delegate error checking to the cuda.bindings package similarly to what CuPy does. Going in/out the python interpreter for a binding call and then the error check is not cheap when dealing with times ~1us

  • There should be an easy way to use directly the cdef functions of the bindings. However, we don't want to cimport bindings stuff in cuda.core because this requires compile with the cuda headers and we will need to maintain various packages. Maybe it is better to add some base Stream and Event classes in cuda.bindings and other objects that are performance sensitive and build over them in cuda.core

  • The destruction of some objects with weakref is expensive. We may want to switch to __del__ with interpreter shutdown detection like CuPy does, or Cythonize these objects and use __dealloc__

cuda-python flamegraph with main branch
cuda_pyhon_main

cuda-python flamegraph with this pr
cuda_python_this_pr

CuPy latest release flamegraph
cupy

Copy link
Contributor

copy-pr-bot bot commented Jun 4, 2025

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

@leofang leofang added this to the cuda.core parking lot milestone Jun 4, 2025
@leofang leofang added enhancement Any code-related improvements triage Needs the team's attention P0 High priority - Must do! cuda.core Everything related to the cuda.core module labels Jun 4, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
cuda.core Everything related to the cuda.core module enhancement Any code-related improvements P0 High priority - Must do! triage Needs the team's attention
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants