[DONT MERGE] Make stream creation faster #677
Open
This is for discussion only.
Currently, creation of streams and other objects is very slow compared to CuPy.
The cause of the performance difference is that in CuPy the `Stream` constructor is entirely Cythonized, and all the calls it makes to the runtime/driver APIs go directly to the C functions by `cimport`ing the corresponding modules.
The two major overheads that we see in cuda-python are `weakref.finalize` for destruction and `_MembersNeededForFinalize`, which together take around 20% of the `__init__` time. If we naively rename `_stream.py` to `_stream.pyx` and Cythonize it, we don't get any benefit: the Cythonized code still deals with several Python objects, and every call to the bindings needs to go through the interpreter and convert the arguments/returns to Python objects. Moreover, the error check of a binding call is also done in Python, and it becomes one of the most noticeable bottlenecks. Even in the plain Python flamegraph you can see how `handle_return` appears several times with a noticeable impact. After testing the changes done here we get the following time for cuda-python:
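The `weakref.finalize` overhead mentioned above can be reproduced in isolation with a small microbenchmark. This is just an illustrative sketch (the class names and the no-op callbacks are stand-ins, not the actual cuda.core code); it compares per-construction cost of registering a finalizer against a plain `__del__`:

```python
import timeit
import weakref


class WithFinalize:
    # Registers a weakref.finalize callback on construction, mirroring
    # the cleanup pattern whose cost shows up in the __init__ profile.
    def __init__(self):
        self.handle = object()  # stand-in for a driver handle
        self._finalizer = weakref.finalize(self, lambda: None)


class WithDel:
    # Relies on __del__ instead of a registered finalizer.
    def __init__(self):
        self.handle = object()

    def __del__(self):
        pass  # stand-in for releasing the handle


t_finalize = timeit.timeit(WithFinalize, number=100_000)
t_del = timeit.timeit(WithDel, number=100_000)
print(f"weakref.finalize: {t_finalize:.3f}s  __del__: {t_del:.3f}s")
```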
I don't want to merge this PR as it is; the current approach is not good and needs a lot of discussion. The aim of this PR is to show where the current bottlenecks are and some ideas to work around them. For example:

- I think we should delegate error checking to the `cuda.bindings` package, similarly to what CuPy does. Going in/out of the Python interpreter for a binding call, and then again for the error check, is not cheap when dealing with times of ~1 µs.
- There should be an easy way to use the `cdef` functions of the bindings directly. However, we don't want to `cimport` bindings stuff in `cuda.core`, because that requires compiling with the CUDA headers and we would need to maintain various packages. Maybe it is better to add some base `Stream` and `Event` classes (and other performance-sensitive objects) in `cuda.bindings` and build over them in `cuda.core`.
- The destruction of some objects with `weakref` is expensive. We may want to switch to `__del__` with interpreter-shutdown detection like CuPy does, or Cythonize these objects and use `__dealloc__`.
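To make the error-check cost concrete, here is a minimal sketch of the kind of Python-level check that `handle_return` performs today. The `Result` enum and the tuple shape are illustrative assumptions, not the actual `cuda.bindings` API:

```python
from enum import IntEnum


class Result(IntEnum):
    # Hypothetical stand-in for the driver/runtime status codes.
    SUCCESS = 0
    ERROR_INVALID_VALUE = 1


class CUDAError(RuntimeError):
    pass


def handle_return(ret):
    # Split the (status, *values) tuple that a binding call returns and
    # raise on error. Every call here unpacks and allocates Python
    # objects, which is why doing this per binding call is expensive
    # when the call itself only takes ~1 us.
    status, *values = ret
    if status != Result.SUCCESS:
        raise CUDAError(Result(status).name)
    if len(values) == 1:
        return values[0]
    return tuple(values)


stream_handle = handle_return((Result.SUCCESS, 0xDEADBEEF))
```

Doing this check inside `cuda.bindings` (in Cython, before the result is ever boxed into Python objects) would remove one interpreter round-trip per call.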
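The `__del__`-with-shutdown-detection idea can be sketched as follows, in the spirit of what CuPy does. `cuda_stream_destroy` is a hypothetical placeholder for the real binding call, and `sys.is_finalizing()` is the stdlib way to detect interpreter shutdown:

```python
import sys


def cuda_stream_destroy(handle):
    # Placeholder for the real driver/binding call.
    pass


class Stream:
    # Sketch only: cleanup via __del__ instead of weakref.finalize,
    # skipping the driver call when the interpreter is shutting down.
    def __init__(self):
        self.handle = 42  # stand-in for a real CUstream handle

    def close(self):
        if self.handle is not None and not sys.is_finalizing():
            # At interpreter shutdown, module globals and the driver may
            # already be torn down, so skip the call rather than crash.
            cuda_stream_destroy(self.handle)
        self.handle = None

    def __del__(self):
        self.close()


s = Stream()
s.close()  # idempotent explicit release; __del__ covers the implicit case
```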
cuda-python flamegraph with main branch

cuda-python flamegraph with this PR

CuPy latest release flamegraph
