Make Device.set_current() faster #781

Open
wants to merge 4 commits into
base: main

Conversation

leofang (Member) commented Jul 25, 2025

Description

Still WIP.

closes #739

Before this PR:

In [4]: %timeit dev.set_current()
1.4 μs ± 10.9 ns per loop (mean ± std. dev. of 7 runs, 1,000,000 loops each)

With this PR:

In [4]: %timeit dev.set_current()
374 ns ± 1.17 ns per loop (mean ± std. dev. of 7 runs, 1,000,000 loops each)

We want to minimize the number of CUDA API calls, since each call adds overhead (several hundred nanoseconds). This is achieved by combining two changes:

  1. We lazily retain a reference to the primary context of each device
    • The only scenario where this could fail is if someone calls cudaDeviceReset() somewhere, in which case the pointers to the primary contexts would be invalidated and left dangling. However, the CUDA team strongly discourages using this API (it does not solve any real issue, e.g. restoring from sticky errors), and if it is called it is impossible to get things right, especially in Python land, where too many CUDA-related objects are floating around.
  2. We unconditionally set the needed primary context to current without extra checks
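
The effect of the two changes on the number of driver calls can be sketched roughly as follows. This is only an illustration, not the code in this PR: `FakeDriver`, `set_current_checked`, and `set_current_cached` are hypothetical names, the fake driver just counts calls instead of touching a GPU, and the "checked" path is an assumed conservative baseline rather than the exact pre-PR implementation.

```python
class FakeDriver:
    """Hypothetical stand-in for the CUDA driver; it only counts calls."""

    def __init__(self):
        self.calls = 0
        self.current = None

    def primary_ctx_retain(self, device_id):
        self.calls += 1
        return ("primary_ctx", device_id)

    def ctx_get_current(self):
        self.calls += 1
        return self.current

    def ctx_set_current(self, ctx):
        self.calls += 1
        self.current = ctx


def set_current_checked(driver, device_id):
    # Assumed baseline: retain the primary context, query the current
    # context, and set only if they differ (2 to 3 driver calls).
    ctx = driver.primary_ctx_retain(device_id)
    if driver.ctx_get_current() != ctx:
        driver.ctx_set_current(ctx)


def set_current_cached(driver, device_id, cache):
    # Change 1: retain the primary context lazily, once per device.
    ctx = cache.get(device_id)
    if ctx is None:
        ctx = cache[device_id] = driver.primary_ctx_retain(device_id)
    # Change 2: set it unconditionally, with no extra query (1 driver call).
    driver.ctx_set_current(ctx)


slow, fast, cache = FakeDriver(), FakeDriver(), {}
for _ in range(10):
    set_current_checked(slow, 0)
    set_current_cached(fast, 0, cache)
print(slow.calls, fast.calls)
```

In this toy model the cached path settles at one driver call per invocation after the first. The cached handle is also exactly what would dangle if something reset the device behind the cache's back, which is the cudaDeviceReset() caveat noted above.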

Checklist

  • New or existing tests cover these changes.
  • The documentation is up to date with these changes.

copy-pr-bot (bot, Contributor) commented Jul 25, 2025

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

leofang self-assigned this Jul 25, 2025
leofang added the labels enhancement (Any code-related improvements), P0 (High priority - Must do!), and cuda.core (Everything related to the cuda.core module) Jul 25, 2025
leofang added this to the cuda.core beta 6 milestone Jul 25, 2025
leofang requested a review from shwina July 25, 2025 04:09
leofang marked this pull request as ready for review July 25, 2025 04:09

leofang (Member, Author) commented Jul 25, 2025

/ok to test 92d24ce

Labels: cuda.core (Everything related to the cuda.core module), enhancement (Any code-related improvements), P0 (High priority - Must do!)
Projects: Status: Todo
Development

Successfully merging this pull request may close these issues.

Device.set_current() is slow
1 participant