
[ci] [CUDA] Switch to GitHub runner for GPU CI #6958


Open · wants to merge 1 commit into master

Conversation

letmaik (Member) commented on Jul 2, 2025

This PR switches to a GitHub-hosted runner for the GPU CI. If everything works, this will remove any dependence on Microsoft managing the internal runners.
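A minimal sketch of what the switch might look like in the workflow (the filename and runner label are assumptions for illustration; GPU-enabled GitHub-hosted runners are "larger runners" whose labels are chosen at the organization level):

```yaml
# Hypothetical excerpt, e.g. .github/workflows/cuda.yml (filename assumed).
name: CUDA CI (sketch)
on: [push, pull_request]

jobs:
  cuda-test:
    # Before: a self-hosted, Microsoft-managed machine, e.g.
    #   runs-on: [self-hosted, linux, gpu]
    # After: an org-defined label for a GitHub-hosted GPU (T4) larger runner.
    runs-on: linux-gpu-t4        # assumed label; the real one is set when the runner is created
    timeout-minutes: 60
    steps:
      - uses: actions/checkout@v4
      - name: Confirm the GPU is visible
        run: nvidia-smi
```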

letmaik (Member, Author) commented on Jul 2, 2025

I'm seeing quite a few intermittent issues with the GitHub GPU runner where the job cannot access the GPU, either from the start or at some point during the run, while at other times it runs through fine. I've raised an issue with GitHub support to look into it.

For reference, these two issues might be related:
NVIDIA/nvidia-container-toolkit#48
https://github.com/orgs/community/discussions/146879

jameslamb changed the title from "Switch to GitHub runner for GPU CI" to "[ci] [CUDA] Switch to GitHub runner for GPU CI" on Jul 3, 2025
jameslamb (Collaborator) left a comment


Oh wow, thank you!!

It would be AWESOME to be able to use GitHub-hosted runners for the CUDA jobs instead. It should be fine that those only have T4s: with GPU CI here, we're really just testing that the CUDA version of the library can at least be built and that the tests pass. We haven't had the resources to test coverage across different GPU architectures, for example.

I'm seeing quite a few intermittent issues with the GitHub GPU runner where the job cannot access the GPU, either from the start or at some point during the run

Yeah, I looked at that failed CUDA 11 job on the most recent run and saw tons of these:

[LightGBM] [Fatal] [CUDA] no CUDA-capable device is detected /tmp/pip-req-build-c2yfcsg8/src/io/cuda/cuda_column_data.cpp 18

(build link)

I don't see any obvious root causes in the logs. I think you're right to suspect that it's a problem with the runner itself.
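One way to make these failures easier to classify (a sketch, not something in this PR) is to bracket the test step with explicit nvidia-smi checks, so the log shows whether the device was missing from the start or vanished mid-job:

```yaml
# Excerpt of a job's steps; the test script name is a placeholder.
steps:
  - name: Check GPU before tests
    run: nvidia-smi
  - name: Run CUDA tests
    run: bash ./run-cuda-tests.sh   # placeholder for the actual test entry point
  - name: Check GPU after tests
    if: always()                    # run even if the test step failed
    run: nvidia-smi || echo "GPU no longer visible at the end of the job"
```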

I've raised an issue with GitHub support to look into it.

Was this a private support issue? If not, could you link it so I could subscribe?

letmaik (Member, Author) commented on Jul 3, 2025

Was this a private support issue? If not, could you link it so I could subscribe?

Yes, it's a private "premium" support issue. Those typically lead to faster outcomes, but it might still take a while to diagnose; it doesn't look like a super clear issue to me.

jameslamb (Collaborator) commented

Have you been able to narrow it down to a subset of the jobs?

For example, if it's only the CUDA 11.8 job, we could consider:

  • using a newer Ubuntu version for the CUDA 11 container images (maybe that'd be enough to fix it? see the sketch below)
  • dropping CUDA 11 CI here entirely (even RAPIDS is currently in the process of doing that: https://docs.rapids.ai/notices/rsn0048/)

Even if we had to drop CUDA 11 CI, I think it'd be worth it in exchange for removing the manual runner maintenance by Microsoft.
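As a rough sketch of the first option above (the matrix shape and image tags are illustrative, not the project's actual configuration), moving the CUDA 11 job to a newer Ubuntu base image could look like:

```yaml
# Illustrative matrix only; the real workflow's entries may differ.
strategy:
  matrix:
    include:
      - cuda_version: "11.8.0"
        # previously an older Ubuntu base (assumed); a newer one might behave
        # better on the GitHub-hosted GPU runner
        image: nvidia/cuda:11.8.0-devel-ubuntu22.04
      - cuda_version: "12.8.0"
        image: nvidia/cuda:12.8.0-devel-ubuntu24.04
```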

letmaik (Member, Author) commented on Jul 3, 2025

So far the 12.2.2 source build has always succeeded, while the 12.8.0 wheel build failed once and succeeded once; the CUDA 11.8.0 pip build has always failed. I'll run a few more attempts, but I'm pretty sure it's noise: sometimes nvidia-smi succeeds at the beginning and the GPU goes away later, and sometimes nvidia-smi already fails at the beginning. So it seems more like the runner is doing something that makes the GPU disappear when some event we don't control happens, and the longer the job runs, the more likely that becomes.
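A diagnostic step like the following (a sketch, not part of this PR) could make that pattern visible by logging roughly when the GPU stops responding during a job:

```yaml
# Background watchdog, diagnostic only: polls nvidia-smi every 30 seconds so
# the job log records approximately when the device disappears.
- name: Start GPU visibility watchdog
  run: |
    (
      while true; do
        if nvidia-smi --query-gpu=name,utilization.gpu --format=csv,noheader; then
          echo "GPU visible at $(date -u)"
        else
          echo "GPU NOT visible at $(date -u)"
        fi
        sleep 30
      done
    ) &
```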
