
[ci] [CUDA] Switch to GitHub runner for GPU CI #6958


Merged: 8 commits merged into master from letmaik/gpu-hosted-runner on Jul 28, 2025

Conversation

@letmaik (Member) commented on Jul 2, 2025

This PR switches to a GitHub hosted runner for the GPU CI. If all works ok, this will avoid any dependence on Microsoft managing the internal runners.
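
For illustration only (not the actual diff in this PR), switching a CUDA job from a self-hosted box to a GitHub-hosted GPU runner mostly comes down to changing the `runs-on` target. The workflow name, runner label, and step layout below are placeholders:

```yaml
# Illustrative sketch; the runner label is whatever the organization assigned
# when provisioning its GPU "larger runner", not a standard GitHub label.
name: CUDA CI

on:
  push:
    branches: [master]
  pull_request:

jobs:
  cuda-test:
    # Before: a self-hosted box managed internally, e.g.
    #   runs-on: [self-hosted, linux, gpu]
    # After: a GitHub-hosted GPU runner (placeholder label below).
    runs-on: gpu-t4-ubuntu-placeholder
    timeout-minutes: 60
    steps:
      - uses: actions/checkout@v4
      - name: Verify a CUDA-capable GPU is visible
        run: nvidia-smi
```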

@letmaik (Member, Author) commented on Jul 2, 2025

I'm seeing quite a few intermittent issues with the GitHub GPU runner: it can't access the GPU, either from the beginning or at some point during the job, while at other times it runs through fine. I've raised an issue with GitHub support to look at it.

For reference, those two might be related:
NVIDIA/nvidia-container-toolkit#48
https://github.com/orgs/community/discussions/146879
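
One way to tell whether the GPU was lost mid-job (rather than never present) is to bracket the build/test steps with visibility checks. This is only a diagnostic sketch, not something from this PR's diff, and the step names are made up:

```yaml
      - name: Check GPU before tests
        run: nvidia-smi   # exits non-zero if no CUDA-capable device is visible, failing fast

      # ... build and test steps go here ...

      - name: Check GPU after tests
        if: always()      # runs even when tests failed, to show whether the GPU disappeared mid-job
        run: nvidia-smi --query-gpu=name,driver_version --format=csv
```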

@jameslamb changed the title from "Switch to GitHub runner for GPU CI" to "[ci] [CUDA] Switch to GitHub runner for GPU CI" on Jul 3, 2025
@jameslamb (Collaborator) left a comment

Oh wow, thank you!!

It would be AWESOME to be able to use GitHub-hosted runners for the CUDA jobs instead. It should be fine that those only have T4s... with GPU CI here, we're really just testing that the CUDA version of the library can at least be built and that the tests pass... we haven't had the resources to try to test coverage of different GPU architectures, for example.

I'm seeing quite a few intermittent issues with the GitHub GPU runner in not being able to access the GPU either from the beginning or sometime during the job

Yeah, I looked at that failed CUDA 11 job on the most recent run and I see tons of these:

[LightGBM] [Fatal] [CUDA] no CUDA-capable device is detected /tmp/pip-req-build-c2yfcsg8/src/io/cuda/cuda_column_data.cpp 18

(build link)

I don't see any obvious root causes in the logs. I think you're right to suspect that it's a problem with the runner itself.

I've raised an issue with GitHub support to look at it.

Was this a private support issue? If not, could you link it so I could subscribe?

@letmaik (Member, Author) commented on Jul 3, 2025

Was this a private support issue? If not, could you link it so I could subscribe?

Yes, it's a private "premium" support issue. Typically those lead to faster outcomes, but it might still take a while to diagnose; it doesn't look like a super clear issue to me.

@jameslamb (Collaborator)

Have you been able to narrow it down to a subset of the jobs?

For example, if it's only the CUDA 11.8 job, we could consider:

  • using a newer Ubuntu version for CUDA 11 container images (maybe that'd be enough to fix it?)
  • dropping CUDA 11 CI here entirely (even RAPIDS is currently in the process of doing that: https://docs.rapids.ai/notices/rsn0048/)

Even if we had to drop CUDA 11 CI, I think it'd be worth it in exchange for removing the manual runner maintenance by Microsoft.

@letmaik (Member, Author) commented on Jul 3, 2025

So far the 12.2.2 source build has always succeeded, the 12.8.0 wheel build failed once and succeeded once, and the CUDA 11.8.0 pip build has always failed. I'll run a few more attempts, but I'm pretty sure it's noise: sometimes nvidia-smi succeeds at the beginning and the GPU goes away later, and sometimes nvidia-smi already fails at the beginning. So it seems more like the runner is doing something that makes the GPU go away when some event we don't control happens, and the longer the job runs, the more likely that gets.

@shiyu1994 self-requested a review as a code owner on July 24, 2025 03:22
@jameslamb (Collaborator) commented on Jul 26, 2025

On the most recent run (build link), only the CUDA 11.8 job failed... and that was with the one test failure from #6703, not the types of issues described above, like losing connection to the GPU.

I'm going to try a few more re-runs. Will keep updating this comment.

| build link | CUDA 12.8 wheel | CUDA 12.2 source | CUDA 11.8 bdist |
| --- | --- | --- | --- |
| run 1 | | | ❌ (#6703) |
| run 2 | | | ❌ (#6703) |
| run 3 | | | |
| run 4 | | | |
| run 5 | | | |
| run 6 | | | |
| run 7 | | | |
| run 8 | | | |
| run 9 | | | |
| run 10 | | | |

@jameslamb self-assigned this on Jul 26, 2025

@jameslamb (Collaborator)

If all works ok, this will avoid any dependence on Microsoft managing the internal runners.

I also want to mention one other benefit... with this change, we get higher concurrency and therefore reduced end-to-end time for CI!!! With the self-hosted runner, we only have 1 box, so we get only 1 CUDA CI job at a time across all workflow runs for all commits / PRs.

Now we could get multiple jobs at the same time 😁

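To picture the concurrency gain: on the single self-hosted box, the three CUDA jobs discussed in this thread (12.8.0 wheel, 12.2.2 source, 11.8.0 pip) had to queue one at a time, while on hosted runners a matrix can fan them out in parallel. A rough sketch follows; the matrix keys, runner label, and script path are placeholders rather than the names used in this PR:

```yaml
# Sketch only: the matrix entries mirror the jobs discussed in this thread;
# the runner label and CI script are hypothetical.
jobs:
  cuda-test:
    runs-on: gpu-t4-ubuntu-placeholder   # placeholder GPU runner label
    strategy:
      fail-fast: false
      matrix:
        include:
          - cuda_version: "12.8.0"
            method: wheel
          - cuda_version: "12.2.2"
            method: source
          - cuda_version: "11.8.0"
            method: pip
    steps:
      - uses: actions/checkout@v4
      - name: "Build and test (CUDA ${{ matrix.cuda_version }}, ${{ matrix.method }})"
        # hypothetical entry point for the build + test logic
        run: ./ci/run_cuda_tests.sh "${{ matrix.cuda_version }}" "${{ matrix.method }}"
```

On hosted runners, each matrix entry runs on its own machine (subject to the org's concurrency limits), so the three jobs can run at the same time instead of queuing behind a single self-hosted box.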

@jameslamb (Collaborator) left a comment

@letmaik it looks to me like this is working!!!

Saw that over a few re-runs: #6958 (comment)

And I think it's totally fine to run LightGBM's tests on T4s.

I think this should be merged; it's a really nice improvement for the long-term health of the project.


The other CI failures are unrelated issues that have accumulated over the last week:

@letmaik (Member, Author) commented on Jul 27, 2025

@jameslamb Let's stress-test this a little more to make sure it's really working reliably. Maybe do 5 more run attempts? Unfortunately, I haven't heard back yet from GitHub Enterprise Support on the original issue.

@jameslamb (Collaborator)

Ok sure! I can do that. I think I'll do the next round with new empty commits instead of clicking "re-run all jobs", just in case that affects anything.

@StrikerRUS (Collaborator) left a comment

🚀

@jameslamb (Collaborator)

After 5 more runs, this still looks to be working well!

Updated #6958 (comment)

I think we can and should merge this, what do you think @letmaik? Also, either way, please do let us know if you get a response on your GitHub support ticket in the future.

@letmaik (Member, Author) commented on Jul 28, 2025

@jameslamb Alright, let's do it. And even if there's still the occasional failure in the future, it's easy to re-run a job.

@letmaik merged commit dcb5d97 into master on Jul 28, 2025 (76 of 78 checks passed)
@letmaik deleted the letmaik/gpu-hosted-runner branch on July 28, 2025 at 07:07
@jameslamb (Collaborator)

Alright great, thanks!

@letmaik (Member, Author) commented on Aug 4, 2025

@jameslamb I got a response from GitHub support: they couldn't reproduce the issue. They mentioned that they released a new NVIDIA image last week, version 20250730.36.1, which updates the GPU driver, but I checked all the runs we had and they all used the older image 20250716.20.1, so it's not related to that.

If you observe any issues with lost GPUs, please ping me, then I can engage with support again.

@jameslamb (Collaborator)

Ok will do, thanks again for all your help!
