
Fix CUDA plugin CI. #8593

Merged
merged 20 commits on Feb 12, 2025

Conversation

ysiraichi
Collaborator

Fix: #8577

This PR reverts #8286 and bumps the CUDA version to 12.3. The bump is needed to successfully compile GPU-dependent source code that uses the CUgraphConditionalHandle driver API typedef, which is not available in CUDA 12.1.
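As a minimal sketch of the version constraint above (the banner string and the parsing helper are illustrative, not part of this PR), a build script could verify that the detected CUDA toolkit is at least 12.3 before compiling the plugin:

```python
import re

# Minimum CUDA toolkit version that ships the CUgraphConditionalHandle
# driver API typedef (see the PR description above).
REQUIRED = (12, 3)

def cuda_version(nvcc_banner):
    # Pull the "release X.Y" pair out of nvcc's version banner.
    match = re.search(r"release (\d+)\.(\d+)", nvcc_banner)
    return (int(match.group(1)), int(match.group(2)))

# Example banner; a real check would capture `nvcc --version` output instead.
banner = "Cuda compilation tools, release 12.3, V12.3.107"
print(cuda_version(banner) >= REQUIRED)  # True: 12.3 satisfies the minimum
```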

@amjames
Collaborator

amjames commented Jan 22, 2025

Looks like the failing jobs are due to a failed clone from kleidiai's GitLab. Is that a widespread issue or a spurious failure?

@ysiraichi
Collaborator Author

It doesn't look widespread (I haven't seen it in other PRs). I will try rebasing this PR.

@ysiraichi ysiraichi force-pushed the fix-cuda-plugin-compilation branch from b5474c1 to afc5707 on January 22, 2025 15:04
@tengyifei
Collaborator

@ysiraichi from pytorch/pytorch#138609 (comment), it looks like PyTorch upstream decided to release with some specific set of CUDA versions (see issue). Can we use one of their chosen versions, for example CUDA 12.4 instead of CUDA 12.3?

@ysiraichi
Collaborator Author

The problem is: I didn't find a docker image with CUDA 12.4. Also, I'm not sure how to create one, since it seems to be something internal.

@tengyifei
Collaborator

The problem is: I didn't find a docker image with CUDA 12.4. Also, I'm not sure how to create one, since it seems to be something internal.

Could you clarify this challenge? Do you mean that you were hoping to find a torch_xla CUDA 12.4 docker build?

@ysiraichi
Collaborator Author

As far as I understand, PyTorch/XLA CI relies on docker images (see dev-image). My point is that there is no docker image with CUDA 12.4 in that registry.

# build-torch-with-cuda:
#   name: "Build PyTorch with CUDA"
#   uses: ./.github/workflows/_build_torch_with_cuda.yml
#   needs: get-torch-commit
#   with:
#     # note that to build a torch wheel with CUDA enabled, we do not need a GPU runner.
#     dev-image: us-central1-docker.pkg.dev/tpu-pytorch-releases/docker/development:3.10_cuda_12.1
#     torch-commit: ${{needs.get-torch-commit.outputs.torch_commit}}
#     runner: linux.24xlarge
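For reference, re-enabling that job on the bumped toolkit might look like the sketch below. The `3.10_cuda_12.3` image tag is an assumption about the registry's naming scheme, not a tag confirmed anywhere in this thread:

```yaml
# Sketch only: assumes a 3.10_cuda_12.3 development image has been published
# to the same registry as the 3.10_cuda_12.1 image above.
build-torch-with-cuda:
  name: "Build PyTorch with CUDA"
  uses: ./.github/workflows/_build_torch_with_cuda.yml
  needs: get-torch-commit
  with:
    # Building a torch wheel with CUDA enabled does not need a GPU runner.
    dev-image: us-central1-docker.pkg.dev/tpu-pytorch-releases/docker/development:3.10_cuda_12.3
    torch-commit: ${{needs.get-torch-commit.outputs.torch_commit}}
    runner: linux.24xlarge
```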

@tengyifei
Collaborator

As far as I understand, PyTorch/XLA CI relies on docker images (see dev-image). My point is that there is no docker image with CUDA 12.4 in that registry.

@ysiraichi got it. thanks for the explanation. I think using CUDA 12.3 for now is a-okay. IIUC, most of the time we're only using torch CPU + torch_xla GPU in any case.

@tengyifei
Collaborator

LMK when I should review. It looks like there are still some failed tests.

@ysiraichi ysiraichi marked this pull request as ready for review on January 29, 2025 19:05
@ysiraichi ysiraichi force-pushed the fix-cuda-plugin-compilation branch from 44b72a2 to 100385a on February 3, 2025 15:18
@ysiraichi
Collaborator Author

After discussing this issue with @lsy323, I think it could be due to the older GPUs in these instances. Is it okay if I try them out with a linux.g4dn.12xlarge.nvidia.gpu runner?

Current: g3.8xlarge (2x Tesla M60, compute capability 5.2)
New: g4dn.12xlarge (4x Tesla T4, compute capability 7.5)

@miladm @tengyifei
Let me know what you think.
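The runner comparison above boils down to a compute-capability check, which a tiny sketch can illustrate (the 7.0 minimum is an assumed, illustrative threshold, not something stated in this thread):

```python
# Compute capabilities from the comment above: Tesla M60 is 5.2, Tesla T4 is 7.5.
MIN_CC = (7, 0)  # assumed minimum for the failing kernels; illustrative only

def meets_minimum(cc, minimum=MIN_CC):
    # (major, minor) tuples compare lexicographically, matching how CUDA
    # compute capabilities are ordered.
    return cc >= minimum

print(meets_minimum((5, 2)))  # False: Tesla M60 on g3.8xlarge falls short
print(meets_minimum((7, 5)))  # True: Tesla T4 on g4dn.12xlarge qualifies
```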

@tengyifei
Collaborator

Seems reasonable to me. For comparison, we have created similar or more advanced dev VMs before.

@ysiraichi ysiraichi force-pushed the fix-cuda-plugin-compilation branch from e7c2351 to 75f1deb on February 5, 2025 19:28
@ysiraichi
Collaborator Author

There are some GPU tests and Triton tests that are still failing. I think we should merge this PR and skip them for now. I will make sure to open issues for each of them.

@miladm @tengyifei What do you think?

@tengyifei
Collaborator

I think we should merge this PR and skip them for now. I will make sure to open issues for each of them.

That sounds good.

# runner: linux.24xlarge
build-torch-with-cuda:
  name: "Build PyTorch with CUDA"
  uses: ./.github/workflows/_build_torch_with_cuda.yml
Collaborator

I didn't know that we also need to build PyTorch with CUDA. Does building PyTorch with CPU not work for some reason? If we do need to build PyTorch with CUDA, do you know whether 12.3 is a supported CUDA version to build PyTorch with? From pytorch/pytorch#138609, it looks like upstream picked 12.4 or 12.6.

Collaborator Author

Does building PyTorch with CPU not work for some reason?

There are a few tests that only work when PyTorch is also built with CUDA support. That said, I don't know whether we are actually testing them on CI.

xla/test/test_operations.py

Lines 165 to 168 in 065cb5b

def onlyIfTorchSupportsCUDA(fn):
  return unittest.skipIf(
      not torch.cuda.is_available(), reason="requires PyTorch CUDA support")(
          fn)
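A runnable sketch of how that decorator behaves, with a hard-coded flag standing in for torch.cuda.is_available() so the example does not need PyTorch installed:

```python
import io
import unittest

# Stand-in for torch.cuda.is_available(); False mimics a CPU-only torch build.
CUDA_AVAILABLE = False

def onlyIfTorchSupportsCUDA(fn):
    # Same shape as the decorator in xla/test/test_operations.py: skip the
    # test unless the PyTorch build reports CUDA support.
    return unittest.skipIf(
        not CUDA_AVAILABLE, reason="requires PyTorch CUDA support")(fn)

class DemoTest(unittest.TestCase):

    @onlyIfTorchSupportsCUDA
    def test_needs_cuda(self):
        self.assertTrue(True)

suite = unittest.defaultTestLoader.loadTestsFromTestCase(DemoTest)
result = unittest.TextTestRunner(stream=io.StringIO()).run(suite)
print(len(result.skipped))  # 1: the test is skipped on a CPU-only build
```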

Collaborator Author

Regarding the CUDA version: I don't think 12.3 is a supported version. However, I believe it should compile fine. That said, I think it would be better to change it to a supported version.

Collaborator

Got it. I think we can address this as a follow up.

Collaborator

Could you add a TODO here mentioning #8700?

@ysiraichi
Collaborator Author

Update: there are still a few CI failures. I will:

  • add a skip to each of them
  • track the status of each one in an issue
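The skip-and-track pattern described above could look like the sketch below; the test name and skip message are hypothetical, and the tracking reference sits right next to the skip so it is easy to find later:

```python
import io
import unittest

class GpuRegressionTests(unittest.TestCase):

    # Ref: https://github.com/pytorch/xla/pull/8593. Skipped unconditionally
    # until the underlying failure is fixed; tracked in a follow-up issue.
    @unittest.skip("fails on CUDA 12.3 CI; tracked in a follow-up issue")
    def test_gpu_feature(self):
        self.assertTrue(False)  # would fail if it ever ran un-skipped

suite = unittest.defaultTestLoader.loadTestsFromTestCase(GpuRegressionTests)
result = unittest.TextTestRunner(stream=io.StringIO()).run(suite)
print(len(result.skipped), len(result.failures))  # 1 skipped, 0 failures
```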

@ysiraichi
Collaborator Author

@tengyifei All tests seem to be passing! Could you take a look at this PR?

Collaborator

@tengyifei tengyifei left a comment

Thanks!

@@ -38,6 +39,8 @@ def _ddp_correctness(rank,
  def test_ddp_correctness(self):
    torch_xla.launch(self._ddp_correctness, args=(False, FLAGS.debug))

  # Ref: https://github.com/pytorch/xla/pull/8593
Collaborator

Did we file separate issue(s) to fix these failures?

Collaborator Author

Not yet. But I will do that right now!

@ysiraichi
Collaborator Author

Oops. I think GitHub is waiting for @ManfeiBai, @lsy323, or @zpcore to accept it.

@lsy323 lsy323 merged commit 42edbe1 into master Feb 12, 2025
23 checks passed
@ysiraichi ysiraichi mentioned this pull request Feb 13, 2025
Successfully merging this pull request may close these issues.

Bring back PyTorch/XLA GPU tests/builds