
Fix CUDA plugin CI. #8593

Merged
merged 20 commits on Feb 12, 2025

Conversation

ysiraichi
Collaborator

Fix: #8577

This PR reverts #8286 and bumps the CUDA version to 12.3. The bump is needed to successfully compile GPU-dependent source code that uses the CUgraphConditionalHandle driver API typedef, which is not available in CUDA 12.1.
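As a minimal sketch of the version constraint above (the banner string and the parsing helper are illustrative, not part of this PR), a build script could verify that the detected CUDA toolkit is at least 12.3 before compiling the plugin:

```python
import re

# Minimum CUDA toolkit version that ships the CUgraphConditionalHandle
# driver API typedef (see the PR description above).
REQUIRED = (12, 3)

def cuda_version(nvcc_banner):
    # Pull the "release X.Y" pair out of nvcc's version banner.
    match = re.search(r"release (\d+)\.(\d+)", nvcc_banner)
    return (int(match.group(1)), int(match.group(2)))

# Example banner; a real check would capture `nvcc --version` output instead.
banner = "Cuda compilation tools, release 12.3, V12.3.107"
print(cuda_version(banner) >= REQUIRED)  # True: 12.3 satisfies the minimum
```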

@amjames
Collaborator

amjames commented Jan 22, 2025

Looks like the failing jobs are due to a failed clone from kleidiai's GitLab. Is that a widespread issue or a spurious failure?

@ysiraichi
Collaborator Author

It doesn't look widespread (I haven't seen it in other PRs). I will try rebasing this PR.

@ysiraichi ysiraichi force-pushed the fix-cuda-plugin-compilation branch from b5474c1 to afc5707 on January 22, 2025 15:04
@tengyifei
Collaborator

@ysiraichi from pytorch/pytorch#138609 (comment), it looks like PyTorch upstream decided to release with some specific set of CUDA versions (see issue). Can we use one of their chosen versions, for example CUDA 12.4 instead of CUDA 12.3?

@ysiraichi
Collaborator Author

The problem is: I didn't find a docker image with CUDA 12.4. Also, I'm not sure how to create one, since it seems to be something internal.

@tengyifei
Collaborator

The problem is: I didn't find a docker image with CUDA 12.4. Also, I'm not sure how to create one, since it seems to be something internal.

Could you clarify this challenge? Do you mean that you were hoping to find a torch_xla CUDA 12.4 docker build?

@ysiraichi
Collaborator Author

As far as I understand, PyTorch/XLA CI relies on docker images (see dev-image). My point is that there is no docker image with CUDA 12.4 in that registry.

# build-torch-with-cuda:
#   name: "Build PyTorch with CUDA"
#   uses: ./.github/workflows/_build_torch_with_cuda.yml
#   needs: get-torch-commit
#   with:
#     # note that to build a torch wheel with CUDA enabled, we do not need a GPU runner.
#     dev-image: us-central1-docker.pkg.dev/tpu-pytorch-releases/docker/development:3.10_cuda_12.1
#     torch-commit: ${{needs.get-torch-commit.outputs.torch_commit}}
#     runner: linux.24xlarge
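For reference, re-enabling that job on the bumped toolkit might look like the sketch below. The `3.10_cuda_12.3` image tag is an assumption about the registry's naming scheme, not a tag confirmed anywhere in this thread:

```yaml
# Sketch only: assumes a 3.10_cuda_12.3 development image has been published
# to the same registry as the 3.10_cuda_12.1 image above.
build-torch-with-cuda:
  name: "Build PyTorch with CUDA"
  uses: ./.github/workflows/_build_torch_with_cuda.yml
  needs: get-torch-commit
  with:
    # Building a torch wheel with CUDA enabled does not need a GPU runner.
    dev-image: us-central1-docker.pkg.dev/tpu-pytorch-releases/docker/development:3.10_cuda_12.3
    torch-commit: ${{needs.get-torch-commit.outputs.torch_commit}}
    runner: linux.24xlarge
```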

@tengyifei
Collaborator

As far as I understand, PyTorch/XLA CI relies on docker images (see dev-image). My point is that there is no docker image with CUDA 12.4 in that registry.

@ysiraichi got it. thanks for the explanation. I think using CUDA 12.3 for now is a-okay. IIUC, most of the time we're only using torch CPU + torch_xla GPU in any case.

@tengyifei
Collaborator

LMK when I should review. It looks like there are still some failed tests.

@ysiraichi ysiraichi marked this pull request as ready for review on January 29, 2025 19:05
@ysiraichi ysiraichi force-pushed the fix-cuda-plugin-compilation branch from 44b72a2 to 100385a on February 3, 2025 15:18
@ysiraichi
Collaborator Author

After discussing this issue with @lsy323, I think it could be due to the older GPUs in these instances. Is it okay if I try them out with a linux.g4dn.12xlarge.nvidia.gpu runner?

Current: g3.8xlarge (2x Tesla M60, compute capability 5.2)
New: g4dn.12xlarge (4x Tesla T4, compute capability 7.5)

@miladm @tengyifei
Let me know what you think.
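The runner comparison above boils down to a compute-capability check, which a tiny sketch can illustrate (the 7.0 minimum is an assumed, illustrative threshold, not something stated in this thread):

```python
# Compute capabilities from the comment above: Tesla M60 is 5.2, Tesla T4 is 7.5.
MIN_CC = (7, 0)  # assumed minimum for the failing kernels; illustrative only

def meets_minimum(cc, minimum=MIN_CC):
    # (major, minor) tuples compare lexicographically, matching how CUDA
    # compute capabilities are ordered.
    return cc >= minimum

print(meets_minimum((5, 2)))  # False: Tesla M60 on g3.8xlarge falls short
print(meets_minimum((7, 5)))  # True: Tesla T4 on g4dn.12xlarge qualifies
```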

@tengyifei
Collaborator

Seems reasonable to me. For comparison, we have created similar or more advanced dev VMs before.

@ysiraichi ysiraichi force-pushed the fix-cuda-plugin-compilation branch from e7c2351 to 75f1deb on February 5, 2025 19:28
@ysiraichi
Collaborator Author

There are some GPU tests and Triton tests that are still failing. I think we should merge this PR and skip them for now. I will make sure to open issues for each of them.

@miladm @tengyifei What do you think?

@tengyifei
Collaborator

I think we should merge this PR and skip them for now. I will make sure to open issues for each of them.

That sounds good.

# runner: linux.24xlarge
build-torch-with-cuda:
  name: "Build PyTorch with CUDA"
  uses: ./.github/workflows/_build_torch_with_cuda.yml
Collaborator

I didn't know that we also need to build PyTorch with CUDA. Does building PyTorch with CPU not work for some reason? If we do need to build PyTorch with CUDA, do you know whether 12.3 is a supported CUDA version to build PyTorch with? From pytorch/pytorch#138609, it looks like upstream picked 12.4 or 12.6.

Collaborator Author

Does building PyTorch with CPU not work for some reason?

There are a few tests that only work when PyTorch is also built with CUDA support. That said, I don't know whether we are actually testing them on CI.

xla/test/test_operations.py

Lines 165 to 168 in 065cb5b

def onlyIfTorchSupportsCUDA(fn):
  return unittest.skipIf(
      not torch.cuda.is_available(), reason="requires PyTorch CUDA support")(
          fn)
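A runnable sketch of how that decorator behaves, with a hard-coded flag standing in for torch.cuda.is_available() so the example does not need PyTorch installed:

```python
import io
import unittest

# Stand-in for torch.cuda.is_available(); False mimics a CPU-only torch build.
CUDA_AVAILABLE = False

def onlyIfTorchSupportsCUDA(fn):
    # Same shape as the decorator in xla/test/test_operations.py: skip the
    # test unless the PyTorch build reports CUDA support.
    return unittest.skipIf(
        not CUDA_AVAILABLE, reason="requires PyTorch CUDA support")(fn)

class DemoTest(unittest.TestCase):

    @onlyIfTorchSupportsCUDA
    def test_needs_cuda(self):
        self.assertTrue(True)

suite = unittest.defaultTestLoader.loadTestsFromTestCase(DemoTest)
result = unittest.TextTestRunner(stream=io.StringIO()).run(suite)
print(len(result.skipped))  # 1: the test is skipped on a CPU-only build
```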

Collaborator Author

Regarding the CUDA version: I don't think 12.3 is a supported version. However, I believe it should compile fine. That said, I think it would be better to change it to a supported version.

Collaborator

Got it. I think we can address this as a follow up.

Collaborator

Could you add a TODO here mentioning #8700?

@ysiraichi
Collaborator Author

Update: there are still a few CI failures. I will:

  • add a skip to each of them
  • track the status of each one in an issue
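The skip-and-track pattern described above could look like the sketch below; the test name and skip message are hypothetical, and the tracking reference sits right next to the skip so it is easy to find later:

```python
import io
import unittest

class GpuRegressionTests(unittest.TestCase):

    # Ref: https://github.com/pytorch/xla/pull/8593. Skipped unconditionally
    # until the underlying failure is fixed; tracked in a follow-up issue.
    @unittest.skip("fails on CUDA 12.3 CI; tracked in a follow-up issue")
    def test_gpu_feature(self):
        self.assertTrue(False)  # would fail if it ever ran un-skipped

suite = unittest.defaultTestLoader.loadTestsFromTestCase(GpuRegressionTests)
result = unittest.TextTestRunner(stream=io.StringIO()).run(suite)
print(len(result.skipped), len(result.failures))  # 1 skipped, 0 failures
```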

@ysiraichi
Collaborator Author

@tengyifei All tests seem to be passing! Could you take a look at this PR?

Collaborator

@tengyifei tengyifei left a comment

Thanks!

@@ -38,6 +39,8 @@ def _ddp_correctness(rank,
  def test_ddp_correctness(self):
    torch_xla.launch(self._ddp_correctness, args=(False, FLAGS.debug))

  # Ref: https://github.com/pytorch/xla/pull/8593
Collaborator

Did we file separate issue(s) to fix these failures?

Collaborator Author

Not yet. But I will do that right now!

@ysiraichi
Collaborator Author

Oops. I think GitHub is waiting for @ManfeiBai, @lsy323, or @zpcore to accept it.

@lsy323 lsy323 merged commit 42edbe1 into master Feb 12, 2025
23 checks passed
@ysiraichi ysiraichi mentioned this pull request Feb 13, 2025
Successfully merging this pull request may close these issues.

Bring back PyTorch/XLA GPU tests/builds