Fix CUDA plugin CI. #8593
Conversation
Looks like the failing jobs are due to a failed clone from kleidiai's GitLab. Is that a widespread issue or a spurious failure?
It doesn't look widespread (I haven't seen it in other PRs). I will try rebasing this PR.
(force-pushed from b5474c1 to afc5707)
@ysiraichi from pytorch/pytorch#138609 (comment), it looks like PyTorch upstream decided to release with a specific set of CUDA versions (see the issue). Can we use one of their chosen versions, for example CUDA 12.4 instead of CUDA 12.3?
Problem is: I didn't find a docker image with CUDA 12.4. Also, I'm not sure how to create one, since it seems to be something internal.
Could you clarify this challenge? Do you mean that you were hoping to find a
As far as I understand, PyTorch/XLA CI relies on docker images (see xla/.github/workflows/build_and_test.yml, lines 44 to 52 at fbbdfca).
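A minimal sketch of that coupling, with a hypothetical reusable-workflow path, input name, and image tag (the real values live in build_and_test.yml): the CUDA toolkit version is baked into the pinned dev image, which is why bumping CUDA first requires finding or building a matching image.

```yaml
jobs:
  build-cuda-plugin:
    # Hypothetical reusable workflow; the real CI calls similar
    # _build_*.yml workflows under .github/workflows/.
    uses: ./.github/workflows/_build_plugin.yml
    with:
      # Illustrative image tag: the CUDA toolkit version is part of
      # the docker image, not a free-standing workflow parameter.
      dev-image: gcr.io/example/xla-dev:cuda-12.3
```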
@ysiraichi got it, thanks for the explanation. I think using CUDA 12.3 for now is a-okay. IIUC, most of the time we're only using torch CPU + torch_xla GPU in any case.
LMK when I should review. It looks like there are still some failed tests.
(force-pushed from 44b72a2 to 100385a)
After discussing this issue with @lsy323, I think the issue could be due to the older GPU in these instances. Is it okay if I try them out with a newer instance? @miladm @tengyifei
Seems reasonable to me -- for comparison, we have created similar or more advanced dev VMs before.
(force-pushed from e7c2351 to 75f1deb)
There are some GPU tests and Triton tests that are still failing right now. I think we should merge this PR, skipping them for now. I will make sure to open issues for each of them. @miladm @tengyifei What do you think?
That sounds good. |
# runner: linux.24xlarge
build-torch-with-cuda:
  name: "Build PyTorch with CUDA"
  uses: ./.github/workflows/_build_torch_with_cuda.yml
I didn't know that we also need to build PyTorch with CUDA. Does building PyTorch with CPU not work for some reason? If we need to build PyTorch with CUDA, do you know if 12.3 is a supported CUDA version to build PyTorch with? From pytorch/pytorch#138609 it looks like upstream picked 12.4 or 12.6.
Does building PyTorch with CPU not work for some reason?
There are a few tests that only work when PyTorch is also built with CUDA support. That said, I don't know whether we are actually testing them on CI.
Lines 165 to 168 at 065cb5b:

def onlyIfTorchSupportsCUDA(fn):
  return unittest.skipIf(
      not torch.cuda.is_available(), reason="requires PyTorch CUDA support")(
          fn)
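As a self-contained illustration of the skip pattern in the snippet above: the real decorator queries torch.cuda.is_available(), but here a hardcoded flag stands in so the sketch runs without PyTorch installed; the test class and names are hypothetical.

```python
import io
import unittest

# Stand-in for torch.cuda.is_available(); hardcoded so this sketch
# runs without a PyTorch installation.
CUDA_AVAILABLE = False


def onlyIfTorchSupportsCUDA(fn):
  # Same shape as the quoted decorator: skip the wrapped test unless
  # the build reports CUDA support.
  return unittest.skipIf(
      not CUDA_AVAILABLE, reason="requires PyTorch CUDA support")(fn)


class ExampleTest(unittest.TestCase):

  @onlyIfTorchSupportsCUDA
  def test_needs_cuda(self):
    self.fail("never reached while CUDA is unavailable")


suite = unittest.defaultTestLoader.loadTestsFromTestCase(ExampleTest)
result = unittest.TextTestRunner(stream=io.StringIO()).run(suite)
print(len(result.skipped))  # prints 1: the test was skipped, not run
```

Because skipIf wraps the function at decoration time, the skip shows up in the test report rather than silently passing, which is what makes it visible whether CI actually exercises these CUDA-only tests.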
Regarding CUDA version, I don't think it's a supported version. However, I believe it should compile fine. That said, I think it would be better to change it to a supported version.
Got it. I think we can address this as a follow up.
Could you add a TODO here mentioning #8700?
Update: there are still a few CI failures
@tengyifei All tests seem to be passing! Could you take a look at this PR?
Thanks!
@@ -38,6 +39,8 @@ def _ddp_correctness(rank,
  def test_ddp_correctness(self):
    torch_xla.launch(self._ddp_correctness, args=(False, FLAGS.debug))

  # Ref: https://github.com/pytorch/xla/pull/8593
Did we file separate issue(s) to fix these failures?
Not yet. But I will do that right now!
Oops. I think GitHub is waiting for either @ManfeiBai, @lsy323, or @zpcore to accept it.
Fix: #8577
This PR reverts #8286 and bumps the CUDA version to 12.3. The latter is needed to successfully compile GPU-dependent source code that uses the CUgraphConditionalHandle driver API typedef, which is not available in CUDA 12.1.
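The version constraint behind the bump can be sketched as a small check. The helper name is hypothetical, and the 12.3 threshold follows this PR's description (the typedef is absent in 12.1 and compiles under 12.3), not a full survey of CUDA release notes.

```python
def supports_cugraph_conditional_handle(cuda_version: str) -> bool:
  # Per this PR: code referencing CUgraphConditionalHandle fails to
  # compile against CUDA 12.1 but builds against 12.3, so gate on
  # (major, minor) >= (12, 3). Helper name is illustrative.
  major, minor = (int(part) for part in cuda_version.split(".")[:2])
  return (major, minor) >= (12, 3)


print(supports_cugraph_conditional_handle("12.1"))  # False
print(supports_cugraph_conditional_handle("12.3"))  # True
```

Comparing version tuples rather than the raw string avoids the classic pitfall where "12.10" would sort before "12.3" lexicographically.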