Fix CUDA plugin CI. #8593

Merged on Feb 12, 2025 (20 commits)
Changes from 15 commits
12 changes: 4 additions & 8 deletions .github/workflows/_build_torch_with_cuda.yml
@@ -22,6 +22,9 @@ jobs:
       image: ${{ inputs.dev-image }}
     env:
       _GLIBCXX_USE_CXX11_ABI: 0
+      TORCH_CUDA_ARCH_LIST: "7.0;7.5;8.0;9.0"
+      USE_CUDA: 1
+      MAX_JOBS: 24
     steps:
       - name: Checkout actions
         uses: actions/checkout@v4
@@ -34,18 +37,11 @@ jobs:
         with:
           torch-commit: ${{ inputs.torch-commit }}
           cuda: true
-      - name: Checkout PyTorch Repo
-        uses: actions/checkout@v4
-        with:
-          repository: pytorch/pytorch
-          path: pytorch
-          ref: ${{ inputs.torch-commit }}
-          submodules: recursive
       - name: Build PyTorch with CUDA enabled
         shell: bash
         run: |
           cd pytorch
-          TORCH_CUDA_ARCH_LIST="5.2;8.6" USE_CUDA=1 MAX_JOBS="$(nproc --ignore=4)" python setup.py bdist_wheel
+          python setup.py bdist_wheel
       - name: Upload wheel
         uses: actions/upload-artifact@v4
         with:
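Note on the architecture list: this file changes `TORCH_CUDA_ARCH_LIST` from `5.2;8.6` to `7.0;7.5;8.0;9.0`. The sketch below is one way to check whether such a list contains native code for a given device, using the CUDA rule that a binary built for compute capability X.y runs on devices X.z with z >= y. The helpers `parse_arch_list` and `has_native_code_for` are hypothetical names, the 7.5 device is an assumption for illustration only, and the parser handles just plain `major.minor` entries with an optional `+PTX` suffix.

```python
def parse_arch_list(value):
    """Parse a semicolon-separated arch list such as "7.0;7.5" into tuples."""
    archs = []
    for item in value.split(";"):
        item = item.strip()
        if not item:
            continue
        # Drop an optional "+PTX" suffix; PTX forward compatibility is ignored here.
        item = item.split("+", 1)[0]
        major, minor = item.split(".")
        archs.append((int(major), int(minor)))
    return archs


def has_native_code_for(device, archs):
    """True if a listed arch has the device's major version and a minor
    version that does not exceed the device's minor version."""
    major, minor = device
    return any(a_major == major and a_minor <= minor
               for a_major, a_minor in archs)


old_archs = parse_arch_list("5.2;8.6")
new_archs = parse_arch_list("7.0;7.5;8.0;9.0")

# Assumption: a device with compute capability 7.5, chosen only for illustration.
device = (7, 5)
print(has_native_code_for(device, old_archs))  # False
print(has_native_code_for(device, new_archs))  # True
```

Under that assumption, the old list would not cover the device natively, while the new one would.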
8 changes: 6 additions & 2 deletions .github/workflows/_test_requiring_torch_cuda.yml
@@ -94,8 +94,12 @@ jobs:
           pip install -U --pre jaxlib -f https://storage.googleapis.com/jax-releases/jaxlib_nightly_releases.html
           pip install -U --pre jax-cuda12-pjrt jax-cuda12-plugin -f https://storage.googleapis.com/jax-releases/jax_cuda_plugin_nightly_releases.html
           pip install -U --pre jax -f https://storage.googleapis.com/jax-releases/jax_nightly_releases.html
-          pip install --no-deps triton==2.3.0
         if: ${{ matrix.run_triton_tests }}
+      - name: Install Triton
+        shell: bash
+        run: |
+          cd pytorch
+          make triton
       - name: Python Tests
         shell: bash
         run: |
@@ -106,5 +110,5 @@ jobs:
       - name: Triton Tests
         shell: bash
         run: |
-          PJRT_DEVICE=CUDA TRITON_PTXAS_PATH=/usr/local/cuda-12.1/bin/ptxas python pytorch/xla/test/test_triton.py
+          PJRT_DEVICE=CUDA TRITON_PTXAS_PATH=/usr/local/cuda-12.3/bin/ptxas python pytorch/xla/test/test_triton.py
         if: ${{ matrix.run_triton_tests }}
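Note on the CUDA paths: the `12.1` to `12.3` bump appears as a hard-coded path in several workflow files. The sketch below derives those paths from a single version string, assuming the `/usr/local/cuda-<version>` layout seen in this diff. `cuda_paths` is a hypothetical helper and not part of the repository.

```python
from pathlib import PurePosixPath


def cuda_paths(version, root="/usr/local"):
    """Return the toolkit paths used in the workflows for one CUDA version."""
    home = PurePosixPath(root) / f"cuda-{version}"
    return {
        "bin": str(home / "bin"),
        "lib64": str(home / "lib64"),
        "ptxas": str(home / "bin" / "ptxas"),
    }


paths = cuda_paths("12.3")
print(paths["ptxas"])  # /usr/local/cuda-12.3/bin/ptxas
```

Keeping the version in one place would make a later bump to a different toolkit a one-line change.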
82 changes: 39 additions & 43 deletions .github/workflows/build_and_test.yml
@@ -42,25 +42,23 @@ jobs:
     secrets:
       gcloud-service-key: ${{ secrets.GCLOUD_SERVICE_KEY }}
 
-  # Disable due to https://github.com/pytorch/xla/issues/8199
-  # build-torch-with-cuda:
-  #   name: "Build PyTorch with CUDA"
-  #   uses: ./.github/workflows/_build_torch_with_cuda.yml
-  #   needs: get-torch-commit
-  #   with:
-  #     # note that to build a torch wheel with CUDA enabled, we do not need a GPU runner.
-  #     dev-image: us-central1-docker.pkg.dev/tpu-pytorch-releases/docker/development:3.10_cuda_12.1
-  #     torch-commit: ${{needs.get-torch-commit.outputs.torch_commit}}
-  #     runner: linux.24xlarge
+  build-torch-with-cuda:
+    name: "Build PyTorch with CUDA"
+    uses: ./.github/workflows/_build_torch_with_cuda.yml
Collaborator:

I didn't know that we also need to build PyTorch with CUDA. Does building PyTorch with CPU not work for some reason? If we need to build PyTorch with CUDA, do you know if 12.3 is a supported CUDA version to build PyTorch with? From pytorch/pytorch#138609 it looks like upstream picked 12.4 or 12.6.

Collaborator (Author):

> Does building PyTorch with CPU not work for some reason?

There are a few tests that only work when PyTorch is also built with CUDA support. That said, I don't know whether we are actually testing them on CI.

xla/test/test_operations.py, lines 165 to 168 in 065cb5b:

    def onlyIfTorchSupportsCUDA(fn):
      return unittest.skipIf(
          not torch.cuda.is_available(), reason="requires PyTorch CUDA support")(
              fn)
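
Note: to illustrate how a guard like this behaves, here is a minimal sketch that runs without PyTorch. The flag `TORCH_HAS_CUDA` stands in for `torch.cuda.is_available()` and is an assumption of this example, as are the names `only_if` and `Example`. With the flag set to false, the guarded test is reported as skipped rather than failed, which is why a CPU-only PyTorch build can leave such tests unexercised on CI.

```python
import unittest


def only_if(condition, reason):
    """Build a decorator that skips the test unless `condition` is true."""
    def decorator(fn):
        return unittest.skipIf(not condition, reason=reason)(fn)
    return decorator


# Assumption: stands in for torch.cuda.is_available() so the sketch needs no PyTorch.
TORCH_HAS_CUDA = False


class Example(unittest.TestCase):

    @only_if(TORCH_HAS_CUDA, "requires PyTorch CUDA support")
    def test_needs_torch_cuda(self):
        self.fail("not reached while the test is skipped")

    def test_always_runs(self):
        self.assertTrue(True)


suite = unittest.defaultTestLoader.loadTestsFromTestCase(Example)
result = unittest.TestResult()
suite.run(result)

# One test is skipped with the given reason; nothing fails.
for test, reason in result.skipped:
    print(test.id().rsplit(".", 1)[-1], "skipped:", reason)
print("failures:", len(result.failures))
```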

Collaborator (Author):

Regarding the CUDA version: I don't think 12.3 is a supported version. However, I believe it should compile fine. That said, I think it would be better to change it to a supported version.

Collaborator:

Got it. I think we can address this as a follow up.

Collaborator:

Could you add a TODO here mentioning #8700?

+    needs: get-torch-commit
+    with:
+      # note that to build a torch wheel with CUDA enabled, we do not need a GPU runner.
+      dev-image: us-central1-docker.pkg.dev/tpu-pytorch-releases/docker/development:3.10_cuda_12.3
+      torch-commit: ${{needs.get-torch-commit.outputs.torch_commit}}
+      runner: linux.24xlarge
 
-  # Disable due to https://github.com/pytorch/xla/issues/8199
-  # build-cuda-plugin:
-  #   name: "Build XLA CUDA plugin"
-  #   uses: ./.github/workflows/_build_plugin.yml
-  #   with:
-  #     dev-image: us-central1-docker.pkg.dev/tpu-pytorch-releases/docker/development:3.10_cuda_12.1
-  #   secrets:
-  #     gcloud-service-key: ${{ secrets.GCLOUD_SERVICE_KEY }}
+  build-cuda-plugin:
+    name: "Build XLA CUDA plugin"
+    uses: ./.github/workflows/_build_plugin.yml
+    with:
+      dev-image: us-central1-docker.pkg.dev/tpu-pytorch-releases/docker/development:3.10_cuda_12.3
+    secrets:
+      gcloud-service-key: ${{ secrets.GCLOUD_SERVICE_KEY }}
 
   test-python-cpu:
     name: "CPU tests"
@@ -74,32 +72,30 @@ jobs:
     secrets:
       gcloud-service-key: ${{ secrets.GCLOUD_SERVICE_KEY }}
 
-  # Disable due to https://github.com/pytorch/xla/issues/8199
-  # test-cuda:
-  #   name: "GPU tests"
-  #   uses: ./.github/workflows/_test.yml
-  #   needs: [build-torch-xla, build-cuda-plugin, get-torch-commit]
-  #   with:
-  #     dev-image: us-central1-docker.pkg.dev/tpu-pytorch-releases/docker/development:3.10_cuda_12.1
-  #     runner: linux.8xlarge.nvidia.gpu
-  #     timeout-minutes: 300
-  #     collect-coverage: false
-  #     install-cuda-plugin: true
-  #     torch-commit: ${{needs.get-torch-commit.outputs.torch_commit}}
-  #   secrets:
-  #     gcloud-service-key: ${{ secrets.GCLOUD_SERVICE_KEY }}
+  test-cuda:
+    name: "GPU tests"
+    uses: ./.github/workflows/_test.yml
+    needs: [build-torch-xla, build-cuda-plugin, get-torch-commit]
+    with:
+      dev-image: us-central1-docker.pkg.dev/tpu-pytorch-releases/docker/development:3.10_cuda_12.3
+      runner: linux.g4dn.12xlarge.nvidia.gpu
+      timeout-minutes: 300
+      collect-coverage: false
+      install-cuda-plugin: true
+      torch-commit: ${{needs.get-torch-commit.outputs.torch_commit}}
+    secrets:
+      gcloud-service-key: ${{ secrets.GCLOUD_SERVICE_KEY }}
 
-  # Disable due to https://github.com/pytorch/xla/issues/8199
-  # test-cuda-with-pytorch-cuda-enabled:
-  #   name: "GPU tests requiring torch CUDA"
-  #   uses: ./.github/workflows/_test_requiring_torch_cuda.yml
-  #   needs: [build-torch-with-cuda, build-torch-xla, build-cuda-plugin, get-torch-commit]
-  #   with:
-  #     dev-image: us-central1-docker.pkg.dev/tpu-pytorch-releases/docker/development:3.10_cuda_12.1
-  #     runner: linux.8xlarge.nvidia.gpu
-  #     timeout-minutes: 300
-  #     collect-coverage: false
-  #     torch-commit: ${{needs.get-torch-commit.outputs.torch_commit}}
+  test-cuda-with-pytorch-cuda-enabled:
+    name: "GPU tests requiring torch CUDA"
+    uses: ./.github/workflows/_test_requiring_torch_cuda.yml
+    needs: [build-torch-with-cuda, build-torch-xla, build-cuda-plugin, get-torch-commit]
+    with:
+      dev-image: us-central1-docker.pkg.dev/tpu-pytorch-releases/docker/development:3.10_cuda_12.3
+      runner: linux.8xlarge.nvidia.gpu
+      timeout-minutes: 300
+      collect-coverage: false
+      torch-commit: ${{needs.get-torch-commit.outputs.torch_commit}}
 
   test-tpu:
     name: "TPU tests"
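Note on job ordering: the re-enabled jobs are tied together through `needs:`. The sketch below reproduces only the `needs` edges visible in this diff and computes one valid execution order with Python's standard `graphlib`. Dependencies of `build-torch-xla`, `build-cuda-plugin`, and `get-torch-commit` are not shown in the diff, so treating them as having none is an assumption of this example.

```python
from graphlib import TopologicalSorter

# Each job maps to the jobs it lists under `needs:` in this diff.
needs = {
    "build-torch-with-cuda": ["get-torch-commit"],
    "test-cuda": ["build-torch-xla", "build-cuda-plugin", "get-torch-commit"],
    "test-cuda-with-pytorch-cuda-enabled": [
        "build-torch-with-cuda",
        "build-torch-xla",
        "build-cuda-plugin",
        "get-torch-commit",
    ],
}

# static_order() yields every job after all of its prerequisites.
order = list(TopologicalSorter(needs).static_order())
for position, job in enumerate(order, start=1):
    print(position, job)
```

Any order produced this way places `build-torch-with-cuda` before `test-cuda-with-pytorch-cuda-enabled`, which is the dependency that makes the CUDA-enabled wheel available to those tests.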
4 changes: 2 additions & 2 deletions .github/workflows/setup/action.yml
@@ -29,8 +29,8 @@ runs:
     - name: Setup CUDA environment
       shell: bash
       run: |
-        echo "PATH=$PATH:/usr/local/cuda-12.1/bin" >> $GITHUB_ENV
-        echo "LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/usr/local/cuda-12.1/lib64" >> $GITHUB_ENV
+        echo "PATH=$PATH:/usr/local/cuda-12.3/bin" >> $GITHUB_ENV
+        echo "LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/usr/local/cuda-12.3/lib64" >> $GITHUB_ENV
       if: ${{ inputs.cuda }}
     - name: Setup gcloud
       shell: bash
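Note on `$GITHUB_ENV`: the setup step makes the CUDA directories visible to later steps by appending `NAME=value` lines to the file whose path GitHub Actions provides in `GITHUB_ENV`. The sketch below shows the same mechanism in Python. `append_github_env` is a hypothetical helper, the temporary file stands in for the real `GITHUB_ENV` file, and the base `PATH` value is made up for illustration. Only single-line values are handled.

```python
import os
import tempfile


def append_github_env(name, value, env_file=None):
    """Append one NAME=value line to the GITHUB_ENV file (single-line values only)."""
    env_file = env_file or os.environ["GITHUB_ENV"]
    with open(env_file, "a", encoding="utf-8") as fh:
        fh.write(f"{name}={value}\n")


with tempfile.TemporaryDirectory() as tmp:
    # Assumption: a scratch file plays the role of the runner's GITHUB_ENV file.
    env_file = os.path.join(tmp, "github_env")
    append_github_env("PATH", "/usr/bin:/usr/local/cuda-12.3/bin", env_file)
    append_github_env("LD_LIBRARY_PATH", "/usr/local/cuda-12.3/lib64", env_file)
    with open(env_file, encoding="utf-8") as fh:
        written = fh.read()

print(written, end="")
```

Values written this way take effect in subsequent steps of the job, not in the step that writes them.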