
Conversation


@alec-flowers alec-flowers commented Oct 14, 2025

Overview:

We were using torch 2.7.1 since it has an aarch64 cu128 wheel. However, we started running into problems with the latest version bump, so we are switching to cu129, which simplifies the installs a lot.

Details:

Where should the reviewer start?

Related Issues: (use one of the action keywords Closes / Fixes / Resolves / Relates to)

  • closes GitHub issue: #xxx

Summary by CodeRabbit

  • New Features

    • ARM64 installations now use nightly PyTorch and TorchVision builds with CUDA-version-aware indexes.
    • Adds nightly Triton package support for ARM64.
  • Bug Fixes

    • Improves compatibility of GPU dependencies on ARM64 across different CUDA versions.
  • Chores

    • Updates dependency constraints to pin Torch and TorchVision; removes Torchaudio from constraints.
    • Updates installation messages to reflect nightly installs and clearer failure paths.

@alec-flowers alec-flowers requested review from a team as code owners October 14, 2025 21:39
@github-actions github-actions bot added the fix label Oct 14, 2025

coderabbitai bot commented Oct 14, 2025

Walkthrough

Switches ARM64 installs to PyTorch nightly builds chosen via a CUDA_VERSION-derived index; installs torch, torchvision, and pytorch_triton from nightly. Removes torchaudio handling from constraints generation. Updates constraints to pin torch and torchvision. Adjusts success/failure messages. Keeps AMD64 and non-source build behavior unchanged.

Changes

Cohort / File(s) Summary
vLLM installer logic and constraints
container/deps/vllm/install_vllm.sh
For ARM64: derive nightly cu index from CUDA_VERSION; install torch, torchvision, pytorch_triton from nightly dev builds; update constraints to pin torch/torchvision and drop torchaudio; revise logs/messages; retain prior flow for AMD64 and non-source builds.
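The CUDA_VERSION-to-index derivation described above could look roughly like this (a minimal sketch; the variable names and the commented uv invocation are illustrative assumptions, not the script's actual code):

```shell
# Illustrative sketch: derive the PyTorch nightly wheel index from CUDA_VERSION.
# Names are assumptions for illustration, not install_vllm.sh's actual code.
CUDA_VERSION="12.9"
CUDA_TAG="cu$(echo "$CUDA_VERSION" | tr -d '.')"   # 12.9 -> cu129
TORCH_INDEX="https://download.pytorch.org/whl/nightly/${CUDA_TAG}"
echo "Installing nightly torch/torchvision from ${TORCH_INDEX}"
# uv pip install --pre torch torchvision pytorch-triton \
#     --index-url "${TORCH_INDEX}"
```

The same derivation works for cu128 or any other CUDA minor, which is what makes the index "CUDA-version-aware" rather than hard-coded.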

Sequence Diagram(s)

sequenceDiagram
    autonumber
    actor User
    participant Script as install_vllm.sh
    participant Env as System Env
    participant PyTorchIdx as PyTorch Nightly Index
    participant Constraints as constraints.txt

    User->>Script: Run installer
    Script->>Env: Detect ARCH, CUDA_VERSION, build mode
    alt ARM64 with CUDA
        Script->>PyTorchIdx: Select nightly index (cu from CUDA_VERSION)
        Script->>PyTorchIdx: Install torch==*.dev, torchvision==*.dev
        Script->>PyTorchIdx: Install pytorch_triton (pre-release)
        Script->>Constraints: Write pins for torch, torchvision (no torchaudio)
        Script-->>User: Log nightly install success
    else Other (AMD64 / non-source)
        Script->>Env: Use existing install path
        Script-->>User: Log existing path taken
    end
    opt Failure
        Script-->>User: Log failure and exit non-zero
    end
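The constraints step in the flow above could be sketched as follows (the file name, pinned versions, and grep-based filtering are illustrative assumptions, not the script's actual approach):

```shell
# Sketch: pin torch/torchvision in a constraints file and drop torchaudio.
# Versions and file name are made up for illustration.
cat > constraints.txt <<'EOF'
torch==2.9.0.dev20251001+cu129
torchvision==0.24.0.dev20251001+cu129
torchaudio==2.8.0
EOF

# Remove the torchaudio pin, keeping only torch and torchvision.
grep -v '^torchaudio' constraints.txt > constraints.tmp
mv constraints.tmp constraints.txt
cat constraints.txt
```

Passing such a file via `pip install -c constraints.txt` (or the uv equivalent) keeps later installs from silently replacing the nightly wheels.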

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

Poem

I hop through nightly stars so bright,
Fetching torch by moonlit byte,
Triton tails and vision’s gleam,
CUDA whispers fuel the stream.
No audio crumbs—lighter trail—
Constraints pinned snug, we shall prevail.
Thump-thump: the build sails without fail.

Pre-merge checks

❌ Failed checks (1 warning)
  • Description Check (⚠️ Warning): The pull request description uses the correct template headings and provides a high-level overview, but the "Details" and "Where should the reviewer start?" sections remain as placeholder comments and lack actual content, and the related issues section still contains a dummy issue number. Therefore, the description is incomplete and does not fully inform reviewers of what changed or where to focus. Resolution: Please complete the "Details" section with a summary of the specific changes made (e.g., updates to install_vllm.sh, torch version bump, constraint adjustments), specify the files or code areas reviewers should examine under "Where should the reviewer start?", and replace the placeholder in the related issues section with the actual issue number.
✅ Passed checks (2 passed)
  • Docstring Coverage (✅ Passed): No functions found in the changes. Docstring coverage check skipped.
  • Title Check (✅ Passed): The title specifically highlights updating the aarch64 install path to target CUDA 12.9, which corresponds to the change in install_vllm.sh, making it clearly related to the pull request's main modification.



@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 1

📜 Review details

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 4c4130e and d173bfc.

📒 Files selected for processing (1)
  • container/deps/vllm/install_vllm.sh (1 hunks)
🧰 Additional context used
🧠 Learnings (1)
📚 Learning: 2025-07-21T00:10:56.947Z
Learnt from: zaristei
PR: ai-dynamo/dynamo#2020
File: container/deps/vllm/install_vllm.sh:115-118
Timestamp: 2025-07-21T00:10:56.947Z
Learning: Graceful fallback for PyTorch wheel installation is broken on ARM architecture, so immediate exit on pinned version failure is preferred over fallback mechanisms in container/deps/vllm/install_vllm.sh for ARM64.

Applied to files:

  • container/deps/vllm/install_vllm.sh
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (5)
  • GitHub Check: trtllm (amd64)
  • GitHub Check: sglang
  • GitHub Check: vllm (amd64)
  • GitHub Check: vllm (arm64)
  • GitHub Check: Build and Test - dynamo

Signed-off-by: alec-flowers <[email protected]>
@pull-request-size pull-request-size bot added size/M and removed size/S labels Oct 15, 2025
@alec-flowers alec-flowers changed the title from "fix: aarch64 path to do the same as upstream vLLM" to "fix: aarch64 path to use cu129" Oct 15, 2025

# If libmlx5.so is not shipped with the 24.04 rdma-core packaging, CMake will fail
# when looking for the generic dev-name .so, so we symlink .so.1 -> .so
RUN ln -sf /usr/lib/aarch64-linux-gnu/libmlx5.so.1 /usr/lib/aarch64-linux-gnu/libmlx5.so || true
Contributor


You should probably use uname here. Does this not happen with x86?

See https://github.com/sgl-project/sglang/blob/b2c856692092ddc8999520cafaf610cf9db8c8cd/docker/Dockerfile#L63-L64

Contributor Author


No, this doesn't happen on x86 because there we are on an older Ubuntu 24.04 that ships with the right .so. I want it to fail on x86 for now.
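For illustration, the reviewer's uname suggestion could be sketched like this (the paths follow the snippet in the thread; the guard itself is an assumption, not the PR's actual code):

```shell
# Sketch: arch-guarded symlink per the uname suggestion above. On non-aarch64
# hosts nothing happens, which keeps x86 failing loudly, as the author intends.
ARCH="$(uname -m)"
if [ "$ARCH" = "aarch64" ]; then
    ln -sf /usr/lib/aarch64-linux-gnu/libmlx5.so.1 \
           /usr/lib/aarch64-linux-gnu/libmlx5.so || true
fi
echo "arch: $ARCH"
```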


@ishandhanani ishandhanani left a comment


Approving to unblock but PTAL at my comment

echo "BUILD_START_TIME=${BUILD_START_TIME}" >> $GITHUB_ENV
echo "image_tag=$IMAGE_TAG" >> $GITHUB_OUTPUT
# Collect optional overrides provided by the workflow
Contributor


This code block seems a bit messy.

Can we instead set some reasonable defaults for the inputs here so we can pass the inputs more directly as args to the build.sh script? The pattern of lots of if statements here seems undesirable.

Contributor Author


I think the thing is that we already have reasonable defaults set inside the Docker container. I don't really want to maintain them in both places; this code just overrides them when necessary.

--use-sccache \
--sccache-bucket "$SCCACHE_S3_BUCKET" \
--sccache-region "$AWS_DEFAULT_REGION"
--sccache-region "$AWS_DEFAULT_REGION" $EXTRA_ARGS

@dillon-cullinan dillon-cullinan Oct 15, 2025


With reasonable defaults, we should be able to just do the below and delete the if statements. Maybe the torch arg is an exception?

Suggested change
--sccache-region "$AWS_DEFAULT_REGION" $EXTRA_ARGS
--sccache-region "$AWS_DEFAULT_REGION" \
--base-image-tag ${{ inputs.base_image_tag }} \
--build-arg CUDA_VERSION=${{ inputs.cuda_version }} \
--build-arg RUNTIME_IMAGE_TAG=${{ inputs.runtime_image_tag }} \
--build-arg TORCH_BACKEND=${{ inputs.torch_backend }}
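For comparison, the conditional-override pattern the author defends could look like this (a sketch; the input names and values are illustrative, not the workflow's actual identifiers):

```shell
# Sketch of the conditional-override pattern: a flag is appended only when the
# workflow input is non-empty, so the container's baked-in defaults apply
# otherwise. Input names and values are made up for illustration.
BASE_IMAGE_TAG="25.01-cuda12.9"   # pretend this workflow input was set
TORCH_BACKEND=""                  # pretend this one was left empty

EXTRA_ARGS=""
if [ -n "${BASE_IMAGE_TAG}" ]; then
    EXTRA_ARGS="$EXTRA_ARGS --base-image-tag ${BASE_IMAGE_TAG}"
fi
if [ -n "${TORCH_BACKEND}" ]; then
    EXTRA_ARGS="$EXTRA_ARGS --build-arg TORCH_BACKEND=${TORCH_BACKEND}"
fi
echo "build.sh${EXTRA_ARGS}"
```

The trade-off in the thread is exactly this: the if statements avoid duplicating defaults between the workflow and the Dockerfile, at the cost of a noisier workflow file.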
