
Conversation


@alec-flowers alec-flowers commented Oct 14, 2025

Overview:

We were using torch 2.7.1 since it has an aarch64 cu128 wheel. However, we started running into problems with the latest version bump, so we are switching to cu129, which simplifies the installs a lot.

Details:

Where should the reviewer start?

Related Issues: (use one of the action keywords Closes / Fixes / Resolves / Relates to)

  • closes GitHub issue: #xxx

Summary by CodeRabbit

  • New Features

    • ARM64 installations now use nightly PyTorch and TorchVision builds with CUDA-version-aware indexes.
    • Adds nightly Triton package support for ARM64.
  • Bug Fixes

    • Improves compatibility of GPU dependencies on ARM64 across different CUDA versions.
  • Chores

    • Updates dependency constraints to pin Torch and TorchVision; removes Torchaudio from constraints.
    • Updates installation messages to reflect nightly installs and clearer failure paths.

@alec-flowers alec-flowers requested review from a team as code owners October 14, 2025 21:39
@github-actions github-actions bot added the fix label Oct 14, 2025

coderabbitai bot commented Oct 14, 2025

Walkthrough

Switches ARM64 installs to PyTorch nightly builds chosen via a CUDA_VERSION-derived index; installs torch, torchvision, and pytorch_triton from nightly. Removes torchaudio handling from constraints generation. Updates constraints to pin torch and torchvision. Adjusts success/failure messages. Keeps AMD64 and non-source build behavior unchanged.

Changes

Cohort / File(s) Summary
vLLM installer logic and constraints
container/deps/vllm/install_vllm.sh
For ARM64: derive nightly cu index from CUDA_VERSION; install torch, torchvision, pytorch_triton from nightly dev builds; update constraints to pin torch/torchvision and drop torchaudio; revise logs/messages; retain prior flow for AMD64 and non-source builds.
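The CUDA_VERSION-to-index derivation described above could look roughly like this (a minimal sketch; the variable names and the commented uv invocation are illustrative assumptions, not the script's actual code):

```shell
# Illustrative sketch: derive the PyTorch nightly wheel index from CUDA_VERSION.
# Names are assumptions for illustration, not install_vllm.sh's actual code.
CUDA_VERSION="12.9"
CUDA_TAG="cu$(echo "$CUDA_VERSION" | tr -d '.')"   # 12.9 -> cu129
TORCH_INDEX="https://download.pytorch.org/whl/nightly/${CUDA_TAG}"
echo "Installing nightly torch/torchvision from ${TORCH_INDEX}"
# uv pip install --pre torch torchvision pytorch-triton \
#     --index-url "${TORCH_INDEX}"
```

The same derivation works for cu128 or any other CUDA minor, which is what makes the index "CUDA-version-aware" rather than hard-coded.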

Sequence Diagram(s)

sequenceDiagram
    autonumber
    actor User
    participant Script as install_vllm.sh
    participant Env as System Env
    participant PyTorchIdx as PyTorch Nightly Index
    participant Constraints as constraints.txt

    User->>Script: Run installer
    Script->>Env: Detect ARCH, CUDA_VERSION, build mode
    alt ARM64 with CUDA
        Script->>PyTorchIdx: Select nightly index (cu from CUDA_VERSION)
        Script->>PyTorchIdx: Install torch==*.dev, torchvision==*.dev
        Script->>PyTorchIdx: Install pytorch_triton (pre-release)
        Script->>Constraints: Write pins for torch, torchvision (no torchaudio)
        Script-->>User: Log nightly install success
    else Other (AMD64 / non-source)
        Script->>Env: Use existing install path
        Script-->>User: Log existing path taken
    end
    opt Failure
        Script-->>User: Log failure and exit non-zero
    end
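The constraints step in the flow above could be sketched as follows (the file name, pinned versions, and grep-based filtering are illustrative assumptions, not the script's actual approach):

```shell
# Sketch: pin torch/torchvision in a constraints file and drop torchaudio.
# Versions and file name are made up for illustration.
cat > constraints.txt <<'EOF'
torch==2.9.0.dev20251001+cu129
torchvision==0.24.0.dev20251001+cu129
torchaudio==2.8.0
EOF

# Remove the torchaudio pin, keeping only torch and torchvision.
grep -v '^torchaudio' constraints.txt > constraints.tmp
mv constraints.tmp constraints.txt
cat constraints.txt
```

Passing such a file via `pip install -c constraints.txt` (or the uv equivalent) keeps later installs from silently replacing the nightly wheels.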

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

Poem

I hop through nightly stars so bright,
Fetching torch by moonlit byte,
Triton tails and vision’s gleam,
CUDA whispers fuel the stream.
No audio crumbs—lighter trail—
Constraints pinned snug, we shall prevail.
Thump-thump: the build sails without fail.

Pre-merge checks

❌ Failed checks (1 warning)
  • Description Check (⚠️ Warning): The pull request description uses the correct template headings and provides a high-level overview, but the "Details" and "Where should the reviewer start?" sections remain as placeholder comments and lack actual content, and the related issues section still contains a dummy issue number. Therefore, the description is incomplete and does not fully inform reviewers of what changed or where to focus. Resolution: Please complete the "Details" section with a summary of the specific changes made (e.g., updates to install_vllm.sh, torch version bump, constraint adjustments), specify the files or code areas reviewers should examine under "Where should the reviewer start?", and replace the placeholder in the related issues section with the actual issue number.
✅ Passed checks (2 passed)
  • Docstring Coverage (✅ Passed): No functions found in the changes. Docstring coverage check skipped.
  • Title Check (✅ Passed): The title specifically highlights updating the aarch64 install path to target CUDA 12.9, which corresponds to the change in install_vllm.sh, making it clearly related to the pull request's main modification.



@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 1

📜 Review details

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 4c4130e and d173bfc.

📒 Files selected for processing (1)
  • container/deps/vllm/install_vllm.sh (1 hunks)
🧰 Additional context used
🧠 Learnings (1)
📚 Learning: 2025-07-21T00:10:56.947Z
Learnt from: zaristei
PR: ai-dynamo/dynamo#2020
File: container/deps/vllm/install_vllm.sh:115-118
Timestamp: 2025-07-21T00:10:56.947Z
Learning: Graceful fallback for PyTorch wheel installation is broken on ARM architecture, so immediate exit on pinned version failure is preferred over fallback mechanisms in container/deps/vllm/install_vllm.sh for ARM64.

Applied to files:

  • container/deps/vllm/install_vllm.sh
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (5)
  • GitHub Check: trtllm (amd64)
  • GitHub Check: sglang
  • GitHub Check: vllm (amd64)
  • GitHub Check: vllm (arm64)
  • GitHub Check: Build and Test - dynamo

Signed-off-by: alec-flowers <[email protected]>
@pull-request-size pull-request-size bot added size/M and removed size/S labels Oct 15, 2025
@alec-flowers alec-flowers changed the title from "fix: aarch64 path to do the same as upstream vLLM" to "fix: aarch64 path to use cu129" Oct 15, 2025

# If libmlx5.so is not shipped with the 24.04 rdma-core packaging, CMake will fail
# when looking for the generic dev-name .so, so we symlink .so.1 -> .so
RUN ln -sf /usr/lib/aarch64-linux-gnu/libmlx5.so.1 /usr/lib/aarch64-linux-gnu/libmlx5.so || true
Contributor


You should probably use uname here. Does this not happen with x86?

See https://github.com/sgl-project/sglang/blob/b2c856692092ddc8999520cafaf610cf9db8c8cd/docker/Dockerfile#L63-L64

Contributor Author


No, this doesn't happen on x86 because there we are on an older Ubuntu 24.04 that ships with the right .so. I want it to fail on x86 for now.
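For illustration, the reviewer's uname suggestion could be sketched like this (the paths follow the snippet in the thread; the guard itself is an assumption, not the PR's actual code):

```shell
# Sketch: arch-guarded symlink per the uname suggestion above. On non-aarch64
# hosts nothing happens, which keeps x86 failing loudly, as the author intends.
ARCH="$(uname -m)"
if [ "$ARCH" = "aarch64" ]; then
    ln -sf /usr/lib/aarch64-linux-gnu/libmlx5.so.1 \
           /usr/lib/aarch64-linux-gnu/libmlx5.so || true
fi
echo "arch: $ARCH"
```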


@ishandhanani ishandhanani left a comment


Approving to unblock but PTAL at my comment

echo "BUILD_START_TIME=${BUILD_START_TIME}" >> $GITHUB_ENV
echo "image_tag=$IMAGE_TAG" >> $GITHUB_OUTPUT
# Collect optional overrides provided by the workflow
Contributor


This code block seems a bit messy.

Can we instead set some reasonable defaults for the inputs here so we can pass the inputs more directly as args to the build.sh script? The pattern of lots of if statements here seems undesirable.

Contributor Author


I think the thing is that we already have reasonable defaults set inside the Docker container. I don't really want to maintain them in both places; this code just overrides them when necessary.

--use-sccache \
--sccache-bucket "$SCCACHE_S3_BUCKET" \
--sccache-region "$AWS_DEFAULT_REGION"
--sccache-region "$AWS_DEFAULT_REGION" $EXTRA_ARGS

@dillon-cullinan dillon-cullinan Oct 15, 2025


With reasonable defaults, we should be able to just do the below and delete the if statements. Maybe the torch arg is an exception?

Suggested change
--sccache-region "$AWS_DEFAULT_REGION" $EXTRA_ARGS
--sccache-region "$AWS_DEFAULT_REGION" \
--base-image-tag ${{ inputs.base_image_tag }} \
--build-arg CUDA_VERSION=${{ inputs.cuda_version }} \
--build-arg RUNTIME_IMAGE_TAG=${{ inputs.runtime_image_tag }} \
--build-arg TORCH_BACKEND=${{ inputs.torch_backend }}
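For comparison, the conditional-override pattern the author defends could look like this (a sketch; the input names and values are illustrative, not the workflow's actual identifiers):

```shell
# Sketch of the conditional-override pattern: a flag is appended only when the
# workflow input is non-empty, so the container's baked-in defaults apply
# otherwise. Input names and values are made up for illustration.
BASE_IMAGE_TAG="25.01-cuda12.9"   # pretend this workflow input was set
TORCH_BACKEND=""                  # pretend this one was left empty

EXTRA_ARGS=""
if [ -n "${BASE_IMAGE_TAG}" ]; then
    EXTRA_ARGS="$EXTRA_ARGS --base-image-tag ${BASE_IMAGE_TAG}"
fi
if [ -n "${TORCH_BACKEND}" ]; then
    EXTRA_ARGS="$EXTRA_ARGS --build-arg TORCH_BACKEND=${TORCH_BACKEND}"
fi
echo "build.sh${EXTRA_ARGS}"
```

The trade-off in the thread is exactly this: the if statements avoid duplicating defaults between the workflow and the Dockerfile, at the cost of a noisier workflow file.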
