fix: aarch64 path to use cu129 #3624
base: main
Walkthrough
Switches ARM64 installs to PyTorch nightly builds chosen via a CUDA_VERSION-derived index; installs torch, torchvision, and pytorch_triton from nightly. Removes torchaudio handling from constraints generation. Updates constraints to pin torch and torchvision. Adjusts success/failure messages. Keeps AMD64 and non-source build behavior unchanged.
Sequence Diagram(s)
sequenceDiagram
autonumber
actor User
participant Script as install_vllm.sh
participant Env as System Env
participant PyTorchIdx as PyTorch Nightly Index
participant Constraints as constraints.txt
User->>Script: Run installer
Script->>Env: Detect ARCH, CUDA_VERSION, build mode
alt ARM64 with CUDA
Script->>PyTorchIdx: Select nightly index (cu from CUDA_VERSION)
Script->>PyTorchIdx: Install torch==*.dev, torchvision==*.dev
Script->>PyTorchIdx: Install pytorch_triton (pre-release)
Script->>Constraints: Write pins for torch, torchvision (no torchaudio)
Script-->>User: Log nightly install success
else Other (AMD64 / non-source)
Script->>Env: Use existing install path
Script-->>User: Log existing path taken
end
opt Failure
Script-->>User: Log failure and exit non-zero
end
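The ARM64 branch of the diagram can be sketched in shell as follows. This is a minimal illustration, not the code from install_vllm.sh: the function names cuda_to_cu_tag, nightly_index_url, and write_constraints are hypothetical, the input is assumed to be a dotted major.minor CUDA version, and the URL assumes the public PyTorch nightly wheel layout (https://download.pytorch.org/whl/nightly/cuXYZ).

```shell
#!/usr/bin/env bash

# Hypothetical helper: turn "12.9" or "12.9.1" into "cu129".
# Assumes the version contains at least major.minor.
cuda_to_cu_tag() {
    local version="$1"
    local major minor
    major="${version%%.*}"      # text before the first dot
    minor="${version#*.}"       # text after the first dot
    minor="${minor%%.*}"        # drop any patch component
    printf 'cu%s%s\n' "$major" "$minor"
}

# Hypothetical helper: build the nightly index URL for a CUDA version.
nightly_index_url() {
    printf 'https://download.pytorch.org/whl/nightly/%s\n' "$(cuda_to_cu_tag "$1")"
}

# Hypothetical helper: write pins for torch and torchvision only
# (no torchaudio), mirroring the walkthrough's description.
write_constraints() {
    local torch_version="$1" torchvision_version="$2" out_file="$3"
    printf 'torch==%s\ntorchvision==%s\n' \
        "$torch_version" "$torchvision_version" > "$out_file"
}

# Example: print the index that would be selected for CUDA 12.9.
# The URL could then be handed to an installer, for instance:
#   pip install --pre --index-url "$url" torch torchvision
nightly_index_url "12.9"
```

Deriving the tag from CUDA_VERSION keeps a single source of truth: bumping the CUDA version moves the wheel index with it, instead of requiring a second hard-coded cu tag to be updated.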
Estimated code review effort: 🎯 3 (Moderate) | ⏱️ ~20 minutes
Pre-merge checks
❌ Failed checks (1 warning)
✅ Passed checks (2 passed)
Actionable comments posted: 1
📜 Review details
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (1)
container/deps/vllm/install_vllm.sh (1 hunk)
🧰 Additional context used
🧠 Learnings (1)
📚 Learning: 2025-07-21T00:10:56.947Z
Learnt from: zaristei
PR: ai-dynamo/dynamo#2020
File: container/deps/vllm/install_vllm.sh:115-118
Timestamp: 2025-07-21T00:10:56.947Z
Learning: Graceful fallback for PyTorch wheel installation is broken on ARM architecture, so immediate exit on pinned version failure is preferred over fallback mechanisms in container/deps/vllm/install_vllm.sh for ARM64.
Applied to files:
container/deps/vllm/install_vllm.sh
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (5)
- GitHub Check: trtllm (amd64)
- GitHub Check: sglang
- GitHub Check: vllm (amd64)
- GitHub Check: vllm (arm64)
- GitHub Check: Build and Test - dynamo
Signed-off-by: alec-flowers <[email protected]>
d173bfc to 60924d4
Signed-off-by: alec-flowers <[email protected]>
Signed-off-by: alec-flowers <[email protected]>
f42ac6e to 2e9b90b
# if libmlx5.so not shipped with 24.04 rdma-core packaging, CMAKE will fail when looking for
# generic dev name .so so we symlink .s0.1 -> .so
RUN ln -sf /usr/lib/aarch64-linux-gnu/libmlx5.so.1 /usr/lib/aarch64-linux-gnu/libmlx5.so || true
You should probably use uname here. Does this not happen with x86?
No, this doesn't happen on x86, because there we are on an older Ubuntu 24.04 that ships with the right .so. I want it to fail on x86 for now.
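For illustration, the reviewer's uname suggestion might look like the sketch below. It is an assumption about how such a guard could be written, not code from this PR: the function name link_mlx5_if_arm64 is hypothetical, and the architecture and library directory are parameters only so the logic can be exercised outside the container.

```shell
#!/usr/bin/env bash

# Hypothetical helper: create the generic libmlx5.so name only on aarch64,
# and only when the versioned library is actually present.
# $1: architecture string (defaults to the output of `uname -m`)
# $2: library directory (defaults to the path used in the Dockerfile line)
link_mlx5_if_arm64() {
    local arch="${1:-$(uname -m)}"
    local lib_dir="${2:-/usr/lib/aarch64-linux-gnu}"
    if [ "$arch" = "aarch64" ] && [ -e "$lib_dir/libmlx5.so.1" ]; then
        # -s: symbolic link, -f: replace an existing link of the same name
        ln -sf "$lib_dir/libmlx5.so.1" "$lib_dir/libmlx5.so"
    fi
}

# Usage inside an image build (not executed here):
#   link_mlx5_if_arm64
```

On x86 the function is a no-op, so a missing libmlx5.so would still surface as a CMake failure there, which matches the author's stated preference.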
Approving to unblock but PTAL at my comment
echo "BUILD_START_TIME=${BUILD_START_TIME}" >> $GITHUB_ENV
echo "image_tag=$IMAGE_TAG" >> $GITHUB_OUTPUT
# Collect optional overrides provided by the workflow
This code block seems a bit messy.
Can we instead set some reasonable defaults for the inputs here so we can just pass the args with the inputs more directly to the build.sh script? Seems like a very unwanted pattern of lots of if statements here.
I think the thing is that we already have reasonable defaults set inside the Docker container. I don't really want to maintain them in both places; this code just overrides them when necessary.
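The override-only-when-set pattern described here can be sketched as below. This is an assumption about the shape of the workflow step, not the PR's actual code: the INPUT_* variable names and the helper names are hypothetical, the flag names are taken from the suggestion in this thread, and values are assumed to contain no whitespace because EXTRA_ARGS is later expanded unquoted.

```shell
#!/usr/bin/env bash

# Hypothetical helper: append "flag value" to EXTRA_ARGS only when value is non-empty.
append_override() {
    local flag="$1" value="$2"
    if [ -n "$value" ]; then
        EXTRA_ARGS="${EXTRA_ARGS:+$EXTRA_ARGS }$flag $value"
    fi
}

# Hypothetical helper: rebuild EXTRA_ARGS from optional workflow inputs.
# ${VAR:+text} expands to text only when VAR is set and non-empty, so an
# unset input contributes nothing and the container default stays in effect.
collect_overrides() {
    EXTRA_ARGS=""
    append_override "--base-image-tag" "${INPUT_BASE_IMAGE_TAG:-}"
    append_override "--build-arg" "${INPUT_CUDA_VERSION:+CUDA_VERSION=$INPUT_CUDA_VERSION}"
    append_override "--build-arg" "${INPUT_RUNTIME_IMAGE_TAG:+RUNTIME_IMAGE_TAG=$INPUT_RUNTIME_IMAGE_TAG}"
    append_override "--build-arg" "${INPUT_TORCH_BACKEND:+TORCH_BACKEND=$INPUT_TORCH_BACKEND}"
}

collect_overrides
# The result would then be passed through unquoted, as in the diff:
#   build.sh ... --sccache-region "$AWS_DEFAULT_REGION" $EXTRA_ARGS
```

Routing every input through one helper keeps the conditional logic in a single place, which may address the concern about many if statements without duplicating the container's defaults in the workflow.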
  --use-sccache \
  --sccache-bucket "$SCCACHE_S3_BUCKET" \
- --sccache-region "$AWS_DEFAULT_REGION"
+ --sccache-region "$AWS_DEFAULT_REGION" $EXTRA_ARGS
With reasonable defaults, we should be able to just do the below and delete the if statements. Maybe the torch arg being an exception?
- --sccache-region "$AWS_DEFAULT_REGION" $EXTRA_ARGS
+ --sccache-region "$AWS_DEFAULT_REGION" \
+ --base-image-tag ${{ inputs.base_image_tag }} \
+ --build-arg CUDA_VERSION=${{ inputs.cuda_version }} \
+ --build-arg RUNTIME_IMAGE_TAG=${{ inputs.runtime_image_tag }} \
+ --build-arg TORCH_BACKEND=${{ inputs.torch_backend }}
Overview:
We were using torch 2.7.1 because it has an aarch64 cu128 wheel. However, we started running into problems with the latest version bump, so we are switching to cu129, which simplifies the installs considerably.
Details:
Where should the reviewer start?
Related Issues: (use one of the action keywords Closes / Fixes / Resolves / Relates to)