Adding Linux CI runner setup docs #490

Open
wants to merge 6 commits into
base: main

Conversation

geomin12
Contributor

No description provided.

Member

This is more than documentation. It should be moved to somewhere more like https://github.com/ROCm/TheRock/tree/main/build_tools/github_action . I've used build_tools/github_actions/runner on other projects (note the plural, "GitHub Actions" is the branding: https://github.com/features/actions)

Comment on lines 3 to 8
# Add Docker's official GPG key:
sudo apt-get update
sudo apt-get install ca-certificates curl -y
sudo install -m 0755 -d /etc/apt/keyrings
sudo curl -fsSL https://download.docker.com/linux/ubuntu/gpg -o /etc/apt/keyrings/docker.asc
sudo chmod a+r /etc/apt/keyrings/docker.asc

Member

Please link to some official documentation for how this is the "official GPG key" or these are the recommended setup steps.

Comment on lines 12 to 13
"deb [arch=$(dpkg --print-architecture) signed-by=/etc/apt/keyrings/docker.asc] https://download.docker.com/linux/ubuntu \
$(. /etc/os-release && echo "${UBUNTU_CODENAME:-$VERSION_CODENAME}") stable" | \

Member

We may want to support more Linux distributions than Ubuntu at some point. Could put "ubuntu" in the script name somewhere.
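
A minimal sketch of how distribution detection could look if the scripts are split per distro. The function names and the `docker_install_ubuntu.sh` file name are hypothetical; only `/etc/os-release` and its `ID` field are standard.

```bash
# Hypothetical helper: print the distribution ID from an os-release file.
# The file path is a parameter so the function can be exercised without
# touching the real /etc/os-release.
detect_distro_id() {
  os_release_file="${1:-/etc/os-release}"
  # Source in a subshell so the file's variables do not leak into the caller.
  ( . "$os_release_file" && printf '%s\n' "${ID:-unknown}" )
}

# Hypothetical dispatch: map a distribution ID to a per-distro install script.
docker_install_script_for() {
  case "$1" in
    ubuntu) echo "docker_install_ubuntu.sh" ;;
    *)      echo "unsupported distribution: $1" >&2; return 1 ;;
  esac
}
```

With something like this, a wrapper could call `docker_install_script_for "$(detect_distro_id)"` and fail loudly on distributions that have no script yet.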

- If you have a different Linux distribution, follow [ROCm installation quick start guide](https://rocm.docs.amd.com/projects/install-on-linux/en/latest/install/quick-start.html)
- <b>After reboot, please try `rocminfo` and `rocm-smi` to make sure ROCm is loaded and drivers are installed.</b> If there are issues, please try each command in `rocm_install.sh` instead.

1. If docker is not installed, please run `sudo ./docker_install.sh`. This script will download docker for Ubuntu.

Member

Do all the commands in the script need sudo if you always run the script itself with sudo?
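
One possible shape for this, as a sketch only: check for root once at the top of the script and drop the per-command `sudo` prefixes. The function names are hypothetical and not taken from the PR.

```bash
# Succeeds only when the given UID is root (0). The UID is a parameter so the
# check can be tested without actually being root.
is_root_uid() {
  [ "$1" -eq 0 ]
}

# Intended for the top of an install script: fail early with a clear message.
require_root() {
  if ! is_root_uid "$(id -u)"; then
    echo "This script must be run as root, e.g. sudo $0" >&2
    return 1
  fi
}
```

A script that starts with `require_root || exit 1` would then not need `sudo` on the individual `apt-get`, `install`, and `curl` lines.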

Comment on lines 9 to 15
For brand new machines that do not that ROCm or Docker installed, please follow these steps. Otherwise, please skip to step 3.

1. Install ROCm to the machine using `sudo ./rocm_install.sh`. This script will install ROCm 6.4 and AMD drivers for Ubuntu24, then it will reboot the system.

- Rebooting the system is required to load ROCm.
- If you have a different Linux distribution, follow [ROCm installation quick start guide](https://rocm.docs.amd.com/projects/install-on-linux/en/latest/install/quick-start.html)
- <b>After reboot, please try `rocminfo` and `rocm-smi` to make sure ROCm is loaded and drivers are installed.</b> If there are issues, please try each command in `rocm_install.sh` instead.

Member

Why are we installing ROCm on machines? TheRock is ROCm, so we should be building whatever we need as part of our build/test/release workflows. If we need something specific for bootstrapping, let's extract that instead of pulling the full SDK down from some fixed older release.

Contributor Author

When setting up the scripts (particularly for ROCR_VISIBLE_DEVICES) and debugging the machines, it's quite useful to have rocminfo and rocm-smi around. However, ROCm is only used for those two commands; the rest of the install goes unused.

Should we just use a version of TheRock during setup, figure out which GPU is which, then remove it, so that the machines are left as a fresh system with no ROCm installed?

Member

I would go with the latter. We can pin a known green commit in a config file and either pull a release tarball or the specific artifacts that contain rocm-smi and rocminfo. We could also bundle only those two pre-built binaries in an extra package if it helps.
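
A rough sketch of reading such a pin, assuming a plain `KEY=VALUE` config file. The `THEROCK_COMMIT` key and the function name are hypothetical; nothing in this thread fixes the file format.

```bash
# Hypothetical: read a pinned TheRock commit from a KEY=VALUE config file.
# Prints the commit on success; fails if the value is not a full 40-character
# lowercase hex SHA, so a typo cannot silently select a different build.
read_pinned_commit() {
  commit="$(sed -n 's/^THEROCK_COMMIT=//p' "$1" | head -n 1)"
  case "$commit" in
    ""|*[!0-9a-f]*)
      echo "invalid or missing THEROCK_COMMIT in $1" >&2
      return 1
      ;;
  esac
  if [ "${#commit}" -ne 40 ]; then
    echo "THEROCK_COMMIT must be 40 hex characters" >&2
    return 1
  fi
  printf '%s\n' "$commit"
}
```

How the tarball or artifacts are then fetched for that commit is left open here, since it depends on where the release artifacts are published.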

Member

Can you elaborate on what is needed at runner setup time with visible devices?

In order of my preference:

  • No tool dependency for system setup, or use some standard Linux tooling
  • Minimal set of tools, bootstrapped from a stable release of TheRock
  • Minimal set of tools, bootstrapped from existing ROCm releases (maybe mirrored to S3)
  • apt install as here

I wouldn't trust test runners if we install ROCm (TheRock, community build) on top of an existing ROCm (non-community build) install. If we run setup on the host and then runners under Docker, that might be safer though.
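
For the first option (standard Linux tooling only), a sketch of listing AMD GPUs from sysfs by PCI vendor ID (`0x1002`). The function name is hypothetical. Whether DRM card order matches the indices that `ROCR_VISIBLE_DEVICES` expects is an open assumption that would need checking on a real machine.

```bash
# List DRM card names whose PCI vendor ID is AMD (0x1002), using only sysfs.
# The sysfs root is a parameter so the function can be tested on a fake tree.
list_amd_drm_cards() {
  drm_root="${1:-/sys/class/drm}"
  for vendor_file in "$drm_root"/card*/device/vendor; do
    [ -r "$vendor_file" ] || continue   # also skips an unmatched glob
    card="${vendor_file#"$drm_root"/}"  # e.g. card0/device/vendor
    card="${card%%/*}"                  # e.g. card0
    case "$card" in
      card*[!0-9]*) continue ;;         # skip connectors such as card0-DP-1
    esac
    if [ "$(cat "$vendor_file")" = "0x1002" ]; then
      printf '%s\n' "$card"
    fi
  done
}
```

This only answers "which cards are AMD"; it does not report the gfx target that `rocminfo` prints, so it may cover only part of what setup needs.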


# runner setup
mkdir "actions-runner-$1" && cd "actions-runner-$1"
curl -o actions-runner-linux-x64-2.323.0.tar.gz -L https://github.com/actions/runner/releases/download/v2.323.0/actions-runner-linux-x64-2.323.0.tar.gz

Member

Can you make the runner version an argument to the script, or do something to select the latest?
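
A sketch of the first option, with the version passed in rather than hard-coded. The function name is hypothetical; the URL layout is copied from the `curl` line under review.

```bash
# Build the download URL for a given actions/runner release.
# Arguments: version (without the leading "v") and, optionally, the platform
# suffix, which defaults to linux-x64 as in the script under review.
runner_download_url() {
  version="$1"
  platform="${2:-linux-x64}"
  printf 'https://github.com/actions/runner/releases/download/v%s/actions-runner-%s-%s.tar.gz\n' \
    "$version" "$platform" "$version"
}
```

The setup script could then run `curl -L -o actions-runner.tar.gz "$(runner_download_url "$2")"`, taking the version as a second argument. Selecting the latest release would additionally need a lookup of the newest tag, for example through the GitHub REST API's latest-release endpoint for `actions/runner`, which is not shown here.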

Comment on lines 10 to 11
# svc install
sudo ./svc.sh install root

Member

What is svc? Where is this svc.sh script coming from?


### Setup

For brand new machines that do not that ROCm or Docker installed, please follow these steps. Otherwise, please skip to step 3.

Member

Grammar/typo errors in this line, but also see my other comments about ROCm installs.

Comment on lines 26 to 29
```
cd {IDENTIFIER_FROM_STEP_3}
./config.sh --url https://github.com/ROCm --token {TOKEN} --no-default-labels --labels {LABEL}
```

Member

nit: set the language for fenced code blocks when known to help with syntax highlighting (auto detection sometimes guesses correctly)

Suggested change
```
cd {IDENTIFIER_FROM_STEP_3}
./config.sh --url https://github.com/ROCm --token {TOKEN} --no-default-labels --labels {LABEL}
```
```bash
cd {IDENTIFIER_FROM_STEP_3}
./config.sh --url https://github.com/ROCm --token {TOKEN} --no-default-labels --labels {LABEL}
```

1. After the runner packages are there, please follow these steps and run the commands:

- Please retrieve token from [ROCm GitHub runner page](https://github.com/organizations/ROCm/settings/actions/runners/new?arch=x64&os=linux) in the `Configure` tab.
- Please add an unique identifying label for this CI runner. Example: Linux gfx1201 -> label `linux-gfx1201-gpu-rocm`. This is the label that will be used in workflows and will be shared amongst other identical machines.

Member

Could provide a list of all current labels for reference instead of a single example. Maybe a link to the runner page would be sufficient, but a table with more information here would be useful for those without access to that page.

geomin12 requested a review from ScottTodd May 1, 2025 18:30

sudo apt install gfortran git git-lfs ninja-build cmake g++ pkg-config xxd patchelf automake python3-venv python3-dev libegl1-mesa-dev

# svc install
# This script comes from GitHub action runner tar file

Member

Maybe

Suggested change
# This script comes from GitHub action runner tar file
# This script comes from GitHub actions runner release tarball

? Furthermore, I thought that this only has runsvc.sh, but I might be wrong.

Contributor Author

That may have been a previous script! Currently, the script is named ./svc.sh (the name must have been updated?)

geomin12 requested a review from marbre May 2, 2025 16:56
amd-chrissosa self-assigned this May 8, 2025

Comment on lines +3 to +4
# ROCm install
wget https://repo.radeon.com/amdgpu-install/6.4/ubuntu/noble/amdgpu-install_6.4.60400-1_all.deb

Member

Marking "reviewed" until we resolve the rocm install questions. We could punt on that if the existing runners are already doing this and this is just checking in the configuration.

Labels
None yet
Projects
Status: TODO
4 participants