Adding Linux CI runner setup docs #490
This is more than documentation. It should be moved to somewhere more like https://github.com/ROCm/TheRock/tree/main/build_tools/github_action. I've used build_tools/github_actions/runner on other projects (note the plural, "GitHub Actions" is the branding: https://github.com/features/actions)
# Add Docker's official GPG key:
sudo apt-get update
sudo apt-get install ca-certificates curl -y
sudo install -m 0755 -d /etc/apt/keyrings
sudo curl -fsSL https://download.docker.com/linux/ubuntu/gpg -o /etc/apt/keyrings/docker.asc
sudo chmod a+r /etc/apt/keyrings/docker.asc
Please link to some official documentation for how this is the "official GPG key" or these are the recommended setup steps.
"deb [arch=$(dpkg --print-architecture) signed-by=/etc/apt/keyrings/docker.asc] https://download.docker.com/linux/ubuntu \
$(. /etc/os-release && echo "${UBUNTU_CODENAME:-$VERSION_CODENAME}") stable" | \
We may want to support more Linux distributions than Ubuntu at some point. Could put "ubuntu" in the script name somewhere.
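One way to leave room for other distributions is to detect the distribution from `/etc/os-release` and dispatch to a per-distro script. The following is only a sketch: the function names and the script name `docker_install_ubuntu.sh` are hypothetical and not part of this PR.

```shell
# Sketch: read ID from an os-release file and pick a per-distro install script.
# The file path is a parameter only so the function can be tried on a sample file.
detect_distro_id() {
  local os_release="${1:-/etc/os-release}"
  # Source in a subshell so os-release variables do not leak into the caller.
  ( . "$os_release" && echo "${ID:-unknown}" )
}

# Map a distro ID to a script name; docker_install_ubuntu.sh is a hypothetical name.
docker_install_script_for() {
  case "$1" in
    ubuntu) echo "docker_install_ubuntu.sh" ;;
    *) echo "unsupported distribution: $1" >&2; return 1 ;;
  esac
}
```

A caller could then run `script="$(docker_install_script_for "$(detect_distro_id)")"` and fail early on anything that is not yet supported.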
docs/ci_runner_setup/linux/README.md
Outdated
- If you have a different Linux distribution, follow [ROCm installation quick start guide](https://rocm.docs.amd.com/projects/install-on-linux/en/latest/install/quick-start.html)
- <b>After reboot, please try `rocminfo` and `rocm-smi` to make sure ROCm is loaded and drivers are installed.</b> If there are issues, please try each command in `rocm_install.sh` instead.

1. If docker is not installed, please run `sudo ./docker_install.sh`. This script will download docker for Ubuntu.
Do all the commands in the script need `sudo` if you always run the script itself with `sudo`?
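If the script is always invoked with `sudo`, one option is to check for root once at the top and drop the per-command `sudo`. A minimal sketch, assuming a helper named `require_root` (hypothetical, not in the PR):

```shell
# Sketch: check for root once, then run commands without per-line sudo.
# require_root takes the effective UID as an optional argument so it can be
# exercised without actually being root (that parameter is for illustration).
require_root() {
  local uid="${1:-$(id -u)}"
  if [ "$uid" -ne 0 ]; then
    echo "This script must be run as root, e.g. sudo $0" >&2
    return 1
  fi
}

# In the install script itself:
#   require_root || exit 1
#   apt-get update
#   apt-get install -y ca-certificates curl
```

The trade-off is that the script can no longer be run as a regular user with selective escalation, which seems acceptable if the README already tells people to run it with `sudo`.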
docs/ci_runner_setup/linux/README.md
Outdated
For brand new machines that do not that ROCm or Docker installed, please follow these steps. Otherwise, please skip to step 3.

1. Install ROCm to the machine using `sudo ./rocm_install.sh`. This script will install ROCm 6.4 and AMD drivers for Ubuntu24, then it will reboot the system.

- Rebooting the system is required to load ROCm.
- If you have a different Linux distribution, follow [ROCm installation quick start guide](https://rocm.docs.amd.com/projects/install-on-linux/en/latest/install/quick-start.html)
- <b>After reboot, please try `rocminfo` and `rocm-smi` to make sure ROCm is loaded and drivers are installed.</b> If there are issues, please try each command in `rocm_install.sh` instead.
Why are we installing ROCm on machines? TheRock is ROCm, so we should be building whatever we need as part of our build/test/release workflows. If we need something specific for bootstrapping, let's extract that instead of pulling the full SDK down from some fixed older release.
When setting up the scripts (particularly for ROCR_VISIBLE_DEVICES) and debugging the machines, it's quite useful to have rocminfo and rocm-smi around. However, ROCm is only used for those two commands; the rest of the install isn't used.
Should we just use a version of TheRock during setup, figure out which GPU is which, then remove it? That way the machines would have a fresh system with no ROCm installed.
I would go with the latter. We can pin a known green commit in a config file and either pull a release tarball or the specific artifacts that contain rocm-smi and rocminfo. We could also bundle only those two pre-built binaries in an extra package if it helps.
Can you elaborate on what is needed at runner setup time with visible devices?
In order of my preference:
- No tool dependency for system setup, or use some standard Linux tooling
- Minimal set of tools, bootstrapped from a stable release of TheRock
- Minimal set of tools, bootstrapped from existing ROCm releases (maybe mirrored to S3)
- apt install as here
I wouldn't trust test runners if we install ROCm (TheRock, community build) on top of an existing ROCm (non-community build) install. If we run setup on the host and then runners under Docker, that might be safer though.
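For the first option (standard Linux tooling only), the GPUs on a host can at least be enumerated from sysfs without any ROCm install. This is a sketch under assumptions: `list_amd_gpus` is a hypothetical helper, it reports PCI device IDs rather than gfx targets, and the card order it prints is not guaranteed to match the indices that ROCR_VISIBLE_DEVICES expects.

```shell
# Sketch: list AMD GPUs from sysfs without any ROCm tooling.
# 0x1002 is AMD's PCI vendor ID. The sysfs root is a parameter so the
# function can be pointed at a test directory.
list_amd_gpus() {
  local drm_root="${1:-/sys/class/drm}"
  local card vendor device
  for card in "$drm_root"/card*; do
    # Skip connector entries such as card0-DP-1 and unmatched globs.
    case "${card##*/}" in *-*) continue ;; esac
    [ -r "$card/device/vendor" ] || continue
    vendor="$(cat "$card/device/vendor")"
    [ "$vendor" = "0x1002" ] || continue
    device="$(cat "$card/device/device")"
    echo "${card##*/} $device"
  done
}
```

Whether this is enough depends on what the setup scripts need from rocminfo; if they need the gfx target name, a mapping from PCI device ID would still have to come from somewhere else.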
# runner setup
mkdir "actions-runner-$1" && cd "actions-runner-$1"
curl -o actions-runner-linux-x64-2.323.0.tar.gz -L https://github.com/actions/runner/releases/download/v2.323.0/actions-runner-linux-x64-2.323.0.tar.gz
Can you make the runner version an argument to the script, or do something to select the latest?
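A possible shape for this, sketched with hypothetical function names: build the download URL from a version argument (defaulting to the 2.323.0 pinned in the diff), and optionally look up the latest release tag through the GitHub REST API. The URL pattern is taken from the `curl` line above; the `sed` parsing of the API response is a simplification and `jq` would be sturdier if it is available on the runners.

```shell
# Sketch: build the runner download URL from a version argument
# instead of hardcoding 2.323.0.

# Default mirrors the version pinned in the diff above.
DEFAULT_RUNNER_VERSION="2.323.0"

# Print the release tarball URL for a given runner version (x64 Linux).
runner_tarball_url() {
  local version="${1:-$DEFAULT_RUNNER_VERSION}"
  version="${version#v}"  # accept both "2.323.0" and "v2.323.0"
  echo "https://github.com/actions/runner/releases/download/v${version}/actions-runner-linux-x64-${version}.tar.gz"
}

# Query the GitHub REST API for the latest release tag (needs network access).
latest_runner_version() {
  curl -fsSL https://api.github.com/repos/actions/runner/releases/latest \
    | sed -n 's/.*"tag_name": *"v\{0,1\}\([^"]*\)".*/\1/p' | head -n 1
}
```

The setup script could then call `curl -fsSL -o runner.tar.gz "$(runner_tarball_url "$2")"`, where using `$2` for the version is itself an assumption about how the script's arguments would be laid out.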
# svc install
sudo ./svc.sh install root
What is svc? Where is this svc.sh script coming from?
docs/ci_runner_setup/linux/README.md
Outdated
### Setup

For brand new machines that do not that ROCm or Docker installed, please follow these steps. Otherwise, please skip to step 3.
Grammar/typo errors in this line, but also see my other comments about ROCm installs.
docs/ci_runner_setup/linux/README.md
Outdated
```
cd {IDENTIFIER_FROM_STEP_3}
./config.sh --url https://github.com/ROCm --token {TOKEN} --no-default-labels --labels {LABEL}
```
nit: set the language for fenced code blocks when known to help with syntax highlighting (auto detection only sometimes guesses correctly)
```
cd {IDENTIFIER_FROM_STEP_3}
./config.sh --url https://github.com/ROCm --token {TOKEN} --no-default-labels --labels {LABEL}
```
```bash
cd {IDENTIFIER_FROM_STEP_3}
./config.sh --url https://github.com/ROCm --token {TOKEN} --no-default-labels --labels {LABEL}
```
docs/ci_runner_setup/linux/README.md
Outdated
1. After the runner packages are there, please follow these steps and run the commands:

- Please retrieve token from [ROCm GitHub runner page](https://github.com/organizations/ROCm/settings/actions/runners/new?arch=x64&os=linux) in the `Configure` tab.
- Please add an unique identifying label for this CI runner. Example: Linux gfx1201 -> label `linux-gfx1201-gpu-rocm`. This is the label that will be used in workflows and will be shared amongst other identical machines.
Could provide a list of all current labels for reference instead of a single example. Maybe a link to the runner page would be sufficient, but a table with more information here would be useful for those without access to that page.
sudo apt install gfortran git git-lfs ninja-build cmake g++ pkg-config xxd patchelf automake python3-venv python3-dev libegl1-mesa-dev

# svc install
# This script comes from GitHub action runner tar file
Maybe change
# This script comes from GitHub action runner tar file
to
# This script comes from GitHub actions runner release tarball
? Furthermore, I thought that this only has runsvc.sh, but I might be wrong.
That may have been a previous script! Currently, it has ./svc.sh as the script (the name must have been updated?)
# ROCm install
wget https://repo.radeon.com/amdgpu-install/6.4/ubuntu/noble/amdgpu-install_6.4.60400-1_all.deb
Marking "reviewed" until we resolve the rocm install questions. We could punt on that if the existing runners are already doing this and this is just checking in the configuration.
No description provided.