Adding Linux CI runner setup docs #490
base: main
Changes from 3 commits
Original file line number | Diff line number | Diff line change | ||||||||||||||
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
@@ -0,0 +1,78 @@ | ||||||||||||||||
## Linux CI runner setup | ||||||||||||||||
|
||||||||||||||||
This directory contains documentation and scripts for setting up a Linux CI runner for the [`ROCm`](https://github.com/ROCm) organization, used by the [`TheRock`](https://github.com/ROCm/TheRock) repository. | ||||||||||||||||
|
||||||||||||||||
Note: you must have sufficient permissions to access the [ROCm runner page](https://github.com/organizations/ROCm/settings/actions/runners). | ||||||||||||||||
|
||||||||||||||||
### Setup | ||||||||||||||||
|
||||||||||||||||
For brand new machines that do not have ROCm or Docker installed, please follow these steps. Otherwise, please skip to step 3. | ||||||||||||||||
|
||||||||||||||||
1. Install ROCm on the machine using `sudo ./rocm_install.sh`. This script installs ROCm 6.4 and the AMD drivers for Ubuntu 24.04 (noble), then reboots the system. | ||||||||||||||||
|
||||||||||||||||
- Rebooting the system is required to load ROCm. | ||||||||||||||||
- If you have a different Linux distribution, follow the [ROCm installation quick start guide](https://rocm.docs.amd.com/projects/install-on-linux/en/latest/install/quick-start.html). | ||||||||||||||||
- <b>After the reboot, please run `rocminfo` and `rocm-smi` to make sure ROCm is loaded and the drivers are installed.</b> If there are issues, please run each command in `rocm_install.sh` individually instead. | ||||||||||||||||
Why are we installing ROCm on machines? TheRock is ROCm, so we should be building whatever we need as part of our build/test/release workflows. If we need something specific for bootstrapping, let's extract that instead of pulling the full SDK down from some fixed older release.

When setting up the scripts (particularly for ROCR_VISIBLE_DEVICES) and debugging the machines, it's quite useful to have rocminfo and rocm-smi around. However, ROCm is only used for those commands; the rest of it isn't used. Should we just use a version of TheRock during setup, figure out which GPU is which, then remove it, so that the machines are left as a fresh system with no ROCm installed?

I would go with the latter. We can pin a known green commit in a config file and either pull a release tarball or the specific artifacts that contain

Can you elaborate on what is needed at runner setup time with visible devices? In order of my preference:

I wouldn't trust test runners if we install ROCm (TheRock, community build) on top of an existing ROCm (non-community build) install. If we run setup on the host and then runners under Docker, that might be safer though. |
||||||||||||||||
|
||||||||||||||||
1. If Docker is not installed, please run `sudo ./docker_install.sh`. This script installs Docker for Ubuntu. | ||||||||||||||||
Do all the commands in the script need |
||||||||||||||||
|
||||||||||||||||
1. After ROCm and Docker are installed, please run `sudo ./runner_setup_1.sh {IDENTIFIER}`. There may be multiple GPUs per system, so please choose an identifier that makes this runner unique and easy to recognize. The script creates the directory `actions-runner-{IDENTIFIER}`. Example: for a gfx1201 GPU, run `sudo ./runner_setup_1.sh gfx1201-gpu-1`. | ||||||||||||||||
|
||||||||||||||||
1. After the runner package has been downloaded and extracted, please follow these steps and run the commands: | ||||||||||||||||
|
||||||||||||||||
- Please retrieve the token from the [ROCm GitHub runner page](https://github.com/organizations/ROCm/settings/actions/runners/new?arch=x64&os=linux) in the `Configure` tab. | ||||||||||||||||
- Please add an identifying label for this type of CI runner. Example: Linux gfx1201 -> label `linux-gfx1201-gpu-rocm`. This is the label that workflows will use, and it is shared among identical machines. | ||||||||||||||||
Could provide a list of all current labels for reference instead of a single example. Maybe a link to the runner page would be sufficient, but a table with more information here would be useful for those without access to that page. |
||||||||||||||||
|
||||||||||||||||
``` | ||||||||||||||||
cd actions-runner-{IDENTIFIER_FROM_STEP_3} | ||||||||||||||||
./config.sh --url https://github.com/ROCm --token {TOKEN} --no-default-labels --labels {LABEL} | ||||||||||||||||
``` | ||||||||||||||||
nit: set the language for fenced code blocks when known to help with syntax highlighting (auto detection sometimes guesses correctly)
Suggested change
|
||||||||||||||||
|
||||||||||||||||
- During the config.sh setup step: | ||||||||||||||||
- `Default` is fine for runner group | ||||||||||||||||
- For "name of runner," please include a unique identifier for this runner. Example: for runner gfx1201, `linux-gfx1201-gpu-rocm-1`. A good practice is to use `{LABEL}-{ID}`. Remember, label != name of runner: there may be many gfx1201 machines sharing the label `linux-gfx1201-gpu-rocm`. | ||||||||||||||||
- `_work` is fine for work folder. | ||||||||||||||||
|
||||||||||||||||
1. After the `./config.sh` script has completed, please follow these steps and run the commands: | ||||||||||||||||
|
||||||||||||||||
- For your CI runner to run on a specific GPU, you will need to obtain the correct `ROCR_VISIBLE_DEVICES` value. | ||||||||||||||||
- To get this, please run `rocminfo` and find the `Node` of your GPU. Example: | ||||||||||||||||
|
||||||||||||||||
``` | ||||||||||||||||
******* | ||||||||||||||||
Agent 10 | ||||||||||||||||
******* | ||||||||||||||||
Name: gfx1201 | ||||||||||||||||
Marketing Name: AMD Instinct machine | ||||||||||||||||
Vendor Name: AMD | ||||||||||||||||
Node: 9 | ||||||||||||||||
``` | ||||||||||||||||
|
||||||||||||||||
- After getting the `Node`, please run `rocm-smi` and determine which `Device` corresponds to your `Node`. In this example, Node 9 maps to Device 5, so `ROCR_VISIBLE_DEVICES` is 5: | ||||||||||||||||
|
||||||||||||||||
``` | ||||||||||||||||
============================================ ROCm System Management Interface ============================================ | ||||||||||||||||
====================================================== Concise Info ====================================================== | ||||||||||||||||
Device Node IDs Temp Power Partitions SCLK MCLK Fan Perf PwrCap VRAM% GPU% | ||||||||||||||||
(DID, GUID) (Junction) (Socket) (Mem, Compute, ID) | ||||||||||||||||
========================================================================================================================== | ||||||||||||||||
5 9 0x0000, 00000 00.0°C 000.0W 0000, 000, 0 000Mhz 000Mhz 0% 0000 000.0W 0% 0% | ||||||||||||||||
========================================================================================================================== | ||||||||||||||||
================================================== End of ROCm SMI Log =================================================== | ||||||||||||||||
``` | ||||||||||||||||
|
||||||||||||||||
- Then run these commands with your correct `ROCR_VISIBLE_DEVICES` value: | ||||||||||||||||
|
||||||||||||||||
``` | ||||||||||||||||
cd actions-runner-{IDENTIFIER_FROM_STEP_3} | ||||||||||||||||
sudo ./runner_setup_2.sh {ROCR_VISIBLE_DEVICES} | ||||||||||||||||
``` | ||||||||||||||||
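
The Node-to-Device lookup in step 5 can be scripted. The following is a minimal sketch, not part of the PR: `device_for_node` is a hypothetical helper that reads `rocm-smi` concise output on stdin and assumes the first two columns of each data row are `Device` and `Node`, as in the example output above. If a ROCm release lays out the table differently, the `awk` pattern would need adjusting.

```shell
# Hypothetical helper (not part of the PR): print the rocm-smi Device index
# that belongs to a given rocminfo Node.
# Assumption: data rows of the concise table start with "<Device> <Node> ...".
device_for_node() {
  # Header and separator rows are skipped because their first field
  # is not a plain number.
  awk -v node="$1" '$1 ~ /^[0-9]+$/ && $2 == node { print $1 }'
}

# Usage on a machine with the ROCm tools installed:
#   rocm-smi | device_for_node 9
```

With the example table above, `rocm-smi | device_for_node 9` would print `5`, which is the value to pass to `runner_setup_2.sh`.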
|
||||||||||||||||
You are <b>done!</b> You can use your CI runner with `runs-on: {LABEL}` in GitHub workflows, and the runner will appear as "Idle" on your organization's runners page. | ||||||||||||||||
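
The prompts in step 4 (runner group, runner name, work folder) can probably also be answered on the command line. The sketch below only prints the command rather than running it. `runner_name` and `print_config_command` are hypothetical helpers, and the `--unattended`, `--name`, `--runnergroup` and `--work` options are assumed to be available in the runner version in use; `./config.sh --help` should confirm this.

```shell
# Hypothetical helper: build the runner name as {LABEL}-{ID}, following the
# naming convention from step 4.
runner_name() {
  printf '%s-%s\n' "$1" "$2"
}

# Hypothetical helper: print (do not run) a non-interactive config.sh command.
# Arguments: label, numeric ID, registration token.
print_config_command() {
  label="$1"; id="$2"; token="$3"
  printf './config.sh --unattended --url https://github.com/ROCm --token %s --no-default-labels --labels %s --name %s --runnergroup Default --work _work\n' \
    "$token" "$label" "$(runner_name "$label" "$id")"
}

# Example with the label from the README; {TOKEN} stays a placeholder.
print_config_command linux-gfx1201-gpu-rocm 1 '{TOKEN}'
```

Printing the command first makes it easy to review the label and runner name before registering the machine.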
|
||||||||||||||||
Appendix: | ||||||||||||||||
|
||||||||||||||||
- [Requirements for self hosted runners](https://github.com/shivammathur/setup-php/wiki/Requirements-for-self-hosted-runners) | ||||||||||||||||
- [Configuring the self-hosted runner application as a service](https://docs.github.com/en/actions/hosting-your-own-runners/managing-self-hosted-runners/configuring-the-self-hosted-runner-application-as-a-service) | ||||||||||||||||
- [ROCm quick start installation guide](https://rocm.docs.amd.com/projects/install-on-linux/en/latest/install/quick-start.html) | ||||||||||||||||
- [Docker install Ubuntu](https://docs.docker.com/engine/install/ubuntu/) |
This is more than documentation. It should be moved to somewhere more like https://github.com/ROCm/TheRock/tree/main/build_tools/github_action . I've used |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,17 @@ | ||
#!/bin/bash | ||
|
||
# Add Docker's official GPG key: | ||
sudo apt-get update | ||
sudo apt-get install ca-certificates curl -y | ||
sudo install -m 0755 -d /etc/apt/keyrings | ||
sudo curl -fsSL https://download.docker.com/linux/ubuntu/gpg -o /etc/apt/keyrings/docker.asc | ||
sudo chmod a+r /etc/apt/keyrings/docker.asc | ||
Please link to some official documentation for how this is the "official GPG key" or these are the recommended setup steps. |
||
|
||
# Add the repository to Apt sources: | ||
echo \ | ||
"deb [arch=$(dpkg --print-architecture) signed-by=/etc/apt/keyrings/docker.asc] https://download.docker.com/linux/ubuntu \ | ||
$(. /etc/os-release && echo "${UBUNTU_CODENAME:-$VERSION_CODENAME}") stable" | \ | ||
We may want to support more Linux distributions than Ubuntu at some point. Could put "ubuntu" in the script name somewhere. |
||
sudo tee /etc/apt/sources.list.d/docker.list > /dev/null | ||
sudo apt-get update | ||
|
||
sudo apt-get install docker-ce docker-ce-cli containerd.io docker-buildx-plugin docker-compose-plugin -y |
See my other comment. What specifically do we need from this script for CI runners that we don't build in TheRock? |
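
Since `docker_install.sh` only handles Ubuntu, one option is to fail early on other distributions. The sketch below is an illustration under assumptions, not part of the PR: `os_release_id` and `require_ubuntu` are hypothetical helpers that read the `ID` field from an os-release style file, the same file the script already sources for the codename.

```shell
# Hypothetical helper: print the ID field of an os-release style file.
# The file is sourced in a subshell so its variables do not leak.
os_release_id() {
  ( . "$1" && printf '%s\n' "$ID" )
}

# Hypothetical guard: return non-zero unless the distribution is Ubuntu.
# The file path is a parameter (default /etc/os-release) to ease testing.
require_ubuntu() {
  id="$(os_release_id "${1:-/etc/os-release}")"
  if [ "$id" != "ubuntu" ]; then
    echo "docker_install.sh supports Ubuntu only (detected: ${id:-unknown})" >&2
    return 1
  fi
}

# Possible use at the top of docker_install.sh:
#   require_ubuntu || exit 1
```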
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,19 @@ | ||
#!/bin/bash | ||
|
||
# ROCm install | ||
wget https://repo.radeon.com/amdgpu-install/6.4/ubuntu/noble/amdgpu-install_6.4.60400-1_all.deb | ||
sudo apt install ./amdgpu-install_6.4.60400-1_all.deb -y | ||
sudo apt update | ||
sudo apt install python3-setuptools python3-wheel -y | ||
sudo usermod -a -G render,video "$LOGNAME" # Add the current user to the render and video groups | ||
sudo apt install rocm -y | ||
|
||
# AMD driver install | ||
wget https://repo.radeon.com/amdgpu-install/6.4/ubuntu/noble/amdgpu-install_6.4.60400-1_all.deb | ||
sudo apt install ./amdgpu-install_6.4.60400-1_all.deb -y | ||
sudo apt update | ||
sudo apt install "linux-headers-$(uname -r)" "linux-modules-extra-$(uname -r)" -y | ||
sudo apt install amdgpu-dkms -y | ||
|
||
# A reboot is required to load ROCm and the drivers | ||
sudo systemctl reboot |
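
The README asks for a manual check with `rocminfo` and `rocm-smi` after the reboot. A minimal sketch of a first sanity check follows, assuming both tools are expected on `PATH`; `missing_tools` is a hypothetical helper, not part of the PR. It only confirms that the commands exist, so their output still has to be inspected to confirm the drivers are loaded.

```shell
# Hypothetical helper: print each of the given commands that is not on PATH.
missing_tools() {
  for tool in "$@"; do
    command -v "$tool" >/dev/null 2>&1 || printf '%s\n' "$tool"
  done
}

# After the reboot: report whether the two ROCm tools are available.
missing="$(missing_tools rocminfo rocm-smi)"
if [ -n "$missing" ]; then
  # $missing is left unquoted on purpose so the names print on one line.
  echo "Not found on PATH:" $missing >&2
else
  echo "rocminfo and rocm-smi are available"
fi
```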
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,7 @@ | ||
#!/bin/bash | ||
|
||
# runner setup | ||
mkdir "actions-runner-$1" && cd "actions-runner-$1" | ||
curl -o actions-runner-linux-x64-2.323.0.tar.gz -L https://github.com/actions/runner/releases/download/v2.323.0/actions-runner-linux-x64-2.323.0.tar.gz | ||
Can you make the runner version an argument to the script, or do something to select the latest? |
||
echo "0dbc9bf5a58620fc52cb6cc0448abcca964a8d74b5f39773b7afcad9ab691e19 actions-runner-linux-x64-2.323.0.tar.gz" | shasum -a 256 -c | ||
tar xzf ./actions-runner-linux-x64-2.323.0.tar.gz |
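
One way to make the runner version an argument is to derive the tarball name and download URL from it. The sketch below assumes the release URL pattern used in the script above holds for other versions. `runner_tarball`, `runner_url` and the second positional argument are hypothetical. The pinned SHA-256 is specific to 2.323.0, so any other version would need its own checksum.

```shell
# Hypothetical helper: tarball file name for a given runner version.
runner_tarball() {
  printf 'actions-runner-linux-x64-%s.tar.gz\n' "$1"
}

# Hypothetical helper: download URL for a given runner version, following
# the URL pattern used in runner_setup_1.sh.
runner_url() {
  printf 'https://github.com/actions/runner/releases/download/v%s/%s\n' \
    "$1" "$(runner_tarball "$1")"
}

# Assumption: the version is an optional second argument and defaults to
# the release pinned in this PR.
version="${2:-2.323.0}"
echo "Would download: $(runner_url "$version")"
```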
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,14 @@ | ||
#!/bin/bash | ||
|
||
# sudo enablement | ||
sudo usermod -a -G sudo "$(id -un)" | ||
echo "%sudo ALL = (ALL) NOPASSWD: ALL" | sudo tee -a /etc/sudoers | ||
|
||
# additional packages | ||
sudo apt install gfortran git git-lfs ninja-build cmake g++ pkg-config xxd patchelf automake python3-venv python3-dev libegl1-mesa-dev -y | ||
|
||
# svc install | ||
sudo ./svc.sh install root | ||
What is svc? Where is this |
||
echo "ROCR_VISIBLE_DEVICES=$1" >> .env | ||
|
||
sudo ./svc.sh start |
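
Because the script appends to `.env`, running it again adds a second `ROCR_VISIBLE_DEVICES` line. A minimal sketch of an idempotent alternative follows. `set_env_var` is a hypothetical helper, not part of the PR, and it assumes the key contains no regular-expression metacharacters.

```shell
# Hypothetical helper: set KEY=VALUE in an env file, replacing any existing
# definition of KEY instead of appending a duplicate.
# Arguments: file, key, value.
set_env_var() {
  file="$1"; key="$2"; value="$3"
  touch "$file"
  tmp="$(mktemp)"
  # Keep every line that does not define KEY. grep exits non-zero when it
  # selects nothing, which is fine here.
  grep -v "^${key}=" "$file" > "$tmp" || true
  printf '%s=%s\n' "$key" "$value" >> "$tmp"
  # Write back through the original path so file permissions are kept.
  cat "$tmp" > "$file"
  rm -f "$tmp"
}

# Possible use in runner_setup_2.sh:
#   set_env_var .env ROCR_VISIBLE_DEVICES "$1"
```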
Grammar/typo errors in this line, but also see my other comments about ROCm installs.