
Adding Linux CI runner setup docs #490


Open: wants to merge 6 commits into base: main

Changes from 3 commits
78 changes: 78 additions & 0 deletions docs/ci_runner_setup/linux/README.md
@@ -0,0 +1,78 @@
## Linux CI runner setup

This directory contains documentation and scripts for setting up a Linux CI runner for the [`ROCm`](https://github.com/ROCm) organization, used by the [`TheRock`](https://github.com/ROCm/TheRock) repository.

Note: you must have sufficient permissions to access the [ROCm runner page](https://github.com/organizations/ROCm/settings/actions/runners).

### Setup

For brand new machines that do not have ROCm or Docker installed, please follow these steps. Otherwise, please skip to step 3.
Member:

Grammar/typo errors in this line, but also see my other comments about ROCm installs.


1. Install ROCm on the machine using `sudo ./rocm_install.sh`. This script installs ROCm 6.4 and the AMD drivers for Ubuntu 24.04, then reboots the system.

- Rebooting the system is required to load ROCm.
- If you use a different Linux distribution, follow the [ROCm installation quick start guide](https://rocm.docs.amd.com/projects/install-on-linux/en/latest/install/quick-start.html).
- <b>After reboot, run `rocminfo` and `rocm-smi` to make sure ROCm is loaded and the drivers are installed.</b> If there are issues, try each command in `rocm_install.sh` individually.
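The post-reboot check above can be wrapped in a small sketch. This is my illustration, not one of the PR's scripts, and `check_rocm` is a hypothetical helper name:

```shell
# Illustrative post-reboot check, not part of the PR's scripts.
# check_rocm reports whether both ROCm tools respond, which is what the
# step above asks you to verify by hand.
check_rocm() {
  if rocminfo >/dev/null 2>&1 && rocm-smi >/dev/null 2>&1; then
    echo "ROCm loaded"
  else
    echo "ROCm not ready"
  fi
}
# Usage after reboot: check_rocm
```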
Member:

Why are we installing ROCm on machines? TheRock is ROCm, so we should be building whatever we need as part of our build/test/release workflows. If we need something specific for bootstrapping, let's extract that instead of pulling the full SDK down from some fixed older release.

Contributor Author:

When setting up the scripts (particularly for ROCR_VISIBLE_DEVICES) and debugging the machines, it's quite useful to have rocminfo and rocm-smi around. However, they are only used for those commands, and the rest of the ROCm install goes unused.

Should we just use a version of TheRock during setup, figure out which GPU is which, then remove it? That way the machines would have a fresh system with no ROCm installed.

Member:

I would go with the latter. We can pin a known green commit in a config file and either pull a release tarball or the specific artifacts that contain rocm-smi and rocminfo. We could also bundle only those two pre-built binaries in an extra package if it helps.

Member:

Can you elaborate on what is needed at runner setup time with visible devices?

In order of my preference:

- No tool dependency for system setup, or use some standard Linux tooling
- Minimal set of tools, bootstrapped from a stable release of TheRock
- Minimal set of tools, bootstrapped from existing ROCm releases (maybe mirrored to S3)
- `apt install` as here

I wouldn't trust test runners if we install ROCm (TheRock, community build) on top of an existing ROCm (non-community build) install. If we run setup on the host and then runners under Docker, that might be safer though.
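The "minimal set of tools" option discussed above could look roughly like the sketch below. Everything here is hypothetical: the URL scheme, the `pinned_release.txt` file, and the tarball layout are invented for illustration; only the pin-then-extract-two-files pattern is the point.

```shell
# Hypothetical bootstrap sketch (none of these URLs or files are in the PR):
# read a pinned known-green release from a config file and extract only
# rocminfo and rocm-smi from its tarball, keeping the host otherwise clean.
fetch_rocm_tools() {
  local url="$1"
  # Extract just the two tools from the downloaded archive.
  curl -fL "$url" | tar xzf - rocminfo rocm-smi
}
# Usage (placeholder URL scheme, not a real TheRock artifact location):
#   tag="$(cat pinned_release.txt)"
#   fetch_rocm_tools "https://example.com/TheRock/rocm-tools-${tag}.tar.gz"
```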


1. If Docker is not installed, please run `sudo ./docker_install.sh`. This script installs Docker for Ubuntu.
Member:

Do all the commands in the script need sudo if you always run the script itself with sudo?


1. After ROCm and Docker are installed, please run `sudo ./runner_setup_1.sh {IDENTIFIER}`. There may be multiple GPUs per system, so add an identifier that makes this runner unique and easily understood. Example: for a gfx1201 GPU, `sudo ./runner_setup_1.sh gfx1201-gpu-1`.

1. After the runner package has been downloaded and extracted, please follow these steps and run the commands:

- Please retrieve a token from the [ROCm GitHub runner page](https://github.com/organizations/ROCm/settings/actions/runners/new?arch=x64&os=linux) under the `Configure` tab.
- Please add a unique identifying label for this CI runner. Example: for Linux gfx1201, use the label `linux-gfx1201-gpu-rocm`. This label is used in workflows and is shared among identical machines.
Member:

Could provide a list of all current labels for reference instead of a single example. Maybe a link to the runner page would be sufficient, but a table with more information here would be useful for those without access to that page.


```
cd {IDENTIFIER_FROM_STEP_3}
./config.sh --url https://github.com/ROCm --token {TOKEN} --no-default-labels --labels {LABEL}
```
Member:

nit: set the language for fenced code blocks when known to help with syntax highlighting (auto-detection only sometimes guesses correctly)

Suggested change
```
cd {IDENTIFIER_FROM_STEP_3}
./config.sh --url https://github.com/ROCm --token {TOKEN} --no-default-labels --labels {LABEL}
```
```bash
cd {IDENTIFIER_FROM_STEP_3}
./config.sh --url https://github.com/ROCm --token {TOKEN} --no-default-labels --labels {LABEL}
```

- During the `config.sh` setup step:
  - `Default` is fine for the runner group.
  - For the runner name, include a unique identifier for this runner. Example: for a gfx1201 runner, `linux-gfx1201-gpu-rocm-1`. A good practice is `{LABEL}-{ID}`. Remember, the label is not the runner name; many gfx1201 machines may share the label `linux-gfx1201-gpu-rocm`.
  - `_work` is fine for the work folder.
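The label-versus-name convention can be sketched as follows (the label and ID values are just the examples from above):

```shell
# Naming convention sketch: the label is shared, the name is unique.
LABEL="linux-gfx1201-gpu-rocm"  # shared by all identical machines
ID="1"                          # unique per machine
RUNNER_NAME="${LABEL}-${ID}"
echo "$RUNNER_NAME"             # prints linux-gfx1201-gpu-rocm-1
```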

1. After the `./config.sh` script has completed, please follow these steps and run the commands:

- For your CI runner to run on a specific GPU, you need the correct value for `ROCR_VISIBLE_DEVICES`.
- To get it, run `rocminfo` and determine which `Node` your GPU is on. Example:

```
*******
Agent 10
*******
Name: gfx1201
Marketing Name: AMD Instinct machine
Vendor Name: AMD
Node: 9
```

- After finding the `Node`, run `rocm-smi` and determine which `Device` corresponds to your `Node`. In this example, `ROCR_VISIBLE_DEVICES` is 5:

```
============================================ ROCm System Management Interface ============================================
====================================================== Concise Info ======================================================
Device Node IDs Temp Power Partitions SCLK MCLK Fan Perf PwrCap VRAM% GPU%
(DID, GUID) (Junction) (Socket) (Mem, Compute, ID)
==========================================================================================================================
5 9 0x0000, 00000 00.0°C 000.0W 0000, 000, 0 000Mhz 000Mhz 0% 0000 000.0W 0% 0%
==========================================================================================================================
================================================== End of ROCm SMI Log ===================================================
```

- Then run these commands with your correct `ROCR_VISIBLE_DEVICES` value:

```
cd {IDENTIFIER_FROM_STEP_3}
sudo ./runner_setup_2.sh {ROCR_VISIBLE_DEVICES}
```
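The Node-to-Name lookup described above can also be scripted. `list_gpu_nodes` is a rough helper of my own (not part of the PR) that pairs each agent's `Node` with its `Name` from `rocminfo`-style output:

```shell
# Rough helper (not in the PR): print "Node <n>: <name>" for each agent in
# rocminfo-style output, to cross-reference against rocm-smi's Node column.
list_gpu_nodes() {
  awk '/^ *Name:/ { name = $2 }
       /^ *Node:/ { print "Node " $2 ": " name }'
}
# Usage on a real machine: rocminfo | list_gpu_nodes
```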

You are <b>done!</b> You can target this CI runner with `runs-on: {LABEL}` in GitHub workflows, and the runner will show as "Idle" on your organization's runners page.

Appendix:

- [Requirements for self hosted runners](https://github.com/shivammathur/setup-php/wiki/Requirements-for-self-hosted-runners)
- [Configuring the self-hosted runner application as a service](https://docs.github.com/en/actions/hosting-your-own-runners/managing-self-hosted-runners/configuring-the-self-hosted-runner-application-as-a-service)
- [ROCm quick start installation guide](https://rocm.docs.amd.com/projects/install-on-linux/en/latest/install/quick-start.html)
- [Docker install Ubuntu](https://docs.docker.com/engine/install/ubuntu/)
17 changes: 17 additions & 0 deletions docs/ci_runner_setup/linux/docker_install.sh
Member:

This is more than documentation. It should be moved somewhere more like https://github.com/ROCm/TheRock/tree/main/build_tools/github_action . I've used build_tools/github_actions/runner on other projects (note the plural; "GitHub Actions" is the branding: https://github.com/features/actions)

@@ -0,0 +1,17 @@
#!/bin/bash

# Add Docker's official GPG key:
sudo apt-get update
sudo apt-get install ca-certificates curl -y
sudo install -m 0755 -d /etc/apt/keyrings
sudo curl -fsSL https://download.docker.com/linux/ubuntu/gpg -o /etc/apt/keyrings/docker.asc
sudo chmod a+r /etc/apt/keyrings/docker.asc
Member:

Please link to some official documentation for how this is the "official GPG key" or these are the recommended setup steps.


# Add the repository to Apt sources:
echo \
"deb [arch=$(dpkg --print-architecture) signed-by=/etc/apt/keyrings/docker.asc] https://download.docker.com/linux/ubuntu \
$(. /etc/os-release && echo "${UBUNTU_CODENAME:-$VERSION_CODENAME}") stable" | \
Member:

We may want to support more Linux distributions than Ubuntu at some point. Could put "ubuntu" in the script name somewhere.

sudo tee /etc/apt/sources.list.d/docker.list > /dev/null
sudo apt-get update

sudo apt-get install docker-ce docker-ce-cli containerd.io docker-buildx-plugin docker-compose-plugin -y
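A quick way to confirm the install worked (my addition, not part of the PR's script); `check_docker` is a hypothetical helper name:

```shell
# Illustrative post-install check, not part of the PR's script: verify the
# CLI can reach the Docker daemon.
check_docker() {
  if docker info >/dev/null 2>&1; then
    echo "docker ok"
  else
    echo "docker not ready"
  fi
}
# Usage: check_docker   (may need sudo, or add the user to the docker group)
```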
19 changes: 19 additions & 0 deletions docs/ci_runner_setup/linux/rocm_install.sh
Member:

See my other comment. What specifically do we need from this script for CI runners that we don't build in TheRock?

@@ -0,0 +1,19 @@
#!/bin/bash

# ROCm install
wget https://repo.radeon.com/amdgpu-install/6.4/ubuntu/noble/amdgpu-install_6.4.60400-1_all.deb
sudo apt install ./amdgpu-install_6.4.60400-1_all.deb -y
sudo apt update
sudo apt install python3-setuptools python3-wheel -y
sudo usermod -a -G render,video $LOGNAME # Add the current user to the render and video groups
sudo apt install rocm -y

# AMD driver install (the amdgpu-install package was already downloaded and installed above)
sudo apt update
sudo apt install "linux-headers-$(uname -r)" "linux-modules-extra-$(uname -r)" -y
sudo apt install amdgpu-dkms -y

# reboot is required to load ROCm and the driver
sudo systemctl reboot
7 changes: 7 additions & 0 deletions docs/ci_runner_setup/linux/runner_setup_1.sh
@@ -0,0 +1,7 @@
#!/bin/bash

# runner setup
mkdir "actions-runner-$1" && cd "actions-runner-$1"
curl -o actions-runner-linux-x64-2.323.0.tar.gz -L https://github.com/actions/runner/releases/download/v2.323.0/actions-runner-linux-x64-2.323.0.tar.gz
Member:

Can you make the runner version an argument to the script, or do something to select the latest?

echo "0dbc9bf5a58620fc52cb6cc0448abcca964a8d74b5f39773b7afcad9ab691e19 actions-runner-linux-x64-2.323.0.tar.gz" | shasum -a 256 -c
tar xzf ./actions-runner-linux-x64-2.323.0.tar.gz
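The reviewer's version-as-argument suggestion could look like this sketch (hypothetical, not applied in the PR; note the pinned sha256 would also need to vary with the version):

```shell
# Sketch of making the runner version an argument (not applied in the PR).
# $1 stays the identifier; $2 optionally overrides the pinned version.
# Caveat: the sha256 pin in the script only matches the default version.
RUNNER_VERSION="${2:-2.323.0}"
TARBALL="actions-runner-linux-x64-${RUNNER_VERSION}.tar.gz"
echo "$TARBALL"   # with no override: actions-runner-linux-x64-2.323.0.tar.gz
```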
14 changes: 14 additions & 0 deletions docs/ci_runner_setup/linux/runner_setup_2.sh
@@ -0,0 +1,14 @@
#!/bin/bash

# sudo enablement
sudo usermod -a -G sudo "$(id -un)"
echo "%sudo ALL = (ALL) NOPASSWD: ALL" | sudo tee -a /etc/sudoers

# additional packages
sudo apt install gfortran git git-lfs ninja-build cmake g++ pkg-config xxd patchelf automake python3-venv python3-dev libegl1-mesa-dev -y

# install the runner as a service (svc.sh ships with the actions-runner package)
sudo ./svc.sh install root
Member:

What is svc? Where is this svc.sh script coming from?

echo "ROCR_VISIBLE_DEVICES=$1" >> .env

sudo ./svc.sh start