🏃 Migrate CI to self-hosted runners #3174

qgallouedec · 2025-03-28T17:27:41Z

This PR:

Migrates our CI to a self-hosted runner. This lets us test on more relevant hardware, since the runner has 1 GPU (instead of CPU only).
Drops support for Windows. Reasons:

Windows is a recurring source of problems in our CI and takes up a lot of our time.
According to pypistats, it accounts for 3% of TRL usage.
Most optional dependencies don't support Windows anyway.
Note that this doesn't mean TRL no longer works on Windows; we simply no longer test it or officially support it.

🏎️💨 2.5x faster
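
For readers who want tests that use the GPU on the new runner but still pass on a CPU-only machine, here is a minimal sketch of a skip decorator. The names `cuda_available`, `require_gpu`, and `ExampleGpuTest` are hypothetical and are not taken from this PR; the only assumption is that PyTorch, when installed, exposes `torch.cuda.is_available()`.

```python
import unittest


def cuda_available() -> bool:
    """Return True only if torch can be imported and reports a usable CUDA device."""
    try:
        import torch
    except ImportError:
        # No PyTorch installed: treat as "no GPU" rather than failing at import time.
        return False
    return torch.cuda.is_available()


# Hypothetical decorator name, for illustration only (not the helper used in TRL).
require_gpu = unittest.skipUnless(cuda_available(), "test requires a CUDA GPU")


class ExampleGpuTest(unittest.TestCase):
    @require_gpu
    def test_tensor_on_gpu(self):
        # Only reached when a CUDA device is available.
        import torch

        x = torch.ones(2, device="cuda")
        self.assertEqual(x.device.type, "cuda")
```

On a CPU-only machine the test is reported as skipped instead of failing, so the same suite can run locally and on the 1-GPU runner.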

HuggingFaceDocBuilderDev · 2025-03-28T17:32:49Z

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

qgallouedec · 2025-03-29T00:32:43Z

.github/workflows/tests.yml

-        with:
-          fetch-depth: 0
-          submodules: recursive


We don't need this (full git history and recursive submodule checkout).

qgallouedec · 2025-03-29T00:33:20Z

.github/workflows/tests.yml

+    defaults:
+      run:
+        shell: bash


bash is needed; otherwise the step uses sh, which doesn't have source

qgallouedec · 2025-03-29T00:33:58Z

.github/workflows/tests.yml

-          cache: "pip"
-          cache-dependency-path: |
-              setup.py
-              requirements.txt


Not compatible with the self-hosted runner (I don't think it's very useful anyway)

qgallouedec · 2025-03-29T00:34:26Z

.github/workflows/tests.yml

+        run: |
+          apt-get update && apt-get install -y make git curl
+
+      - name: Install uv


Much faster installation with uv

qgallouedec · 2025-03-29T05:28:18Z

tests/test_grpo_trainer.py

                self.assertFalse(torch.equal(param, new_param), f"Parameter {n} has not changed.")

-    @unittest.skipIf(sys.platform.startswith("win"), "Skipping on Windows")  # compiling seems to be broken on Windows
-    def test_training_torch_compile(self):


this test was veeeery slow
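
The PR removes this test outright. An alternative that is not used in this PR is to keep slow tests in the suite but only run them on request. The sketch below assumes a hypothetical `RUN_SLOW` environment variable and hypothetical names (`slow_tests_enabled`, `slow`, `ExampleSlowTest`).

```python
import os
import unittest


def slow_tests_enabled(env=None) -> bool:
    """Return True when the (assumed) RUN_SLOW variable is set to a truthy value."""
    env = os.environ if env is None else env
    return env.get("RUN_SLOW", "0").lower() in {"1", "true", "yes"}


# Hypothetical decorator: skips the test unless slow tests were explicitly requested.
slow = unittest.skipUnless(slow_tests_enabled(), "slow test; set RUN_SLOW=1 to run")


class ExampleSlowTest(unittest.TestCase):
    @slow
    def test_expensive_path(self):
        # Stand-in for an expensive check such as a compiled training run.
        self.assertEqual(sum(range(10)), 45)
```

The trade-off is that the slow path stays covered in an occasional job while the default CI run stays fast.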

qgallouedec · 2025-03-29T18:56:31Z

I'll merge now because the CI is currently broken, but I'm still interested in your remarks, if any.


Create dummy self-hosted test

4de3879

qgallouedec added 28 commits March 28, 2025 17:33

without container

0ceef92

disable the rest of the CI

22ebd84

some debugs

fcc0778

debug

651478d

debug

8696815

show gpus

d454731

multi gpu?

508b880

aws-g4dn-12xlarge

4e1a19f

large-cache

02a9e56

aws-g4dn-2xlarge-cache

fe63d5e

2x large

62a896f

aws-g4dn-12xlarge

94ee0fb

what are all these runners?

0293a56

aws

cbc2104

fix runner name

e456cca

only few of them

52658a9

cache

fab0ccd

other runners

51f5354

stick to 1 GPU

00d3d02

test

faf9b59

uv

d72acc4

install uv

5e892d4

system

a973907

comment

4cb98fd

comment

b90ec13

container

cfd128f

cudnn runtime

4b8ec15

install make

fe7f7c2

qgallouedec requested review from kashif, lewtun, rtrompier and shirinyamani March 29, 2025 00:31


qgallouedec added 11 commits March 29, 2025 03:23

trl env

c0b72ab

dummy

ce7121c

minimal

3a63250

use torch image

45ee7d8

gpu all

09c6c52

use torch image

092841c

test require

5fe423d

fix test require

819639e

fix some tests

c3e1233

remove very slow test

f5ef9f0

setuptools wheel

9b4ec34


rtrompier requested a review from glegendre01 March 29, 2025 08:54

kashif approved these changes Mar 29, 2025


kashif mentioned this pull request Mar 29, 2025

❤️‍🩹 [CI] fix transformers dev CI failure #3176

Merged

qgallouedec merged commit 2fe2337 into main Mar 29, 2025
9 of 10 checks passed

qgallouedec deleted the migrate-ci-1 branch March 29, 2025 18:56

kashif pushed a commit to kashif/trl that referenced this pull request Mar 31, 2025

🏃 Migrate CI to self-hosted runners (huggingface#3174)

52e3f9f

qgallouedec mentioned this pull request Apr 1, 2025

📉 Add learning_rate argument to _maybe_log_save_evaluate #3206

Merged

yxliu-TAMU pushed a commit to mincheolseong/ECEN743-GRPO-Project-Proposal that referenced this pull request Apr 20, 2025

🏃 Migrate CI to self-hosted runners (huggingface#3174)

5566f1b
