Conversation

@martinpitt (Member) commented on Jan 9, 2026:

Hopefully reproduces this mess from #22373. No changes for the first run, I want to see it fail.

We are seeing these problems even in PRs without affected "expensive" tests.

@martinpitt added the no-test label ("For doc/workflow changes, or experiments which don't need a full CI run") on Jan 9, 2026
@martinpitt (Member, Author) commented:

This initial run and a forced retry both have a few retries due to VM boot failures.

@martinpitt (Member, Author) commented:

First run with the fixed scheduler looks promising -- two real flakes, plus check-testlib failures, which are a real bug (fixing now), but no VM boot failures!

@martinpitt force-pushed the scheduler branch 4 times, most recently from a9e684f to a264683, on Jan 10, 2026 at 19:34
@martinpitt (Member, Author) commented:

This run was using 13 parallel slots, as I was refining the memory estimates. Turns out they are very good now!

RAM analysis:

  • Budgeted: 28.0 GB for 14 tests
  • Actual peak: 27.3 GB consumed (min available was 2.9 GB of 30.2 GB total)
  • Overestimation: Only 0.7 GB (2%)

So that confirms my recent suspicion that with 8 parallel tests we are not RAM bound, as I had always assumed. Instead, booting many VMs in parallel is just too slow and times out, so we are CPU/load bound. The recent commit already throttled the destructive tests by load, but it wasn't enough yet, and 13 parallel non-destructive (ND) tests are definitely too much.
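
For illustration, here is a minimal sketch of the kind of per-test RAM budgeting discussed above. The names (`mem_available_gb`, `can_start`, `reserve_gb`) are made up for this example and are not the actual scheduler API:

```python
# Sketch of RAM-budget gating, assuming a per-test memory estimate in GB.
# Not the cockpit scheduler code; names and the 2 GB reserve are placeholders.

def mem_available_gb() -> float:
    """Return MemAvailable from /proc/meminfo, converted from kB to GB."""
    with open("/proc/meminfo") as f:
        for line in f:
            if line.startswith("MemAvailable:"):
                return int(line.split()[1]) / (1024 * 1024)
    raise RuntimeError("MemAvailable not found in /proc/meminfo")

def can_start(ram_estimate_gb: float, reserve_gb: float = 2.0) -> bool:
    """Admit a new test VM only if its RAM budget plus a small host
    reserve still fits into the currently available memory."""
    return ram_estimate_gb + reserve_gb <= mem_available_gb()
```

With estimates only 0.7 GB (2%) above the actual peak, a gate like this would hardly ever block, which matches the conclusion that RAM is not the bottleneck here.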

Load Analysis:

  • Max load: 31.49 on 16 CPUs (almost 2× overload!)
  • Load at ND start: 0.93 → climbed to 12.3 during ND phase → peaked at 31.49 during destructive phase
  • Problem: 1-min loadavg is a trailing average - it started low (0.93), so all 14 VMs launched. By the time load climbed to 31.49, tests were already running.

The load throttling from the current top commit a264683 didn't trigger at all.
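
To illustrate why it never triggered, here is a hedged sketch of a loadavg-based gate; the names and the per-CPU threshold are assumptions, not the PR's actual values:

```python
import os

NCPUS = os.cpu_count() or 1

def load_allows_start(threshold_per_cpu: float = 1.0) -> bool:
    """Gate new test starts on the 1-minute load average.

    Because loadavg is a trailing average, it is still low right after the
    run starts (0.93 in the run above), so a gate like this lets all VMs
    launch and only reflects the overload once the tests are already running.
    """
    load1, _, _ = os.getloadavg()
    return load1 < threshold_per_cpu * NCPUS
```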

Analysis:

Peak loads:

  • During ND phase (14 VMs): Load peaked at 31.49 🔥
  • During destructive phase: Max load was only 15.67 (never hit the 16.0 threshold)

Load by VM count during destructive:

  • 7-8 VMs: avg ~9.5, max 15.67
  • 9 VMs: avg 4.91 (only briefly)

@martinpitt changed the title from "test: Fix scheduler for running OOM" to "test: Fix scheduler for CPU/RAM starvation" on Jan 11, 2026
This should cause 3x affected retries.
Parallelism is now purely dynamic, which makes so much more sense!

The load throttling blocked test 166 when the load was 12.41, above the
12.0 threshold, but then all running tests completed (0 tests running).
The scheduler exited thinking it was done, even though there were still
84 destructive tests left in the queue.

When load is too high but no tests are running, the scheduler has no way
to make progress and exits prematurely.
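
A minimal sketch of the progress guarantee implied by that description, with made-up names (`pick_next`, `max_load`) and the 12.0 threshold borrowed from the numbers quoted above:

```python
import os

def pick_next(queue: list, running: set, max_load: float = 12.0):
    """Pick the next test to start, or None to wait.

    The key difference from the broken behaviour described above: if
    nothing is running, start a test even when the load is above the
    threshold, so the scheduler can always make progress instead of
    exiting with tests still queued.
    """
    if not queue:
        return None  # really done
    if not running or os.getloadavg()[0] < max_load:
        return queue.pop(0)
    return None  # throttled: caller should sleep and retry
```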
@martinpitt (Member, Author) commented:

For $deity's sake, can this get green at all? Putting back the load check in addition to running_procs, and lowering the numbers even more. This is getting ridiculously inefficient, although surprisingly it only took 64 minutes -- not really longer than the previous approaches.
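
As a rough sketch of what "load check in addition to running_procs" could look like (the thresholds below are placeholders, not the values that ended up in this PR):

```python
import os

MAX_RUNNING = 8        # placeholder caps, not the PR's final numbers
MAX_LOAD_1MIN = 12.0

def may_start_test(running_procs: int) -> bool:
    """Combine both throttles: a hard cap on concurrently running test
    processes and a back-off while the 1-minute load average is high.
    Always allow a start when nothing is running, to guarantee progress."""
    if running_procs == 0:
        return True
    if running_procs >= MAX_RUNNING:
        return False
    return os.getloadavg()[0] < MAX_LOAD_1MIN
```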
