Conversation

@martinpitt (Member) commented on Jan 9, 2026:

Hopefully reproduces this mess from #22373. No changes for the first run, I want to see it fail.

We are seeing these problems even in PRs without affected "expensive" tests.

@martinpitt added the no-test label ("For doc/workflow changes, or experiments which don't need a full CI run") on Jan 9, 2026
@martinpitt (Member, Author) commented:

This initial run and a forced retry both have a few retries due to VM boot failures.

@martinpitt (Member, Author) commented:

First run with the fixed scheduler looks promising -- two real flakes, plus check-testlib failures, which are a real bug (fixing now), but no VM boot failures!

@martinpitt force-pushed the scheduler branch 4 times, most recently from a9e684f to a264683, on Jan 10, 2026 at 19:34
@martinpitt (Member, Author) commented:

This run was using 13 parallel slots, as I was refining the memory estimates. Turns out they are very good now!

RAM analysis:

  • Budgeted: 28.0 GB for 14 tests
  • Actual peak: 27.3 GB consumed (min available was 2.9 GB of 30.2 GB total)
  • Overestimation: Only 0.7 GB (2%)

So that confirms my recent suspicion that with 8 parallel tests we are not RAM bound, as I had always assumed. Instead, booting many VMs in parallel is just too slow and times out, so we are CPU/load bound. The recent commit already throttled the destructive tests by load, but it wasn't enough yet, and 13 parallel non-destructive (ND) tests are definitely too much.
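
For illustration, here is a minimal sketch of the kind of per-test RAM budgeting discussed above. The names (`mem_available_gb`, `can_start`, `reserve_gb`) are made up for this example and are not the actual scheduler API:

```python
# Sketch of RAM-budget gating, assuming a per-test memory estimate in GB.
# Not the cockpit scheduler code; names and the 2 GB reserve are placeholders.

def mem_available_gb() -> float:
    """Return MemAvailable from /proc/meminfo, converted from kB to GB."""
    with open("/proc/meminfo") as f:
        for line in f:
            if line.startswith("MemAvailable:"):
                return int(line.split()[1]) / (1024 * 1024)
    raise RuntimeError("MemAvailable not found in /proc/meminfo")

def can_start(ram_estimate_gb: float, reserve_gb: float = 2.0) -> bool:
    """Admit a new test VM only if its RAM budget plus a small host
    reserve still fits into the currently available memory."""
    return ram_estimate_gb + reserve_gb <= mem_available_gb()
```

With estimates only 0.7 GB (2%) above the actual peak, a gate like this would hardly ever block, which matches the conclusion that RAM is not the bottleneck here.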

Load Analysis:

  • Max load: 31.49 on 16 CPUs (almost 2× overload!)
  • Load at ND start: 0.93 → climbed to 12.3 during ND phase → peaked at 31.49 during destructive phase
  • Problem: 1-min loadavg is a trailing average - it started low (0.93), so all 14 VMs launched. By the time load climbed to 31.49, tests were already running.

The load throttling from the current top commit a264683 didn't trigger at all.
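
To illustrate why it never triggered, here is a hedged sketch of a loadavg-based gate; the names and the per-CPU threshold are assumptions, not the PR's actual values:

```python
import os

NCPUS = os.cpu_count() or 1

def load_allows_start(threshold_per_cpu: float = 1.0) -> bool:
    """Gate new test starts on the 1-minute load average.

    Because loadavg is a trailing average, it is still low right after the
    run starts (0.93 in the run above), so a gate like this lets all VMs
    launch and only reflects the overload once the tests are already running.
    """
    load1, _, _ = os.getloadavg()
    return load1 < threshold_per_cpu * NCPUS
```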

Analysis:

Peak loads:

  • During ND phase (14 VMs): Load peaked at 31.49 🔥
  • During destructive phase: Max load was only 15.67 (never hit the 16.0 threshold)

Load by VM count during destructive:

  • 7-8 VMs: avg ~9.5, max 15.67
  • 9 VMs: avg 4.91 (only briefly)

@martinpitt changed the title from "test: Fix scheduler for running OOM" to "test: Fix scheduler for CPU/RAM starvation" on Jan 11, 2026
This should cause 3x affected retries.
Parallelism is now purely dynamic, which makes so much more sense!

The load throttling blocked test 166 when the load was 12.41, above the
12.0 threshold, but then all running tests completed (0 tests running).
The scheduler exited thinking it was done, even though there were still
84 destructive tests left in the queue.

When load is too high but no tests are running, the scheduler has no way
to make progress and exits prematurely.
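
A minimal sketch of the progress guarantee implied by that description, with made-up names (`pick_next`, `max_load`) and the 12.0 threshold borrowed from the numbers quoted above:

```python
import os

def pick_next(queue: list, running: set, max_load: float = 12.0):
    """Pick the next test to start, or None to wait.

    The key difference from the broken behaviour described above: if
    nothing is running, start a test even when the load is above the
    threshold, so the scheduler can always make progress instead of
    exiting with tests still queued.
    """
    if not queue:
        return None  # really done
    if not running or os.getloadavg()[0] < max_load:
        return queue.pop(0)
    return None  # throttled: caller should sleep and retry
```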
@martinpitt (Member, Author) commented:

For $deity's sake, can this get green at all? Putting back the load check in addition to running_procs, and lowering the numbers even more. This is getting ridiculously inefficient, although surprisingly it only took 64 minutes -- not really longer than the previous approaches.
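
As a rough sketch of what "load check in addition to running_procs" could look like (the thresholds below are placeholders, not the values that ended up in this PR):

```python
import os

MAX_RUNNING = 8        # placeholder caps, not the PR's final numbers
MAX_LOAD_1MIN = 12.0

def may_start_test(running_procs: int) -> bool:
    """Combine both throttles: a hard cap on concurrently running test
    processes and a back-off while the 1-minute load average is high.
    Always allow a start when nothing is running, to guarantee progress."""
    if running_procs == 0:
        return True
    if running_procs >= MAX_RUNNING:
        return False
    return os.getloadavg()[0] < MAX_LOAD_1MIN
```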
