israbbani
Contributor

@israbbani israbbani commented Oct 12, 2025

This PR stacks on #57269.

For more details about the resource isolation project see #54703.

In this PR, I've revised the algorithm for choosing default values for system-reserved-cpu and system-reserved-memory. The intent behind the algorithm is to:

  1. pick the value as a proportion of the available system resources, where the proportion is specified by DEFAULT_SYSTEM_RESERVED_{MEMORY|CPU_CORES}_PROPORTION.
  2. if the value < DEFAULT_MIN_SYSTEM_RESERVED_{CPU_CORES|MEMORY}, use that minimum constant instead.
  3. if the value > DEFAULT_MAX_SYSTEM_RESERVED_{CPU_CORES|MEMORY}, use that maximum constant instead.

This has a few useful properties:

  • on small machines, there's a minimum value (avoiding pathological cases such as reserving 5% of 1 core, i.e. 0.05 cores, for the system cgroup)
  • on larger machines, the value scales up with the amount of available resources, up to a maximum (avoiding pathological cases such as reserving 6 cores for system processes on a 128-core machine)
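The three steps and the two properties above amount to clamping a proportional value between a floor and a ceiling. The following is a minimal sketch, not the code from this PR: the function names and the short constant names are hypothetical, the numeric values (5% / 1 / 3 cores, 10% / 0.5G / 10G of memory) are taken from the commit messages in this PR, and reading "G" as GiB is an assumption.

```python
# Illustrative sketch only; names are hypothetical and do not mirror Ray's code.
GIB = 1024 ** 3  # assumption: "G" in the commit messages is read as GiB

# Values quoted in this PR's commit messages.
CPU_PROPORTION = 0.05
MIN_CPU_CORES = 1.0
MAX_CPU_CORES = 3.0

MEMORY_PROPORTION = 0.10
MIN_MEMORY_BYTES = GIB // 2
MAX_MEMORY_BYTES = 10 * GIB


def clamp_reservation(available, proportion, minimum, maximum):
    """Take a proportion of `available`, then bound it to [minimum, maximum]."""
    proposed = available * proportion        # step 1: proportional value
    return min(max(proposed, minimum), maximum)  # steps 2 and 3: floor, then cap


def default_system_reserved_cpu(total_cores):
    return clamp_reservation(
        total_cores, CPU_PROPORTION, MIN_CPU_CORES, MAX_CPU_CORES
    )


def default_system_reserved_memory(total_bytes):
    return int(
        clamp_reservation(
            total_bytes, MEMORY_PROPORTION, MIN_MEMORY_BYTES, MAX_MEMORY_BYTES
        )
    )


if __name__ == "__main__":
    for cores in (1, 40, 128):
        print(cores, "cores ->", default_system_reserved_cpu(cores), "reserved")
    for gib in (4, 64, 256):
        reserved = default_system_reserved_memory(gib * GIB)
        print(gib, "GiB ->", reserved / GIB, "GiB reserved")
```

Under these assumptions, a 1-core machine is raised to the 1-core floor rather than reserving 0.05 cores, and a 128-core machine is capped at 3 cores rather than reserving 6.4.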

israbbani and others added 30 commits September 30, 2025 17:17
CgroupManagerFactory which constructs a cross-platform cgroup manager
with selective compilation

Signed-off-by: irabbani <[email protected]>
and CgroupManagerFactory are the only public targets.
CgroupManagerFactory will delegate to the appropriate implementation for
each platform.

Signed-off-by: irabbani <[email protected]>
build. The Cgroup subsystem only exposes CgroupManagerInterface and
CgroupManagerFactory as public targets.

Signed-off-by: irabbani <[email protected]>
Signed-off-by: irabbani <[email protected]>
Signed-off-by: israbbani <[email protected]>
subtrees:
- the system cgroup has all ray system processes.
- the workers cgroup has all ray worker processes.
- the user cgroup has all other non-ray processes on the system (usually
  used with containers).

Updated the integration tests.

Signed-off-by: irabbani <[email protected]>
Co-authored-by: Edward Oakes <[email protected]>
Signed-off-by: Ibrahim Rabbani <[email protected]>
Signed-off-by: irabbani <[email protected]>
Signed-off-by: israbbani <[email protected]>
Signed-off-by: irabbani <[email protected]>
Signed-off-by: irabbani <[email protected]>
Signed-off-by: irabbani <[email protected]>
now be between [1, 3] cpu cores (converted to weights) and [0.5G, 10G]
of memory.

Signed-off-by: irabbani <[email protected]>
values for system-reserved-cpu and system-reserved-memory such that the
values
- always greater than a minimum value (1 cpu, 0.5GB of memory)
- scale proportionately with system resources once above the minimum (5% of cpu,
  10% of memory)
- cap out at a maximum value (3 cpus, 10G of memory)

Rewrote the unit tests to test this logic.

Signed-off-by: irabbani <[email protected]>
@israbbani israbbani added the "core" (Issues that should be addressed in Ray Core) and "go" (add ONLY when ready to merge, run all tests) labels Oct 12, 2025
@israbbani israbbani marked this pull request as ready for review October 12, 2025 21:33
@israbbani israbbani requested a review from a team as a code owner October 12, 2025 21:33
cursor[bot]

This comment was marked as outdated.

Signed-off-by: irabbani <[email protected]>
Signed-off-by: irabbani <[email protected]>
Signed-off-by: irabbani <[email protected]>
Base automatically changed from irabbani/cgroups-15 to master October 13, 2025 13:29
@edoakes
Collaborator

edoakes commented Oct 13, 2025

@israbbani I merged upstream and re-triggered CI after merging the other one; ping to merge if tests pass here
