Commit 9097159

expand quota maint section
1 parent 7824dd7 commit 9097159

2 files changed, +32 -12 lines changed

QUOTA_MAINTENANCE.md

+32-12
@@ -1,9 +1,26 @@
11
# Quota Maintenance
22

3-
Kubernetes built-in `ResourceQuotas` should not be combined with Kueue quotas.
3+
A *team* in MLBatch is a group of users that share a resource quota.
44

5-
Kueue quotas can be adjusted post creation. Workloads already admitted are not
6-
impacted.
5+
In Kueue, the `ClusterQueue` is the abstraction used to define a pool
6+
of resources (`cpu`, `memory`, `nvidia.com/gpu`, etc.) that is
7+
available to a team. A `LocalQueue` is the abstraction used by
8+
members of the team to submit workloads to a `ClusterQueue` for
9+
execution using those resources.
10+
11+
Kubernetes built-in `ResourceQuotas` should not be used for resources that
12+
are being managed by `ClusterQueues`. The two quota systems are incompatible.
13+
14+
We strongly recommend maintaining a simple relationship between
15+
teams, namespaces, `ClusterQueues`, and `LocalQueues`. Each
16+
team should be assigned to its own namespace that contains a single
17+
`LocalQueue` which is configured to be the only `LocalQueue` that
18+
targets the team's `ClusterQueue`.
19+
20+
The quotas assigned to a `ClusterQueue` can be dynamically adjusted by
21+
a cluster admin at any time. Adjustments to quotas only impact queued
22+
workloads; workloads already admitted for execution are not impacted
23+
by quota adjustments.
724

825
For Kueue quotas to be effective, the sum of all quotas for each managed
926
resource (`cpu`, `memory`, `nvidia.com/gpu`, `pods`) must be maintained to
@@ -14,15 +31,18 @@ less. Quotas should be reduced when the available capacity is reduced whether
1431
because of failures or due to the allocation of resources to non-batch
1532
workloads.
1633

17-
To facilitate the necessary quota adjustments, one option is to setup a
18-
dedicated cluster queue for slack capacity that other cluster queues can borrow
19-
from. This queue should not be associated with any team, project, namespace, or
20-
local queue. Its quota should be adjusted dynamically to reflect changes in
21-
cluster capacity. If sized appropriately, this queue will make adjustments to
22-
other cluster queues unnecessary for small cluster capacity changes. Concretely,
23-
two teams could be granted 45% of the cluster capacity, with 10% capacity set
24-
aside for this extra cluster queue. Any changes to the cluster capacity below
25-
10% can then be handled by adjusting the latter.
34+
To facilitate the necessary quota adjustments, we recommend setting up
35+
a dedicated cluster queue for slack capacity that other cluster queues
36+
can borrow from. This queue should not be associated with any team,
37+
project, namespace, or local queue. Its `lendingLimit` should be adjusted
38+
dynamically to reflect changes in cluster capacity. If sized
39+
appropriately, this queue will make adjustments to other cluster
40+
queues unnecessary for small cluster capacity changes. The figure
41+
below shows this recommended setup for an MLBatch cluster with three
42+
teams. Beginning with RHOAI 2.12 (AppWrapper v0.23), the dynamic
43+
adjustment of the Slack `ClusterQueue` `lendingLimit` can be
44+
configured to be fully automated.
45+
![Figure with ClusterQueues for three teams and slack](./figures/CohortWithSlackCQ.png)
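
The sizing of the slack `ClusterQueue` can be illustrated with a little arithmetic. The following is a minimal sketch, not the automation shipped with AppWrapper: it assumes the slack `lendingLimit` for each resource is the current cluster capacity minus the sum of the team quotas, floored at zero. The function names and the 45/45/10 split (taken from the earlier revision of this section) are illustrative only.

```python
from typing import Dict

Quota = Dict[str, int]  # resource name -> quantity, e.g. {"nvidia.com/gpu": 45}


def slack_lending_limit(capacity: Quota, team_quotas: Dict[str, Quota]) -> Quota:
    """Capacity left over for the slack queue once team quotas are subtracted.

    Assumption: slack = capacity - sum(team quotas), never negative.
    """
    limits: Quota = {}
    for resource, available in capacity.items():
        committed = sum(q.get(resource, 0) for q in team_quotas.values())
        limits[resource] = max(0, available - committed)
    return limits


def quota_shortfall(capacity: Quota, team_quotas: Dict[str, Quota]) -> Quota:
    """Amount by which team quotas exceed capacity (0 if they fit).

    A non-zero value means the slack queue alone cannot absorb the loss
    and the team quotas themselves need to be reduced.
    """
    shortfall: Quota = {}
    for resource, available in capacity.items():
        committed = sum(q.get(resource, 0) for q in team_quotas.values())
        shortfall[resource] = max(0, committed - available)
    return shortfall


# Hypothetical cluster: 100 GPUs, two teams with 45 each, 10 left as slack.
teams = {
    "team-a": {"nvidia.com/gpu": 45},
    "team-b": {"nvidia.com/gpu": 45},
}

healthy = slack_lending_limit({"nvidia.com/gpu": 100}, teams)
degraded = slack_lending_limit({"nvidia.com/gpu": 92}, teams)   # 8 GPUs lost
exhausted = slack_lending_limit({"nvidia.com/gpu": 85}, teams)  # 15 GPUs lost
overcommit = quota_shortfall({"nvidia.com/gpu": 85}, teams)
```

In this sketch, a capacity loss smaller than the slack only shrinks the slack `lendingLimit`. Once `quota_shortfall` reports a non-zero value, the team `ClusterQueue` quotas have to be reduced as well.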
2646

2747
Every resource name occurring in the resource requests or limits of a workload
2848
must be covered by a cluster queue intended to admit the workload, even if the

figures/CohortWithSlackCQ.png

122 KB