
CA DRA: integrate with cluster-wide resource limits #8184

Open
@towca

Description


Which component are you using?:

/area cluster-autoscaler
/area core-autoscaler
/wg device-management

Is your feature request designed to solve a problem? If so describe the problem this feature should solve.:

Cluster Autoscaler allows users to configure cluster-wide resource limits which are taken into account when making scaling decisions, together with the usual per-NodeGroup node count limits. The resource limits can be defined for standard resources (CPU, memory), as well as for custom extended resources (e.g. nvidia.com/gpu). However, making the limits work with a custom extended resource requires writing a CustomResourcesProcessor which translates a Node into a list of custom resources (<resource_type, resource_count>).
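
For reference, the cluster-wide limits are configured via Cluster Autoscaler startup flags; a minimal example (the concrete values are made up):

```
# <min>:<max> for cores/memory, <gpu_type>:<min>:<max> for GPUs (repeatable per type)
cluster-autoscaler \
  --cores-total=8:1024 \
  --memory-total=32:4096 \
  --gpu-total=nvidia-tesla-t4:0:16
```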

We have a CustomResourcesProcessor which handles nvidia.com/gpu by asking the CloudProvider for a "GPU type label" and taking the value of that label from the Node. So if we have a Node with allocatable nvidia.com/gpu: 4 and a cloud.provider.com/gpu-type: nvidia-tesla-t4 label, the processor translates it into nvidia-tesla-t4: 4. Whenever CA wants to provision such a Node, it checks if there is a limit configured for nvidia-tesla-t4, and if so, whether that limit still allows adding 4 more units of the resource.
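
A minimal, self-contained sketch of what that translation boils down to (this is not the actual CustomResourcesProcessor interface; the customResourceTarget type is illustrative, and the real processor obtains the label name from the CloudProvider rather than taking it as a plain argument):

```go
package main

import (
	"fmt"

	apiv1 "k8s.io/api/core/v1"
	"k8s.io/apimachinery/pkg/api/resource"
)

// customResourceTarget is a simplified stand-in for the <resource_type, resource_count>
// pairs that the cluster-wide limits are checked against.
type customResourceTarget struct {
	ResourceType  string
	ResourceCount int64
}

// gpuTargetsForNode mirrors, conceptually, what the nvidia.com/gpu processor does:
// read the allocatable count of the extended resource and take the "type" from the
// value of the provider-supplied GPU type label.
func gpuTargetsForNode(node *apiv1.Node, gpuTypeLabel string) []customResourceTarget {
	count, ok := node.Status.Allocatable["nvidia.com/gpu"]
	if !ok || count.IsZero() {
		return nil
	}
	gpuType, ok := node.Labels[gpuTypeLabel]
	if !ok {
		return nil
	}
	return []customResourceTarget{{ResourceType: gpuType, ResourceCount: count.Value()}}
}

func main() {
	node := &apiv1.Node{}
	node.Labels = map[string]string{"cloud.provider.com/gpu-type": "nvidia-tesla-t4"}
	node.Status.Allocatable = apiv1.ResourceList{"nvidia.com/gpu": resource.MustParse("4")}
	// Prints [{nvidia-tesla-t4 4}], which is then compared against the configured limit.
	fmt.Println(gpuTargetsForNode(node, "cloud.provider.com/gpu-type"))
}
```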

It's not clear how to extend this model for DRA resources, because DRA devices are expressed differently than extended resources.

Using DRA, Node-local devices are exposed as 3-tuples: <driver_name, pool_name, device_name>. Taking the example above, a Node with 4 nvidia-tesla-t4 GPUs would expose 4 separate devices, e.g.: <nvidia.com, name-of-the-node, gpu-0>, <nvidia.com, name-of-the-node, gpu-1>, <nvidia.com, name-of-the-node, gpu-2>, <nvidia.com, name-of-the-node, gpu-3>. Each such device would also have attribute maps, which should contain the nvidia-tesla-t4 value somewhere.
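
For illustration, here is roughly what such a Node's ResourceSlice looks like when built with the resource.k8s.io/v1beta1 Go types (the API version is just the one picked for this sketch, and the productName attribute key is an assumption; each driver defines its own attribute names):

```go
package main

import (
	"fmt"

	resourceapi "k8s.io/api/resource/v1beta1"
	"k8s.io/utils/ptr"
)

// exampleSlice builds the ResourceSlice a DRA driver would publish for a node with
// 4 nvidia-tesla-t4 GPUs: 4 devices identified by <driver, pool, device> tuples,
// with the GPU model only appearing inside the per-device attribute map.
func exampleSlice(nodeName string) resourceapi.ResourceSlice {
	devices := make([]resourceapi.Device, 0, 4)
	for i := 0; i < 4; i++ {
		devices = append(devices, resourceapi.Device{
			Name: fmt.Sprintf("gpu-%d", i),
			Basic: &resourceapi.BasicDevice{
				Attributes: map[resourceapi.QualifiedName]resourceapi.DeviceAttribute{
					// Assumed attribute key; real drivers choose their own.
					"productName": {StringValue: ptr.To("nvidia-tesla-t4")},
				},
			},
		})
	}
	return resourceapi.ResourceSlice{
		Spec: resourceapi.ResourceSliceSpec{
			Driver:   "nvidia.com",
			Pool:     resourceapi.ResourcePool{Name: nodeName, ResourceSliceCount: 1},
			NodeName: nodeName,
			Devices:  devices,
		},
	}
}

func main() {
	slice := exampleSlice("name-of-the-node")
	for _, dev := range slice.Spec.Devices {
		// <nvidia.com, name-of-the-node, gpu-0> ... <nvidia.com, name-of-the-node, gpu-3>
		fmt.Printf("<%s, %s, %s>\n", slice.Spec.Driver, slice.Spec.Pool.Name, dev.Name)
	}
}
```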

Describe the solution you'd like.:

CA needs some way to determine the "type" of a given device, so that limits can be configured and checked per type, presumably by looking at the device attributes. We could follow the current approach and ask the CloudProvider which attribute to look at, but IMO it'd be better to standardize a "device type" (or "device type for the purpose of autoscaling limits") attribute at the Kubernetes layer.
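
A sketch of what the DRA-side translation could look like once it's settled which attribute carries the type, whether that's an attribute name reported by the CloudProvider or a standardized Kubernetes-level one (the productName key is again just an assumed stand-in, and the v1beta1 types are the ones picked for the sketch):

```go
package main

import (
	"fmt"

	resourceapi "k8s.io/api/resource/v1beta1"
)

// deviceTypeCounts aggregates a node's DRA devices into <type, count> pairs that can be
// checked against cluster-wide limits, analogously to extended resources today.
// typeAttribute is whichever attribute is agreed to carry the device "type".
func deviceTypeCounts(slices []resourceapi.ResourceSlice, typeAttribute resourceapi.QualifiedName) map[string]int64 {
	counts := map[string]int64{}
	for _, slice := range slices {
		for _, dev := range slice.Spec.Devices {
			if dev.Basic == nil {
				continue
			}
			attr, ok := dev.Basic.Attributes[typeAttribute]
			if !ok || attr.StringValue == nil {
				// Devices without the agreed "type" attribute can't be attributed to any
				// configured limit; CA would have to decide how to treat them.
				continue
			}
			counts[*attr.StringValue]++
		}
	}
	return counts
}

// t4Device builds one of the example node's GPU devices.
func t4Device(name string) resourceapi.Device {
	model := "nvidia-tesla-t4"
	return resourceapi.Device{
		Name: name,
		Basic: &resourceapi.BasicDevice{
			Attributes: map[resourceapi.QualifiedName]resourceapi.DeviceAttribute{
				"productName": {StringValue: &model},
			},
		},
	}
}

func main() {
	slices := []resourceapi.ResourceSlice{{
		Spec: resourceapi.ResourceSliceSpec{
			Driver:  "nvidia.com",
			Pool:    resourceapi.ResourcePool{Name: "name-of-the-node", ResourceSliceCount: 1},
			Devices: []resourceapi.Device{t4Device("gpu-0"), t4Device("gpu-1"), t4Device("gpu-2"), t4Device("gpu-3")},
		},
	}}
	// Prints map[nvidia-tesla-t4:4], which slots directly into the existing
	// <resource_type, resource_count> limit checks.
	fmt.Println(deviceTypeCounts(slices, "productName"))
}
```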
