Description
What is the problem you're trying to solve
In AI training scenarios, we want to improve overall cluster resource utilization by making effective use of idle resources. We propose an elastic training mode based on PyTorch DDP: the scheduler proactively scales a job's pods up and down using DDP's elastic capability, so that idle resources are put to work and can be reclaimed quickly when guaranteed tasks need them. In distributed AI training it is usually best to keep the number of instances at a power of 2 (or a multiple of 2), so when elastic tasks are preempted the scheduler is expected to keep the total number of remaining pods consistent with the user-defined scaling strategy.
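As an illustration of that retention rule, here is a minimal Go sketch; the function and parameter names are hypothetical, not an existing scheduler API, and the guaranteed minimum is assumed to itself be a power of 2.

```go
package main

import "fmt"

// retainedReplicas applies a power-of-2 retention strategy: it returns the
// largest power of 2 that is >= min and <= both max and the number of
// replicas the cluster's idle resources can currently host.
func retainedReplicas(min, max, affordable int) int {
	kept := min // the guaranteed minimum is never given up
	for n := min; n <= max && n <= affordable; n *= 2 {
		kept = n
	}
	return kept
}

func main() {
	// min=2, max=256, and idle resources can host 100 replicas.
	fmt.Println(retainedReplicas(2, 256, 100)) // prints 64
}
```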
Describe the solution you'd like
Start an elastic job with min=2 and max=256, and specify a power-of-2 scaling strategy. The minimum part (2 pods) is treated as guaranteed: gang scheduling applies, and it is scheduled normally by the allocate action. Within a scheduling cycle, elastic tasks are scheduled only after all guaranteed tasks have been scheduled. When the cluster has idle resources and the elastic tasks still fit within the queue's capability, more elastic pods are scheduled, and the total number of scheduled pods (including the guaranteed 2) is kept to a power of 2. Within a scheduling cycle, if a guaranteed task cannot find a matching node, elastic tasks are preempted to make room for it. Because the scaling strategy of the preempted elastic job must still be satisfied after preemption, the preemption decision has to be made at the job level rather than per pod.
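To make the preemption side concrete, here is a hedged Go sketch (illustrative names only, not the existing reclaim code) of shrinking a preempted elastic job to the next replica count that still satisfies a power-of-2 strategy, instead of evicting an arbitrary number of its pods.

```go
package main

import "fmt"

// shrinkTo returns the replica count an elastic job should keep after being
// preempted: the largest power of 2 that fits into current-needed, never
// going below the guaranteed minimum (min is assumed to be a power of 2).
func shrinkTo(current, needed, min int) int {
	remaining := current - needed
	keep := min
	for n := min; n <= remaining; n *= 2 {
		keep = n
	}
	return keep
}

func main() {
	// A 64-pod elastic job is asked to give up 10 pods for a guaranteed job;
	// shrinking to 32 keeps the job on the strategy and frees 32 pods at once.
	fmt.Println(shrinkTo(64, 10, 2)) // prints 32
}
```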
In AI training scenarios, when a training task is not urgent, it can be started as an elastic job that is guaranteed only its minimum resources. When the cluster has idle resources, they are used efficiently for training; when other guaranteed tasks need resources, the elastic pods can be killed at any time to release them. To prevent frequent, ineffective scale-out and scale-in, a cooldown period for elastic scaling can be specified per job; if not set, it defaults to 10 minutes.
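A minimal sketch of that cooldown check, assuming a per-job "last scaled" timestamp (the names are illustrative; only the 10-minute default comes from the proposal):

```go
package main

import (
	"fmt"
	"time"
)

// scalingAllowed reports whether a job may scale again, given the time of its
// last scale operation and a per-job cooldown (defaulting to 10 minutes when
// the job does not set one).
func scalingAllowed(lastScaled time.Time, cooldown time.Duration) bool {
	if cooldown <= 0 {
		cooldown = 10 * time.Minute // default cooldown from the proposal
	}
	return time.Since(lastScaled) >= cooldown
}

func main() {
	lastScaled := time.Now().Add(-3 * time.Minute)
	fmt.Println(scalingAllowed(lastScaled, 0)) // false: still cooling down
}
```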
This approach has been implemented and running in our company for some time and has greatly improved resource utilization. If this scenario fits the community's roadmap, I would be very glad to contribute the code and grow with the community.
Additional context
A new elastic action needs to be added that is responsible for scheduling elastic tasks; its scheduling strategy should satisfy the requirements for elastic scale-out and scale-in.
Elastic-task preemption logic needs to be added to the reclaim action, and victims should be selected at the job level.
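One possible shape for that job-level logic, as a hedged sketch (all types and names are illustrative and not the existing reclaim implementation): victims are selected per job, and each victim job is shrunk to the next size its strategy allows, rather than evicting individual pods.

```go
package main

import "fmt"

type elasticJob struct {
	name         string
	running, min int
}

// pickVictims walks elastic jobs and shrinks each one to the next size allowed
// by its strategy (largest power of 2 within the remainder, floored at min)
// until enough pods have been freed for the reclaiming task. Victims are
// chosen per job, not per pod, so every surviving job stays on its strategy.
func pickVictims(jobs []elasticJob, podsNeeded int) map[string]int {
	evict := map[string]int{}
	for i := range jobs {
		if podsNeeded <= 0 {
			break
		}
		j := &jobs[i]
		// Keep the largest power of 2 that still frees podsNeeded pods from
		// this job; fall back to the guaranteed minimum if that is impossible.
		keep := j.min
		for n := j.min; n <= j.running-podsNeeded; n *= 2 {
			keep = n
		}
		released := j.running - keep
		if released <= 0 {
			continue
		}
		evict[j.name] = released
		j.running = keep
		podsNeeded -= released
	}
	return evict
}

func main() {
	jobs := []elasticJob{{"job-a", 64, 2}, {"job-b", 8, 2}}
	// Freeing 40 pods shrinks job-a from 64 to 16 (the next valid size),
	// releasing 48 pods in one step; job-b is left untouched.
	fmt.Println(pickVictims(jobs, 40)) // map[job-a:48]
}
```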