Description
What is the problem you're trying to solve
In AI training scenarios, we want to improve overall cluster resource utilization by making effective use of idle resources. We propose an elastic training mode based on PyTorch DDP: the scheduler proactively scales a job's pods up and down using DDP's elastic capability, so that idle resources are put to work and can be reclaimed quickly when guaranteed tasks need them. In distributed AI training it is usually best to keep the number of instances at a power of 2 (or a multiple of 2), so when elastic tasks are preempted the scheduler is expected to keep the total number of remaining pods consistent with the user-defined scaling strategy.
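As an illustration of that retention rule, here is a minimal Go sketch; the function and parameter names are hypothetical, not an existing scheduler API, and the guaranteed minimum is assumed to itself be a power of 2.

```go
package main

import "fmt"

// retainedReplicas applies a power-of-2 retention strategy: it returns the
// largest power of 2 that is >= min and <= both max and the number of
// replicas the cluster's idle resources can currently host.
func retainedReplicas(min, max, affordable int) int {
	kept := min // the guaranteed minimum is never given up
	for n := min; n <= max && n <= affordable; n *= 2 {
		kept = n
	}
	return kept
}

func main() {
	// min=2, max=256, and idle resources can host 100 replicas.
	fmt.Println(retainedReplicas(2, 256, 100)) // prints 64
}
```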
Describe the solution you'd like
Start an elastic job with min=2 and max=256, and specify a power-of-2 scaling strategy. The minimum part (2 pods) is treated as guaranteed: gang scheduling applies, and it is scheduled normally by the allocate action. Within a scheduling cycle, elastic tasks are scheduled only after all guaranteed tasks have been scheduled. When the cluster has idle resources and the elastic tasks still fit within the queue's capability, more elastic pods are scheduled, and the total number of scheduled pods (including the guaranteed 2) is kept to a power of 2. Within a scheduling cycle, if a guaranteed task cannot find a matching node, elastic tasks are preempted to make room for it. Because the scaling strategy of the preempted elastic job must still be satisfied after preemption, the preemption decision has to be made at the job level rather than per pod.
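To make the preemption side concrete, here is a hedged Go sketch (illustrative names only, not the existing reclaim code) of shrinking a preempted elastic job to the next replica count that still satisfies a power-of-2 strategy, instead of evicting an arbitrary number of its pods.

```go
package main

import "fmt"

// shrinkTo returns the replica count an elastic job should keep after being
// preempted: the largest power of 2 that fits into current-needed, never
// going below the guaranteed minimum (min is assumed to be a power of 2).
func shrinkTo(current, needed, min int) int {
	remaining := current - needed
	keep := min
	for n := min; n <= remaining; n *= 2 {
		keep = n
	}
	return keep
}

func main() {
	// A 64-pod elastic job is asked to give up 10 pods for a guaranteed job;
	// shrinking to 32 keeps the job on the strategy and frees 32 pods at once.
	fmt.Println(shrinkTo(64, 10, 2)) // prints 32
}
```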
In AI training scenarios, when a training task is not urgent, it can be started as an elastic job that is guaranteed only its minimum resources. When the cluster has idle resources, they are used efficiently for training; when other guaranteed tasks need resources, the elastic pods can be killed at any time to release them. To prevent frequent, ineffective scale-out and scale-in, a cooldown period for elastic scaling can be specified per job; if not set, it defaults to 10 minutes.
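A minimal sketch of that cooldown check, assuming a per-job "last scaled" timestamp (the names are illustrative; only the 10-minute default comes from the proposal):

```go
package main

import (
	"fmt"
	"time"
)

// scalingAllowed reports whether a job may scale again, given the time of its
// last scale operation and a per-job cooldown (defaulting to 10 minutes when
// the job does not set one).
func scalingAllowed(lastScaled time.Time, cooldown time.Duration) bool {
	if cooldown <= 0 {
		cooldown = 10 * time.Minute // default cooldown from the proposal
	}
	return time.Since(lastScaled) >= cooldown
}

func main() {
	lastScaled := time.Now().Add(-3 * time.Minute)
	fmt.Println(scalingAllowed(lastScaled, 0)) // false: still cooling down
}
```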
This approach has been implemented and running in our company for some time and has greatly improved resource utilization. If this scenario fits the community's roadmap, I would be very glad to contribute the code and grow with the community.
Additional context
A new elastic action needs to be added that is responsible for scheduling elastic tasks; its scheduling strategy should satisfy the requirements for elastic scale-out and scale-in.
Elastic-task preemption logic needs to be added to the reclaim action, and victims should be selected at the job level.
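One possible shape for that job-level logic, as a hedged sketch (all types and names are illustrative and not the existing reclaim implementation): victims are selected per job, and each victim job is shrunk to the next size its strategy allows, rather than evicting individual pods.

```go
package main

import "fmt"

type elasticJob struct {
	name         string
	running, min int
}

// pickVictims walks elastic jobs and shrinks each one to the next size allowed
// by its strategy (largest power of 2 within the remainder, floored at min)
// until enough pods have been freed for the reclaiming task. Victims are
// chosen per job, not per pod, so every surviving job stays on its strategy.
func pickVictims(jobs []elasticJob, podsNeeded int) map[string]int {
	evict := map[string]int{}
	for i := range jobs {
		if podsNeeded <= 0 {
			break
		}
		j := &jobs[i]
		// Keep the largest power of 2 that still frees podsNeeded pods from
		// this job; fall back to the guaranteed minimum if that is impossible.
		keep := j.min
		for n := j.min; n <= j.running-podsNeeded; n *= 2 {
			keep = n
		}
		released := j.running - keep
		if released <= 0 {
			continue
		}
		evict[j.name] = released
		j.running = keep
		podsNeeded -= released
	}
	return evict
}

func main() {
	jobs := []elasticJob{{"job-a", 64, 2}, {"job-b", 8, 2}}
	// Freeing 40 pods shrinks job-a from 64 to 16 (the next valid size),
	// releasing 48 pods in one step; job-b is left untouched.
	fmt.Println(pickVictims(jobs, 40)) // map[job-a:48]
}
```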