
Reduce excessive number of running datamovers relative to loadConcurrency values #8344

Open
@msfrucht

Description

Describe the problem/challenge you have
During a backup with EnableCSI and snapshot-move-data enabled, a large number of datamovers is started after the initial resource backup finishes.

Some of our applications have upwards of 100 PVCs, resulting in nearly a hundred datamover pods.

The existing design creates all of the datamovers up front but only lets as many of them perform the backup in parallel as the loadConcurrency configuration allows. While the waiting datamover pods use little in the way of actual resources, their resource limits have to be set to what is required to back up the largest volume, which skews scheduling of other workloads.

loadConcurrency cannot be increased beyond a certain point before kopia begins to run the nodes out of memory.

The attached debug logs illustrate a small example: with loadConcurrency constrained to 1 and loadAffinity restricted to a single node, a backup of 10 PVCs results in 10 datamover pods, only one of which executes.

bundle-2024-10-23-15-47-30.tar.gz
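To make the gating behavior concrete, below is a minimal, self-contained Go sketch of the kind of check this issue is asking to happen *before* pod creation. It is only an illustration: the `NodeAgentConfig`/`LoadConcurrency` structs, the `canStartDataMover` helper, and the JSON shape are assumptions for this example, not Velero's actual types or configuration format.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// Hypothetical mirrors of the node-agent concurrency settings this issue
// refers to; field names are illustrative, not Velero's actual types.
type LoadConcurrency struct {
	GlobalConfig int `json:"globalConfig"`
}

type NodeAgentConfig struct {
	LoadConcurrency *LoadConcurrency `json:"loadConcurrency"`
}

// canStartDataMover is a hypothetical helper showing the gate this issue wants
// applied before the datamover pod is created: only allow a new datamover when
// the number already running on the node is below the configured limit.
func canStartDataMover(cfg NodeAgentConfig, runningOnNode int) bool {
	limit := 1 // conservative default when no concurrency config is supplied
	if cfg.LoadConcurrency != nil && cfg.LoadConcurrency.GlobalConfig > 0 {
		limit = cfg.LoadConcurrency.GlobalConfig
	}
	return runningOnNode < limit
}

func main() {
	// Illustrative config: loadConcurrency of 1, as in the attached bundle.
	raw := `{"loadConcurrency": {"globalConfig": 1}}`

	var cfg NodeAgentConfig
	if err := json.Unmarshal([]byte(raw), &cfg); err != nil {
		panic(err)
	}

	fmt.Println(canStartDataMover(cfg, 0)) // true  -> create the datamover pod
	fmt.Println(canStartDataMover(cfg, 1)) // false -> hold off, do not create the pod
}
```

With loadConcurrency set to 1, as in the attached bundle, only one datamover at a time would pass this check, so only one pod would exist instead of ten idle ones.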

Describe the solution you'd like
Do not deploy datamover pods until the load concurrency policy allows them to run.

Anything else you would like to add:

Proposed Solution:

  1. New state: DatamoverDeploying

Instead of deploying the datamover during Prepared, move the concurrency check and the datamover pod creation into this state.

  2. State change: Accepted -> Prepared:
    Largely remains unchanged, but without the datamover pod creation.

Copy the VolumeSnapshot and VolumeSnapshotContents, and create the PVC. On restore, the only change is to create the PVC from the application PVC spec. Do not create the datamover container.

  3. State change:
    Prepared -> DatamoverDeploying:

The concurrency check occurs in this state. On success, create the datamover Pod (see the sketch after this list).

  4. State change: DatamoverDeploying -> InProgress:
    The DataUpload should move into InProgress when the datamover pod has phase Running, the same condition that moves it out of Prepared today.
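To make the proposal concrete, here is a minimal, self-contained Go sketch of the state machine above. The phase names come from the proposal, but everything else (the `DataUpload` stand-in type, `runningDataMovers`, the hard-coded `loadConcurrency`) is a hypothetical placeholder for this example, not Velero's actual controller code.

```go
package main

import "fmt"

// Minimal, hypothetical stand-ins for the real CRD types; only the fields
// needed to illustrate the proposed state machine are included.
type DataUploadStatus struct {
	Phase string
}

type DataUpload struct {
	Name   string
	Node   string
	Status DataUploadStatus
}

const (
	phasePrepared           = "Prepared"
	phaseDatamoverDeploying = "DatamoverDeploying" // proposed new phase
	phaseInProgress         = "InProgress"
)

// runningDataMovers tracks datamover pods per node; a stand-in for what the
// real controller would learn from the API server.
var runningDataMovers = map[string]int{}

// loadConcurrency is the per-node limit; hard-coded here, normally read from
// the node-agent configuration.
const loadConcurrency = 1

// reconcile sketches the proposed transitions: pod creation only happens in
// DatamoverDeploying, and only when the concurrency policy allows it.
func reconcile(du *DataUpload) {
	switch du.Status.Phase {
	case phasePrepared:
		// Prepared no longer creates the datamover pod; it hands off to the new state.
		du.Status.Phase = phaseDatamoverDeploying

	case phaseDatamoverDeploying:
		if runningDataMovers[du.Node] >= loadConcurrency {
			// Defer: no pod exists yet, so nothing sits idle on the node.
			return
		}
		runningDataMovers[du.Node]++              // stand-in for creating the datamover pod
		du.Status.Phase = phaseInProgress         // in reality, set once the pod reports Running

	case phaseInProgress:
		// Unchanged from today: the upload proceeds while the pod runs.
	}
}

func main() {
	uploads := []*DataUpload{
		{Name: "du-1", Node: "node-a", Status: DataUploadStatus{Phase: phasePrepared}},
		{Name: "du-2", Node: "node-a", Status: DataUploadStatus{Phase: phasePrepared}},
	}

	// Two reconcile passes: both uploads reach DatamoverDeploying, but only
	// the first gets a pod while loadConcurrency is 1.
	for i := 0; i < 2; i++ {
		for _, du := range uploads {
			reconcile(du)
		}
	}
	for _, du := range uploads {
		fmt.Println(du.Name, du.Status.Phase)
	}
}
```

Running this with two DataUploads on one node and a loadConcurrency of 1 leaves the second upload parked in DatamoverDeploying with no pod created, which is the resource-usage improvement this issue is after.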

Environment:

  • Velero version (use velero version):

Client:
Version: main
Git commit: c53ab20-dirty
Server:
Version: main

The only changes are to the Makefile and a modified Dockerfile.ubi from github.com/openshift/velero to make a Velero container compatible with OpenShift. The Dockerfile from Velero will not work with OpenShift due to permission-denied errors when creating folders. I am aware of my less-than-stellar reputation. I do not have access to generic Kubernetes at all, nor am I allowed to deploy generic Kubernetes, so I have no choice but to use an alternate Dockerfile to get Velero into a container without the Red Hat OADP additions and internal modifications.

A copy of the changes to the Makefile and the added Dockerfile.ubi is available in my branch at https://github.com/msfrucht/openshift-velero/tree/velero_in_openshift

  • Kubernetes version (use kubectl version):

Client Version: 4.15.12
Kustomize Version: v5.0.4-0.20230601165947-6ce0bf390ce3
Server Version: 4.16.8
Kubernetes Version: v1.29.7+4510e9c

  • Kubernetes installer & version: Red Hat OpenShift
  • Cloud provider or hardware configuration: on-premise Red Hat Hypershift virtualized cluster
  • OS (e.g. from /etc/os-release): Red Hat Enterprise Linux CoreOS 416.94.202408132101-0

Vote on this issue!

This is an invitation to the Velero community to vote on issues; you can see the project's top-voted issues listed here.
Use the "reaction smiley face" up to the right of this comment to vote.

  • 👍 for "The project would be better with this feature added"
  • 👎 for "This feature will not enhance the project in a meaningful way"
