-
Notifications
You must be signed in to change notification settings - Fork 1.5k
Design: Backup Cancellation #9284
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Changes from 2 commits
90a3a65
c759147
857a7f8
359b293
4afa1ae
ca2deea
e95a636
File filter
Filter by extension
Conversations
Jump to
Diff view
Diff view
There are no files selected for viewing
| Original file line number | Diff line number | Diff line change | ||||
|---|---|---|---|---|---|---|
| @@ -0,0 +1,154 @@ | ||||||
|
|
||||||
| # Backup Cancellation Design | ||||||
|
|
||||||
| ## Abstract | ||||||
| This proposal introduces user-initiated backup cancellation functionality to Velero, allowing users to abort running backups through a new `cancel` field in the backup specification. | ||||||
| The design addresses GitHub issues [#9189](https://github.com/vmware-tanzu/velero/issues/9189 | ||||||
| ) and [#2098](https://github.com/vmware-tanzu/velero/issues/2098) by providing a mechanism to cleanly cancel async operations and prevent resource leaks when backups need to be terminated. | ||||||
|
|
||||||
| ## Background | ||||||
| Currently, Velero lacks the ability to cancel running backups, leading to several critical issues. | ||||||
| When users accidentally submit broad backup jobs (e.g., forgot to narrow resource selectors), the system becomes blocked and scheduled jobs accumulate. | ||||||
| Additionally, the backup deletion controller prevents running backups from being deleted. | ||||||
|
|
||||||
|
|
||||||
| ## Goals | ||||||
| - Enable users to cancel running backups through a `cancel` field in the backup specification | ||||||
| - Cleanly cancel all associated async operations (BackupItemAction operations, DataUploads, PodVolumeBackups) | ||||||
| - Provide clear backup phase transitions (InProgress → Cancelling → Cancelled) | ||||||
|
|
||||||
| ## Non Goals | ||||||
| - Cancelling backups that have already completed or failed | ||||||
| - Rolling back partially completed backup operations | ||||||
| - Implementing cancellation for restore operations (future work) | ||||||
|
|
||||||
|
|
||||||
| ## High-Level Design | ||||||
| The solution introduces a new `cancel` boolean field to the backup specification that users can set to `true` to request cancellation. | ||||||
|
||||||
| Existing controllers (backup_controller, backup_operations_controller, backup_finalizer_controller) will check for this field and transition the backup to a `Cancelling` phase before returning early from their reconcile loops. | ||||||
|
|
||||||
| A new dedicated backup cancellation controller will watch for backups in the `Cancelling` phase and coordinate the actual cancellation work. | ||||||
|
||||||
| This controller will call `Cancel()` methods on all in-progress BackupItemAction operations (which automatically handles DataUpload cancellation), directly cancel PodVolumeBackups by setting their cancel flags, and finally transition the backup to `Cancelled` phase. | ||||||
|
||||||
| The design uses a 5-second ticker to prevent API overload and ensures clean separation between cancellation detection and execution. | ||||||
|
|
||||||
| ## Detailed Design | ||||||
|
|
||||||
| ### API Changes | ||||||
| Add a new field to `BackupSpec`: | ||||||
| ```go | ||||||
| type BackupSpec struct { | ||||||
| // ... existing fields ... | ||||||
|
|
||||||
| // Cancel indicates whether the backup should be cancelled. | ||||||
| // When set to true, Velero will attempt to cancel all ongoing operations | ||||||
| // and transition the backup to Cancelled phase. | ||||||
| // +optional | ||||||
| Cancel *bool `json:"cancel,omitempty"` | ||||||
| } | ||||||
| ``` | ||||||
|
|
||||||
| Add new backup phases to `BackupPhase`: | ||||||
| ```go | ||||||
| const ( | ||||||
| // ... existing phases ... | ||||||
| BackupPhaseCancelling BackupPhase = "Cancelling" | ||||||
| BackupPhaseCancelled BackupPhase = "Cancelled" | ||||||
| ) | ||||||
| ``` | ||||||
|
|
||||||
| ### Controller Changes | ||||||
|
|
||||||
| #### Existing Controllers | ||||||
| Modify `backup_controller.go`, `backup_operations_controller.go`, and `backup_finalizer_controller.go` to check for cancellation: | ||||||
| ```go | ||||||
| // Early in each Reconcile method | ||||||
| if backup.Spec.Cancel != nil && *backup.Spec.Cancel { | ||||||
| if backup.Status.Phase != BackupPhaseCancelling && backup.Status.Phase != BackupPhaseCancelled { | ||||||
| backup.Status.Phase = BackupPhaseCancelling | ||||||
| // Update backup and return | ||||||
| return ctrl.Result{}, c.Client.Patch(ctx, backup, client.MergeFrom(original)) | ||||||
| } | ||||||
| return ctrl.Result{}, nil // Skip processing for cancelling/cancelled backups | ||||||
| } | ||||||
| ``` | ||||||
| In addition, the `backup_operations_controller.go` will have a periodic check around backup progress updates, rather than running every time progress is updated to reduce API load. | ||||||
|
||||||
|
|
||||||
| #### New Backup Cancellation Controller | ||||||
| Create `backup_cancellation_controller.go`: | ||||||
| ```go | ||||||
| type backupCancellationReconciler struct { | ||||||
| client.Client | ||||||
| logger logrus.FieldLogger | ||||||
| itemOperationsMap *itemoperationmap.BackupItemOperationsMap | ||||||
| newPluginManager func(logger logrus.FieldLogger) clientmgmt.Manager | ||||||
| backupStoreGetter persistence.ObjectBackupStoreGetter | ||||||
| } | ||||||
| ``` | ||||||
|
|
||||||
| The controller will: | ||||||
| 1. Watch for backups in `BackupPhaseCancelling` | ||||||
| 2. Get operations from `itemOperationsMap.GetOperationsForBackup()` | ||||||
| 3. Call `bia.Cancel(operationID, backup)` on all in-progress BackupItemAction operations | ||||||
| 4. Find and cancel PodVolumeBackups by setting `pvb.Spec.Cancel = true` | ||||||
|
||||||
| 5. Wait for all cancellations to complete | ||||||
|
||||||
| 6. Set backup phase to `BackupPhaseCancelled` | ||||||
| 7. Update backup metadata in object storage | ||||||
|
|
||||||
| ### Cancellation Flow | ||||||
|
|
||||||
| #### BackupItemAction Operations | ||||||
| For operations with BackupItemAction v2 implementations (e.g., CSI PVC actions): | ||||||
|
||||||
| 1. Controller calls `bia.Cancel(operationID, backup)` | ||||||
|
||||||
| if operation.Status.Created.Time.Add(backup.Spec.ItemOperationTimeout.Duration).Before(time.Now()) { | |
| _ = bia.Cancel(operation.Spec.OperationID, backup) |
Could fire-and-forget
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, in this case, we need to rely on the existing logics to wait the action complete or timeout. So we need to cover the case in the design.
Outdated
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What about other backup operations, e.g, hooks, native snapshot?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We probably need to check status before invoking the native snapshot, and in any polling code waiting for completion. But there is no cancel call in the native snapshot plugin API, so we may need to wait for any already-started snapshots to complete and document this behavior.
For hooks, as with item backup, we need to check for cancellation before invoking a hook. but we don't need to deal with in-progress hooks since those aren't long-running operations and they're done synchronously.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
OK. So let's add some workflow in the design to elaborate on how we add this checkpoints.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The way I see it cancellation just stops sending ItemBlocks to the worker pool i.e. it's more of an early exit, and then setting PVB spec cancel fields to true
BackupWithResolvers would return and so the hooks would run
Native snapshots are synchronous, so we can't cancel them, only avoid running if we see cancel is true by early exit
Outdated
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The other reason to reject this is that deleting an object while another controller is still operating on it could lead to nondeterministic behavior.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think that is just apart of Kubernetes, if the controller had to process something before deletion, then they need to add a finalizer.
What information will be on the cancelled backup that should be inspected, what is the user going to get from the canceled object that is useful.
If we are keeping cancelled backups, do we need some sort of pruning from object storage, I could imagine this list of cancelled backups could grow pretty fast.
If we do keep the backup, what other checks need to be put in place to make sure that you can't use a cancelled backup in restore?
In a similar vein, is there something that needs to be done differently on a delete of a cancelled backup?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Cancelled backups may or may not have anything in object storage, but pruning object storage is already part of backup deletion. One option may be to have a separate (shorter) expiration cycle for cancelled backups. i.e. "completed backups have a ttl of 30 days, cancelled backups have a ttl of 5 days" -- then regular gc would clean up cancelled backups, they'll just be cleaned up sooner than completed ones.
Deleting a cancelled backup should be identical to deleting a completed backup, except that we may not have all of the components in object storage. i.e. if we cancel before uploading anything, there's nothing in object storage. If we cancel after the tarball and some data uploads have completed, then we have most of the s3 contents to delete. But basically, deleting a cancelled backup is just like deleting a regular backup as long as we don't error out if some expected object storage piece isn't there.
Outdated
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Also, we already have timeout-based backup failure (both PVB and async BIAs have timeouts that trigger failure. That should be sufficient for timeout-based behavior.
Outdated
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Note that between Phase 1 and phase 2, "Cancelling" is essentially a terminal phase. Since we don't plan on having a release with only phase 1 in it, this should be fine.
Outdated
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you expand on this further?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Backup deletion controller currently just bails if you try to delete an InProgress backup. The goal here is if a user creates a DeleteBackupRequest for a running backup, the backup deletion controller would make the cancel call on the backup and requeue with backoff. Once the backup has moved to the terminal state (based on all of the above here), then the deletion controller can perform its usual delete action.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This provides 2 ways of using cancellation, depending on what the user needs:
- cancel and immediate deletion: No need to set the spec field manually, just do the normal delete operation -- i.e. running
velero delete backupwhich creates a DeleteBackupRequest, which then cancels the backup, and then deletes once cancellation has happened - Cancel without immediate delete. User just sets the spec field to cancel
Use case for 1) is "I didn't mean to back up that namespace with 1000 big volumes in it"
Use case for 2) is "I'm seeing lots of errors, there's no point in letting this go any further, but I want to examine the errors before deletion" -- basically the same use case as today's "wait for the backup to fail and then check out the errors, etc."
Uh oh!
There was an error while loading. Please reload this page.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we have 2 choices for BSL metadata:
The second option has a couple problems. First, we need this logic both in the backup controller and in the backup operations controller. Second, we have to duplicate logic from the deletion controller. I think the shorter ttl may be the best approach.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Using backup ttl will involve UX level concept into Velero's internal implementation logics. E.g., though unlikely, literally, users could change the ttl after Velero's cancellation logics set one.
Alternatively and similarly, how about we have the backupDeletion controller to directly check backups in
Cancelledphase, once found, run the same backup deletion workflow?There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@Lyndon-Li If cancellation is initiated by the deletion controller because the user ran
velero backup deleteon a running backup, then yes, the deletion workflow will automatically remove the backup upon cancel. However, if the user is just trying to cancel (and not delete) the backup, then we do not want to run the deletion workflow immediately. We should either wait for normal expiration (or a subsequent user-initiated deletion request), or we could add a new cancellation ttl that's shorter.We can't assume that it's safe to delete the backup immediately. One cancellation use case is that a large backup is already generating numerous errors -- for example, DataUpload failures or something similar -- so the user does not want to wait for the backup to complete, but they want to be able to diagnose the failures. In that case, they'd want to cancel the backup, then look into the failures, etc. (possibly generate a debug bundle or something similar), and then delete the backup. If velero always deletes the backup immediately after cancellation, they won't be able to do that.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I agree for the backup itself (backup CR and other CRs it associates to, e.g., DataUpload, PodVolumeBackup), but it is not true for the backup data.
After backup is cancelled, the backup data is meaningless so it is not reasonable to leave the data to take the storage space, especially for some large backups which would possibly bring noticeable costs.
So I am wondering if it is a good idea to let the backupDeleteionController to delete the backup data in the storage/repository immediately after the backup gets to
Cancelled.There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@Lyndon-Li hmm. That makes some sense, but I'm not sure how we'd do that. Currently the BackupDeletionController only reconciles if there's a DeleteBackupRequest. We'd need to modify it to reconcile Backups too (i.e. the object type passed in would be one or the other, so the controller would need to deal with both) -- if a deletion request, follow current logic, if a backup, then ignore unless cancelled.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, maybe we need to add another controller, so as to make the backupDeletionController clean, but we for sure need to share the common functions between the two.
Adding another controller is fine for this purpose because it doesn't change the state of the backup CRs, it just monitor and react.
Anyway, I think it is not a good idea to use the UX level interfaces to complete the internal workflows, e.g., use the TTL mechanism or creating a deleteBackupRequest.