Skip to content
Open
Changes from 2 commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
154 changes: 154 additions & 0 deletions design/backup_cancellation.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,154 @@

# Backup Cancellation Design

## Abstract
This proposal introduces user-initiated backup cancellation functionality to Velero, allowing users to abort running backups through a new `cancel` field in the backup specification.
The design addresses GitHub issues [#9189](https://github.com/vmware-tanzu/velero/issues/9189
) and [#2098](https://github.com/vmware-tanzu/velero/issues/2098) by providing a mechanism to cleanly cancel async operations and prevent resource leaks when backups need to be terminated.

## Background
Currently, Velero lacks the ability to cancel running backups, leading to several critical issues.
When users accidentally submit broad backup jobs (e.g., forgot to narrow resource selectors), the system becomes blocked and scheduled jobs accumulate.
Additionally, the backup deletion controller prevents running backups from being deleted.


## Goals
- Enable users to cancel running backups through a `cancel` field in the backup specification
- Cleanly cancel all associated async operations (BackupItemAction operations, DataUploads, PodVolumeBackups)
- Provide clear backup phase transitions (InProgress → Cancelling → Cancelled)

## Non Goals
- Cancelling backups that have already completed or failed
- Rolling back partially completed backup operations
- Implementing cancellation for restore operations (future work)


## High-Level Design
Copy link
Contributor

@Lyndon-Li Lyndon-Li Sep 19, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

  1. Please add a diagram to describe the phase transition of backup CR after cancellation is added.
  2. Please add more details to elaborate on how to deal with the data that have already been uploaded to the BSL when a backup is cancelled.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we have 2 choices for BSL metadata:

  1. Rely on normal gc for expired backups, and perhaps facilitate this by having a shorter ttl for cancelled backups. For example, 30 day ttl for completed backups, 5 day (or shorter) ttl for cancelled backups.
  2. Clean up within the backup controller.

The second option has a couple problems. First, we need this logic both in the backup controller and in the backup operations controller. Second, we have to duplicate logic from the deletion controller. I think the shorter ttl may be the best approach.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Using backup ttl will involve UX level concept into Velero's internal implementation logics. E.g., though unlikely, literally, users could change the ttl after Velero's cancellation logics set one.
Alternatively and similarly, how about we have the backupDeletion controller to directly check backups in Cancelled phase, once found, run the same backup deletion workflow?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@Lyndon-Li If cancellation is initiated by the deletion controller because the user ran velero backup delete on a running backup, then yes, the deletion workflow will automatically remove the backup upon cancel. However, if the user is just trying to cancel (and not delete) the backup, then we do not want to run the deletion workflow immediately. We should either wait for normal expiration (or a subsequent user-initiated deletion request), or we could add a new cancellation ttl that's shorter.

We can't assume that it's safe to delete the backup immediately. One cancellation use case is that a large backup is already generating numerous errors -- for example, DataUpload failures or something similar -- so the user does not want to wait for the backup to complete, but they want to be able to diagnose the failures. In that case, they'd want to cancel the backup, then look into the failures, etc. (possibly generate a debug bundle or something similar), and then delete the backup. If velero always deletes the backup immediately after cancellation, they won't be able to do that.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

they'd want to cancel the backup, then look into the failures, etc. (possibly generate a debug bundle or something similar), and then delete the backup. If velero always deletes the backup immediately after cancellation, they won't be able to do that.

I agree for the backup itself (backup CR and other CRs it associates to, e.g., DataUpload, PodVolumeBackup), but it is not true for the backup data.
After backup is cancelled, the backup data is meaningless so it is not reasonable to leave the data to take the storage space, especially for some large backups which would possibly bring noticeable costs.

So I am wondering if it is a good idea to let the backupDeleteionController to delete the backup data in the storage/repository immediately after the backup gets to Cancelled.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@Lyndon-Li hmm. That makes some sense, but I'm not sure how we'd do that. Currently the BackupDeletionController only reconciles if there's a DeleteBackupRequest. We'd need to modify it to reconcile Backups too (i.e. the object type passed in would be one or the other, so the controller would need to deal with both) -- if a deletion request, follow current logic, if a backup, then ignore unless cancelled.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, maybe we need to add another controller, so as to make the backupDeletionController clean, but we for sure need to share the common functions between the two.
Adding another controller is fine for this purpose because it doesn't change the state of the backup CRs, it just monitor and react.

Anyway, I think it is not a good idea to use the UX level interfaces to complete the internal workflows, e.g., use the TTL mechanism or creating a deleteBackupRequest.

The solution introduces a new `cancel` boolean field to the backup specification that users can set to `true` to request cancellation.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why a spec value and not use a the finalizer and a delete on the backup?

If a back up is canceled are we considering that this backup should not be removed?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't necessarily think we want immediate deletion. Just like we don't immediately delete failed backups or failed validation. One valid use case is that I ran a big backup and I'm seeing my datauploads all failing consistently, so I cancel it, but I want to debug it before deleting the backup. If canceling nukes the backup immediately, then I lose that ability. What about just setting a shorter ttl upon cancellation? Canceled backups are deleted in a few days by default, users can make them delete in an hour, for example, if they want to via config (shorter than that won't make sense, since I think the GC controller only runs hourly).

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I also have some doubt whether we wanna expose the "cancel" option to user and allow the user to cancel the backup, and keeps it in the object storage as "cancelled".

When reviewing issue #9189, I used to think the cancellation should only be triggered when the user deletes a backup.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@reasonerjt One reason we want cancellation outside of delete is for debugging. If an in-progress backup is already accumulating errors, a user may want to cancel the backup (since they need to fix something), but they may not want to delete it right away, as they'll lose the ability to inspect the CRs for the failed backup.

Since cancel will need to be implemented to allow deleting running backups anyway, and the cancel functionality will be largely separated from delete, I don't see any reason not to expose it to the user. I know we've had requests for cancellation in the past. If we don't expose it now, we'll continue to get requests for it. I don't see any downside here. We have the functionality anyway, why not let the user cancel? It seems to be an expected operation.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It seems like we're balancing two important, competing user needs:

  1. The ability to cancel a backup but keep the CR for post-mortem debugging
  2. The need to avoid orphaned data and unnecessary costs in object storage from the incomplete backup

What if we separate the lifecycle of the backup CR from the lifecycle of its data in object storage? A possible flow could be:

  1. A user initiates a cancellation by setting spec.cancel: true.
  2. The backup controller transitions the backup to a Cancelling phase and gracefully stops all ongoing operations.
  3. Once all operations are terminated, the controller immediately triggers a cleanup of the associated data in the backup storage location. This could be handled by a dedicated cleanup controller or by extending the backupDeletionController to watch for backups in the Cancelled phase.
  4. After the storage cleanup is complete, the backup CR itself is moved to the final Cancelled phase. The CR remains in the cluster for inspection, but its associated storage artifacts are gone. We could even add a status condition like BackupDataDeleted to make this explicit.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we should keep the backup logs

Agree, we should keep the backup associated data generated by Velero itself (non backup data). Their lifecycle should be the same as the the backup CRs in the cluster.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

To conclude:

  • For backup data, they should be deleted as the process of Cancel operation. In another word, once the backup is set as Cancelled, the backup data should have been deleted. We need a new functionality to make it, the existing TTL should not and can not be used.
  • For backup associated data, they should be deleted along with the backup CRs. So the existing TTL mechanism could cover it.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So we need a phase marking cancel is starting but not done deleting data? that way we end up with Cancelled=deleted.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@kaovilai That's the "cancelling" phase. i.e. backup controller marks backup "Cancelling", exits, then cancel controller does its thing and marks it "Cancelled"

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Documented the phases the backup_cancellation_controller would go through
So starts at BackupPhaseCancelling

When async cancels and deletion is done, sets to terminal phase BackupPhaseCancelled

Existing controllers (backup_controller, backup_operations_controller, backup_finalizer_controller) will check for this field and transition the backup to a `Cancelling` phase before returning early from their reconcile loops.

A new dedicated backup cancellation controller will watch for backups in the `Cancelling` phase and coordinate the actual cancellation work.
Copy link
Contributor

@Lyndon-Li Lyndon-Li Sep 19, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why do we need a dedicated controller?
Logically, cancellation are phases of the backup CR as same as other phases, e.g., new, failed, inProgress, etc. It is more reasonable to process them all together.
Practically, if we allow multiple controllers to process the same backup CR, we must pay attention on the contest resolving, which is not necessary.

As existing examples, the controllers for PVB, PVR, DataUpload, DataDownload, handle cancellation in the same controller and reconciler.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We already have more than one controller for the backup. There's the backup reconciler and the backup operations reconciler. But I think you're right -- we don't necessarily need to create a completely new one. We may not even need both "Canceling" and "Cancelled" phases -- basically the existing controllers see "cancel=true", stop work, cancel datauploads, set phase="cancelled"

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The dedicated controller could be avoid if the following are handled within the same controllers:

  1. async op cancellation: backup_operations_controller could do this
  2. backup data deletion: backup_deletion_controller would need to be invoked by creating a special BackupDeleteRequest that doesn't delete the logs and the backup resource itself

Would mean that multiple points in other controllers would need to do step 2 if cancel is true

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Cancel action and cancelled status is more suitable to be part of the primitive state machine for the backup CR, so the processing is better to be together with other primitive states, which is helpful for code understanding and code management.

Async operation is not with a action-state pattern and backup deletion controller is not for backup CR but for deleteBackupRequest CR.

This controller will call `Cancel()` methods on all in-progress BackupItemAction operations (which automatically handles DataUpload cancellation), directly cancel PodVolumeBackups by setting their cancel flags, and finally transition the backup to `Cancelled` phase.
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is specifically related to the asynchronous phase of BIAs, since the Execute() calls will no longer be in progress once the backup reconciler exits. I would replace "on all in-progress BackupItemOperations" with "on all incomplete asynchronous BackupItemAction operations".

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also, since PodVolumeBackups are handled synchronously, I don't think any of these will still be InProgress by the time the backup gets to the Cancelling" phase. Within the item backupper, upon completing an ItemBlock, we call waitUntilPVBsProcessed, checking PVB status every 5 seconds. You probably need a cancellation check here and if cancelled, set cancelled on PVB and exit the ItemBlock backup. So PVB cancellation will happen in the backup reconciler rather than in the cancellation reconciler.

Copy link
Contributor Author

@Joeavaikath Joeavaikath Oct 7, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Within BackupWithResolvers, when we detect cancellation we break out of the Item -> ItemBlock processing loop
I then plan to get and set the cancel spec on PodVolumeBackups associated with the backup within BackupWIthResolvers when cancellation is detected
waitUntilPVBsProcessed checks for either completion or cancellation phase

The design uses a 5-second ticker to prevent API overload and ensures clean separation between cancellation detection and execution.

## Detailed Design

### API Changes
Add a new field to `BackupSpec`:
```go
type BackupSpec struct {
// ... existing fields ...

// Cancel indicates whether the backup should be cancelled.
// When set to true, Velero will attempt to cancel all ongoing operations
// and transition the backup to Cancelled phase.
// +optional
Cancel *bool `json:"cancel,omitempty"`
}
```

Add new backup phases to `BackupPhase`:
```go
const (
// ... existing phases ...
BackupPhaseCancelling BackupPhase = "Cancelling"
BackupPhaseCancelled BackupPhase = "Cancelled"
)
```

### Controller Changes

#### Existing Controllers
Modify `backup_controller.go`, `backup_operations_controller.go`, and `backup_finalizer_controller.go` to check for cancellation:
```go
// Early in each Reconcile method
if backup.Spec.Cancel != nil && *backup.Spec.Cancel {
if backup.Status.Phase != BackupPhaseCancelling && backup.Status.Phase != BackupPhaseCancelled {
backup.Status.Phase = BackupPhaseCancelling
// Update backup and return
return ctrl.Result{}, c.Client.Patch(ctx, backup, client.MergeFrom(original))
}
return ctrl.Result{}, nil // Skip processing for cancelling/cancelled backups
}
```
In addition, the `backup_operations_controller.go` will have a periodic check around backup progress updates, rather than running every time progress is updated to reduce API load.
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Backup operations controller reconciles once every 10 seconds. It should be fine to check for cancellation here on every reconcile.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, an API call to get the Backup resource every 10 seconds should not be an issue
If true set phase to BackupPhaseCancelling and backup_cancellation_controller takes it up


#### New Backup Cancellation Controller
Create `backup_cancellation_controller.go`:
```go
type backupCancellationReconciler struct {
client.Client
logger logrus.FieldLogger
itemOperationsMap *itemoperationmap.BackupItemOperationsMap
newPluginManager func(logger logrus.FieldLogger) clientmgmt.Manager
backupStoreGetter persistence.ObjectBackupStoreGetter
}
```

The controller will:
1. Watch for backups in `BackupPhaseCancelling`
2. Get operations from `itemOperationsMap.GetOperationsForBackup()`
3. Call `bia.Cancel(operationID, backup)` on all in-progress BackupItemAction operations
4. Find and cancel PodVolumeBackups by setting `pvb.Spec.Cancel = true`
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We will probably skip 4, since those run synchronously, so we can cancel those during the wait cycle in the item backupper.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Understood, post hook handling means that if cancellation isn't done before the BackupWithResolvers call, then all ItemBlocks and Pod volume backups will be processed
all or nothing
cancellation after BackupWithResolvers means that we only care about async BIAs

5. Wait for all cancellations to complete
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure how long DataUpload cancel normally takes. We can't guarantee that third-party plugins will handle Cancel gracefully, though, so we may not want to wait after cancel until the operation registers completion.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If that is the case, then you will need some way to tell the user that the backup is still cancelling and point to the DataUpload that is taking time IMO

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@shawn-hurley what about other async operations from external plugins? There's no guarantee that they even support cancellation. From the backup's point of view, we've called cancel, we can move on. I guess we have a few options:

  1. fire and forget. Call cancel, move on and cancel the backup, and maybe it's OK that the backup is cancelled while things like datauploads or external 3rd party actions are still cancelling.
  2. call cancel and block until the Progress call confirms that the operation has completed. This means that, for example, an InProgress backup would remain in progress until every cancelled data upload stops running, and for 3rd party plugins which can't actually force a cancel, waiting until they complete normally
  3. Similar to 2, but with a short-ish timeout. And yes, we could probably add an error to the backup for cancellation timeout. How long should we wait, though? 1 minute? 5?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm leaning towards 1
It's about best effort
Trying to figure out what cleaning up completed DataUploads looks like

6. Set backup phase to `BackupPhaseCancelled`
7. Update backup metadata in object storage

### Cancellation Flow

#### BackupItemAction Operations
For operations with BackupItemAction v2 implementations (e.g., CSI PVC actions):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What about BIA V1?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

They don't have any operations to cancel. Once the item backupper completes, it's done. We should check status.cancel before backing up an item (but only make apiserver calls to refresh the backup periodically, not on every item) -- and if cancel is true, then we return immediately and don't call the BIA at all. For an item that is currently being backed up when we refresh to get cancel=true, we don't need to do any action. Wait for it to complete (it's just one item)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK. So let's add some workflow in the design to elaborate on how we add this checkpoints.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added how this would be done within backup_controller -> BackupWithResolver which is where most synchronous Item Itemblock processing occurs

1. Controller calls `bia.Cancel(operationID, backup)`
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

bia.Cancel may return an error which indicates the BIA doesn't support cancel, in this case, we still not some way to wait for its completion.

Copy link
Contributor Author

@Joeavaikath Joeavaikath Oct 7, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We do call this in a best-effort way when the ops timeout within backup_operations_controller for timeouts

if operation.Status.Created.Time.Add(backup.Spec.ItemOperationTimeout.Duration).Before(time.Now()) {
_ = bia.Cancel(operation.Spec.OperationID, backup)

Could fire-and-forget

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, in this case, we need to rely on the existing logics to wait the action complete or timeout. So we need to cover the case in the design.

2. CSI PVC action finds associated DataUpload and sets `du.Spec.Cancel = true`
3. Node-agent DataUpload controller handles actual cancellation
4. Operation marked as `OperationPhaseCanceled`

#### PodVolumeBackup Operations
For PodVolumeBackups (which lack BackupItemAction implementations):
1. Controller directly finds PVBs by backup UID label
2. Sets `pvb.Spec.Cancel = true` on in-progress PVBs
3. Node-agent PodVolumeBackup controller handles actual cancellation

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What about other backup operations, e.g, hooks, native snapshot?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We probably need to check status before invoking the native snapshot, and in any polling code waiting for completion. But there is no cancel call in the native snapshot plugin API, so we may need to wait for any already-started snapshots to complete and document this behavior.
For hooks, as with item backup, we need to check for cancellation before invoking a hook. but we don't need to deal with in-progress hooks since those aren't long-running operations and they're done synchronously.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK. So let's add some workflow in the design to elaborate on how we add this checkpoints.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The way I see it cancellation just stops sending ItemBlocks to the worker pool i.e. it's more of an early exit, and then setting PVB spec cancel fields to true
BackupWithResolvers would return and so the hooks would run
Native snapshots are synchronous, so we can't cancel them, only avoid running if we see cancel is true by early exit


## Alternatives Considered


### Alternative 1: Deletion-Based Cancellation
Using backup deletion as the cancellation mechanism instead of a cancel field.
This was rejected because it doesn't allow users to preserve the backup object for inspection after cancellation, and deletion has different semantic meaning.
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The other reason to reject this is that deleting an object while another controller is still operating on it could lead to nondeterministic behavior.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think that is just apart of Kubernetes, if the controller had to process something before deletion, then they need to add a finalizer.

What information will be on the cancelled backup that should be inspected, what is the user going to get from the canceled object that is useful.

If we are keeping cancelled backups, do we need some sort of pruning from object storage, I could imagine this list of cancelled backups could grow pretty fast.

If we do keep the backup, what other checks need to be put in place to make sure that you can't use a cancelled backup in restore?

In a similar vein, is there something that needs to be done differently on a delete of a cancelled backup?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Cancelled backups may or may not have anything in object storage, but pruning object storage is already part of backup deletion. One option may be to have a separate (shorter) expiration cycle for cancelled backups. i.e. "completed backups have a ttl of 30 days, cancelled backups have a ttl of 5 days" -- then regular gc would clean up cancelled backups, they'll just be cleaned up sooner than completed ones.

Deleting a cancelled backup should be identical to deleting a completed backup, except that we may not have all of the components in object storage. i.e. if we cancel before uploading anything, there's nothing in object storage. If we cancel after the tarball and some data uploads have completed, then we have most of the s3 contents to delete. But basically, deleting a cancelled backup is just like deleting a regular backup as long as we don't error out if some expected object storage piece isn't there.


### Alternative 2: Timeout-Based Automatic Cancellation
Automatically cancelling backups after a configurable timeout.
This was considered out of scope for the initial implementation as it addresses a different use case than user-initiated cancellation.
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also, we already have timeout-based backup failure (both PVB and async BIAs have timeouts that trigger failure. That should be sufficient for timeout-based behavior.


## Security Considerations
The cancel field requires the same RBAC permissions as updating other backup specification fields.
No additional security considerations are introduced as the cancellation mechanism reuses existing operation cancellation pathways that are already secured.

## Compatibility
The new `cancel` field is optional and defaults to nil/false, ensuring backward compatibility with existing backup specifications.
Existing backups will continue to work without modification.
The new backup phases (`Cancelling`, `Cancelled`) are additive and don't affect existing phase transitions.

## Implementation
Implementation will be done incrementally in the following phases:

**Phase 1**: API changes and basic cancellation detection
- Add `cancel` field to BackupSpec
- Add new backup phases
- Update existing controllers to detect cancellation and transition to `Cancelling` phase

**Phase 2**: Cancellation controller implementation
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Note that between Phase 1 and phase 2, "Cancelling" is essentially a terminal phase. Since we don't plan on having a release with only phase 1 in it, this should be fine.

- Implement backup cancellation controller
- Add BackupItemAction operation cancellation
- Add PodVolumeBackup direct cancellation

**Phase 3**: Testing and refinement
- Comprehensive end-to-end testing
- Testing if slowdowns occur due to the frequency of checking `backup.Cancel` spec field
- Documentation and user guide updates

**Future Work**:

- Replacing logic that blocks deletion of in-progress backups with the cancellation flow, followed by usual deletion for a terminal phase (Cancelled)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you expand on this further?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Backup deletion controller currently just bails if you try to delete an InProgress backup. The goal here is if a user creates a DeleteBackupRequest for a running backup, the backup deletion controller would make the cancel call on the backup and requeue with backoff. Once the backup has moved to the terminal state (based on all of the above here), then the deletion controller can perform its usual delete action.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This provides 2 ways of using cancellation, depending on what the user needs:

  1. cancel and immediate deletion: No need to set the spec field manually, just do the normal delete operation -- i.e. running velero delete backup which creates a DeleteBackupRequest, which then cancels the backup, and then deletes once cancellation has happened
  2. Cancel without immediate delete. User just sets the spec field to cancel

Use case for 1) is "I didn't mean to back up that namespace with 1000 big volumes in it"
Use case for 2) is "I'm seeing lots of errors, there's no point in letting this go any further, but I want to examine the errors before deletion" -- basically the same use case as today's "wait for the backup to fail and then check out the errors, etc."