|
| 1 | + |
| 2 | +# Backup Cancellation Design |
| 3 | + |
| 4 | +## Abstract |
| 5 | +This proposal introduces user-initiated backup cancellation functionality to Velero, allowing users to abort running backups through a new `cancel` field in the backup specification. |
| 6 | +The design addresses GitHub issues [#9189](https://github.com/vmware-tanzu/velero/issues/9189) and [#2098](https://github.com/vmware-tanzu/velero/issues/2098) by providing a mechanism to cleanly cancel async operations and prevent resource leaks when backups need to be terminated. |
| 8 | + |
| 9 | +## Background |
| 10 | +Currently, Velero lacks the ability to cancel running backups, leading to several critical issues. |
| 11 | +When users accidentally submit overly broad backup jobs (e.g., by forgetting to narrow resource selectors), the system becomes blocked and scheduled jobs accumulate. |
| 12 | +Additionally, the backup deletion controller prevents running backups from being deleted. |
| 13 | + |
| 14 | + |
| 15 | +## Goals |
| 16 | +- Enable users to cancel running backups through a `cancel` field in the backup specification |
| 17 | +- Cleanly cancel all associated async operations (BackupItemAction operations, DataUploads, PodVolumeBackups) |
| 18 | +- Provide clear backup phase transitions (InProgress → Cancelling → Cancelled) |
| 19 | + |
| 20 | +## Non Goals |
| 21 | +- Cancelling backups that have already completed or failed |
| 22 | +- Rolling back partially completed backup operations |
| 23 | +- Implementing cancellation for restore operations (future work) |
| 24 | + |
| 25 | + |
| 26 | +## High-Level Design |
| 27 | +The solution introduces a new `cancel` boolean field to the backup specification that users can set to `true` to request cancellation. |
| 28 | +Existing controllers (backup_controller, backup_operations_controller, backup_finalizer_controller) will check for this field and transition the backup to a `Cancelling` phase before returning early from their reconcile loops. |
| 29 | + |
| 30 | +A new dedicated backup cancellation controller will watch for backups in the `Cancelling` phase and coordinate the actual cancellation work. |
| 31 | +This controller will call `Cancel()` methods on all in-progress BackupItemAction operations (which automatically handles DataUpload cancellation), directly cancel PodVolumeBackups by setting their cancel flags, and finally transition the backup to `Cancelled` phase. |
| 32 | +The design uses a 5-second ticker to prevent API overload and ensures clean separation between cancellation detection and execution. |
| 33 | + |
| 34 | +## Detailed Design |
| 35 | + |
| 36 | +### API Changes |
| 37 | +Add a new field to `BackupSpec`: |
| 38 | +```go |
| 39 | +type BackupSpec struct { |
| 40 | + // ... existing fields ... |
| 41 | + |
| 42 | + // Cancel indicates whether the backup should be cancelled. |
| 43 | + // When set to true, Velero will attempt to cancel all ongoing operations |
| 44 | + // and transition the backup to Cancelled phase. |
| 45 | + // +optional |
| 46 | + Cancel *bool `json:"cancel,omitempty"` |
| 47 | +} |
| 48 | +``` |
| 49 | + |
| 50 | +Add new backup phases to `BackupPhase`: |
| 51 | +```go |
| 52 | +const ( |
| 53 | + // ... existing phases ... |
| 54 | + BackupPhaseCancelling BackupPhase = "Cancelling" |
| 55 | + BackupPhaseCancelled BackupPhase = "Cancelled" |
| 56 | +) |
| 57 | +``` |
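
As an illustration of how a client might request cancellation, the sketch below builds a JSON merge patch that sets `spec.cancel` to `true`. The helper name `cancelPatch` is hypothetical, and the choice of a merge patch is an assumption; this proposal does not prescribe how the field is set.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// cancelPatch builds a JSON merge patch that sets spec.cancel to true.
// Hypothetical helper: the design only defines the field, not how clients set it.
func cancelPatch() ([]byte, error) {
	patch := map[string]any{
		"spec": map[string]any{"cancel": true},
	}
	return json.Marshal(patch)
}

func main() {
	body, err := cancelPatch()
	if err != nil {
		panic(err)
	}
	fmt.Println(string(body))
}
```

Assuming Velero is installed in the `velero` namespace, the same patch could presumably be applied with `kubectl patch backup <name> -n velero --type merge -p '{"spec":{"cancel":true}}'`.
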
| 58 | + |
| 59 | +### Controller Changes |
| 60 | + |
| 61 | +#### Existing Controllers |
| 62 | +Modify `backup_controller.go`, `backup_operations_controller.go`, and `backup_finalizer_controller.go` to check for cancellation: |
| 63 | +```go |
| 64 | +// Early in each Reconcile method |
| 65 | +if backup.Spec.Cancel != nil && *backup.Spec.Cancel { |
| 66 | + if backup.Status.Phase != BackupPhaseCancelling && backup.Status.Phase != BackupPhaseCancelled { |
| 67 | + original := backup.DeepCopy() // snapshot taken before mutation; base for the merge patch |
| 68 | + backup.Status.Phase = BackupPhaseCancelling |
| 69 | + return ctrl.Result{}, c.Client.Patch(ctx, backup, client.MergeFrom(original)) |
| 70 | + } |
| 71 | + return ctrl.Result{}, nil // Skip processing for cancelling/cancelled backups |
| 72 | +} |
| 73 | +``` |
| 74 | +In addition, `backup_operations_controller.go` will check for cancellation periodically rather than on every progress update, to reduce API load. |
| 75 | + |
| 76 | +#### New Backup Cancellation Controller |
| 77 | +Create `backup_cancellation_controller.go`: |
| 78 | +```go |
| 79 | +type backupCancellationReconciler struct { |
| 80 | + client.Client |
| 81 | + logger logrus.FieldLogger |
| 82 | + itemOperationsMap *itemoperationmap.BackupItemOperationsMap |
| 83 | + newPluginManager func(logger logrus.FieldLogger) clientmgmt.Manager |
| 84 | + backupStoreGetter persistence.ObjectBackupStoreGetter |
| 85 | +} |
| 86 | +``` |
| 87 | + |
| 88 | +The controller will: |
| 89 | +1. Watch for backups in `BackupPhaseCancelling` |
| 90 | +2. Get operations from `itemOperationsMap.GetOperationsForBackup()` |
| 91 | +3. Call `bia.Cancel(operationID, backup)` on all in-progress BackupItemAction operations |
| 92 | +4. Find and cancel PodVolumeBackups by setting `pvb.Spec.Cancel = true` |
| 93 | +5. Wait for all cancellations to complete |
| 94 | +6. Set backup phase to `BackupPhaseCancelled` |
| 95 | +7. Update backup metadata in object storage |
| 96 | + |
| 97 | +### Cancellation Flow |
| 98 | + |
| 99 | +#### BackupItemAction Operations |
| 100 | +For operations with BackupItemAction v2 implementations (e.g., CSI PVC actions): |
| 101 | +1. Controller calls `bia.Cancel(operationID, backup)` |
| 102 | +2. CSI PVC action finds associated DataUpload and sets `du.Spec.Cancel = true` |
| 103 | +3. Node-agent DataUpload controller handles actual cancellation |
| 104 | +4. Operation marked as `OperationPhaseCanceled` |
| 105 | + |
| 106 | +#### PodVolumeBackup Operations |
| 107 | +For PodVolumeBackups (which lack BackupItemAction implementations): |
| 108 | +1. Controller directly finds PVBs by backup UID label |
| 109 | +2. Sets `pvb.Spec.Cancel = true` on in-progress PVBs |
| 110 | +3. Node-agent PodVolumeBackup controller handles actual cancellation |
| 111 | + |
| 112 | + |
| 113 | +## Alternatives Considered |
| 114 | + |
| 115 | + |
| 116 | +### Alternative 1: Deletion-Based Cancellation |
| 117 | +Using backup deletion as the cancellation mechanism instead of a cancel field. |
| 118 | +This was rejected because it doesn't allow users to preserve the backup object for inspection after cancellation, and deletion has a different semantic meaning. |
| 119 | + |
| 120 | +### Alternative 2: Timeout-Based Automatic Cancellation |
| 121 | +Automatically cancelling backups after a configurable timeout. |
| 122 | +This was considered out of scope for the initial implementation as it addresses a different use case than user-initiated cancellation. |
| 123 | + |
| 124 | +## Security Considerations |
| 125 | +The cancel field requires the same RBAC permissions as updating other backup specification fields. |
| 126 | +No additional security considerations are introduced as the cancellation mechanism reuses existing operation cancellation pathways that are already secured. |
| 127 | + |
| 128 | +## Compatibility |
| 129 | +The new `cancel` field is optional and defaults to nil/false, ensuring backward compatibility with existing backup specifications. |
| 130 | +Existing backups will continue to work without modification. |
| 131 | +The new backup phases (`Cancelling`, `Cancelled`) are additive and don't affect existing phase transitions. |
| 132 | + |
| 133 | +## Implementation |
| 134 | +Implementation will be done incrementally in the following phases: |
| 135 | + |
| 136 | +**Phase 1**: API changes and basic cancellation detection |
| 137 | +- Add `cancel` field to BackupSpec |
| 138 | +- Add new backup phases |
| 139 | +- Update existing controllers to detect cancellation and transition to `Cancelling` phase |
| 140 | + |
| 141 | +**Phase 2**: Cancellation controller implementation |
| 142 | +- Implement backup cancellation controller |
| 143 | +- Add BackupItemAction operation cancellation |
| 144 | +- Add PodVolumeBackup direct cancellation |
| 145 | + |
| 146 | +**Phase 3**: Testing and refinement |
| 147 | +- Comprehensive end-to-end testing |
| 148 | +- Testing whether the frequency of checking the `backup.Spec.Cancel` field causes slowdowns |
| 149 | +- Documentation and user guide updates |
| 150 | + |