|
| 1 | + |
| 2 | +# Backup Cancellation Design |
| 3 | + |
| 4 | +## Abstract |
| 5 | +This proposal introduces user-initiated backup cancellation functionality to Velero, allowing users to abort running backups through a new `cancel` field in the backup specification. |
| 6 | +The design addresses GitHub issues #9189 and #2098 by providing a mechanism to cleanly cancel async operations and prevent resource leaks when backups need to be terminated. |
| 7 | + |
| 8 | +## Background |
| 9 | +Currently, Velero lacks the ability to cancel running backups, leading to several critical issues. |
| 10 | +When users accidentally submit broad backup jobs (e.g., by forgetting to narrow resource selectors), the system becomes blocked and scheduled jobs accumulate. |
| 11 | +Additionally, the backup deletion controller doesn't prevent running backups from being deleted, so async operations (DataUpload, PodVolumeBackup, itemBlock processing) continue running, unaware that the backup is gone. The result is resource contention and leaked, incomplete backup data. |
| 12 | + |
| 13 | +This problem is particularly acute in environments with frequent scheduled backups or large-scale backup operations that may run for extended periods. |
| 14 | +Users currently have no way to abort problematic backups other than restarting the Velero server, which affects all ongoing operations. |
| 15 | + |
| 16 | +## Goals |
| 17 | +- Enable users to cancel running backups through a `cancel` field in the backup specification |
| 18 | +- Cleanly cancel all associated async operations (BackupItemAction operations, DataUploads, PodVolumeBackups) |
| 19 | +- Prevent resource leaks and contention when backups are deleted or cancelled |
| 20 | +- Provide clear backup phase transitions (InProgress → Cancelling → Cancelled) |
| 21 | + |
| 22 | +## Non Goals |
| 23 | +- Cancelling backups that have already completed or failed |
| 24 | +- Rolling back partially completed backup operations |
| 25 | +- Implementing cancellation for restore operations (future work) |
| 26 | + |
| 27 | + |
| 28 | +## High-Level Design |
| 29 | +The solution introduces a new `cancel` boolean field to the backup specification that users can set to `true` to request cancellation. |
| 30 | +Existing controllers (backup_controller, backup_operations_controller, backup_finalizer_controller) will check for this field and transition the backup to a `Cancelling` phase before returning early from their reconcile loops. |
| 31 | + |
| 32 | +A new dedicated backup cancellation controller will watch for backups in the `Cancelling` phase and coordinate the actual cancellation work. |
| 33 | +This controller will call `Cancel()` methods on all in-progress BackupItemAction operations (which automatically handle DataUpload cancellation), directly cancel PodVolumeBackups by setting their cancel flags, and finally transition the backup to `Cancelled` phase. |
| 34 | +The cancellation controller polls on a 5-second ticker to avoid overloading the API server, and the design keeps cancellation detection (existing controllers) cleanly separated from cancellation execution (the new controller). |
| 35 | + |
| 36 | +## Detailed Design |
| 37 | + |
| 38 | +### API Changes |
| 39 | +Add a new field to `BackupSpec`: |
| 40 | +```go |
| 41 | +type BackupSpec struct { |
| 42 | + // ... existing fields ... |
| 43 | + |
| 44 | + // Cancel indicates whether the backup should be cancelled. |
| 45 | + // When set to true, Velero will attempt to cancel all ongoing operations |
| 46 | + // and transition the backup to Cancelled phase. |
| 47 | + // +optional |
| 48 | + Cancel *bool `json:"cancel,omitempty"` |
| 49 | +} |
| 50 | +``` |
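
As a rough illustration of how a client might request cancellation, the sketch below builds a JSON merge-patch body that sets `spec.cancel` to `true`. The `backupPatch` and `backupSpecPatch` types and both helper functions are hypothetical stand-ins that mirror only the proposed field; they are not the real Velero API types.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// backupSpecPatch mirrors only the proposed `cancel` field. It is a local
// stand-in for illustration, not the real Velero BackupSpec.
type backupSpecPatch struct {
	Cancel *bool `json:"cancel,omitempty"`
}

// backupPatch wraps the spec so the marshalled body has the shape of a
// merge patch against a Backup object.
type backupPatch struct {
	Spec backupSpecPatch `json:"spec"`
}

// cancelPatch builds a JSON merge-patch body that sets spec.cancel to true.
func cancelPatch() ([]byte, error) {
	cancel := true
	return json.Marshal(backupPatch{Spec: backupSpecPatch{Cancel: &cancel}})
}

// cancelRequested treats a nil pointer (field unset) the same as false.
func cancelRequested(spec backupSpecPatch) bool {
	return spec.Cancel != nil && *spec.Cancel
}

func main() {
	body, err := cancelPatch()
	if err != nil {
		panic(err)
	}
	fmt.Println(string(body)) // {"spec":{"cancel":true}}
}
```

Using a pointer lets the API distinguish "never set" from an explicit `false`, while `omitempty` keeps the field out of serialized specs that never requested cancellation.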
| 51 | + |
| 52 | +Add new backup phases to `BackupPhase`: |
| 53 | +```go |
| 54 | +const ( |
| 55 | + // ... existing phases ... |
| 56 | + BackupPhaseCancelling BackupPhase = "Cancelling" |
| 57 | + BackupPhaseCancelled BackupPhase = "Cancelled" |
| 58 | +) |
| 59 | +``` |
| 60 | + |
| 61 | +### Controller Changes |
| 62 | + |
| 63 | +#### Existing Controllers |
| 64 | +Modify `backup_controller.go`, `backup_operations_controller.go`, and `backup_finalizer_controller.go` to check for cancellation: |
| 65 | +```go |
| 66 | +// Early in each Reconcile method |
| 67 | +if backup.Spec.Cancel != nil && *backup.Spec.Cancel { |
| 68 | + if backup.Status.Phase != BackupPhaseCancelling && backup.Status.Phase != BackupPhaseCancelled { |
| 69 | + original := backup.DeepCopy() // snapshot before mutating; base for the merge patch |
| 70 | + backup.Status.Phase = BackupPhaseCancelling |
| 71 | + return ctrl.Result{}, c.Client.Patch(ctx, backup, client.MergeFrom(original)) |
| 72 | + } |
| 73 | + return ctrl.Result{}, nil // Skip processing for cancelling/cancelled backups |
| 74 | +} |
| 75 | +``` |
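
The check above does not spell out what should happen when `cancel` is set on a backup that has already finished. One possible way to honor the Non Goals is to leave terminal phases untouched, as in the minimal sketch below. The phase type and constants are redeclared locally so the example compiles on its own, and `isTerminal` and `nextPhase` are hypothetical helpers. Treating `Completed`, `PartiallyFailed`, `Failed`, and `Cancelled` as the terminal set is an assumption, not something this proposal specifies.

```go
package main

import "fmt"

// BackupPhase and the constants below are local stand-ins so the sketch
// compiles on its own; only Cancelling and Cancelled are new in this proposal.
type BackupPhase string

const (
	BackupPhaseInProgress      BackupPhase = "InProgress"
	BackupPhaseCompleted       BackupPhase = "Completed"
	BackupPhasePartiallyFailed BackupPhase = "PartiallyFailed"
	BackupPhaseFailed          BackupPhase = "Failed"
	BackupPhaseCancelling      BackupPhase = "Cancelling"
	BackupPhaseCancelled       BackupPhase = "Cancelled"
)

// isTerminal reports whether a backup in phase p is finished and therefore
// should not be cancelled (assumed terminal set, see lead-in).
func isTerminal(p BackupPhase) bool {
	switch p {
	case BackupPhaseCompleted, BackupPhasePartiallyFailed, BackupPhaseFailed, BackupPhaseCancelled:
		return true
	default:
		return false
	}
}

// nextPhase returns the phase a controller would move the backup to, given
// the current phase and the value of spec.cancel (nil means unset).
func nextPhase(current BackupPhase, cancel *bool) BackupPhase {
	requested := cancel != nil && *cancel
	if !requested || isTerminal(current) || current == BackupPhaseCancelling {
		return current
	}
	return BackupPhaseCancelling
}

func main() {
	yes := true
	fmt.Println(nextPhase(BackupPhaseInProgress, &yes)) // Cancelling
	fmt.Println(nextPhase(BackupPhaseCompleted, &yes))  // Completed
	fmt.Println(nextPhase(BackupPhaseInProgress, nil))  // InProgress
}
```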
| 76 | + |
| 77 | +#### New Backup Cancellation Controller |
| 78 | +Create `backup_cancellation_controller.go`: |
| 79 | +```go |
| 80 | +type backupCancellationReconciler struct { |
| 81 | + client.Client |
| 82 | + logger logrus.FieldLogger |
| 83 | + itemOperationsMap *itemoperationmap.BackupItemOperationsMap |
| 84 | + newPluginManager func(logger logrus.FieldLogger) clientmgmt.Manager |
| 85 | + backupStoreGetter persistence.ObjectBackupStoreGetter |
| 86 | +} |
| 87 | +``` |
| 88 | + |
| 89 | +The controller will: |
| 90 | +1. Watch for backups in `BackupPhaseCancelling` |
| 91 | +2. Get operations from `itemOperationsMap.GetOperationsForBackup()` |
| 92 | +3. Call `bia.Cancel(operationID, backup)` on all in-progress BackupItemAction operations |
| 93 | +4. Find and cancel PodVolumeBackups by setting `pvb.Spec.Cancel = true` |
| 94 | +5. Wait for all cancellations to complete |
| 95 | +6. Set backup phase to `BackupPhaseCancelled` |
| 96 | +7. Update backup metadata in object storage |
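
Step 3 of the list above could be sketched roughly as follows. The `operation` struct and `canceller` interface are hypothetical, trimmed-down stand-ins for the tracked item operations and the BackupItemAction `Cancel` method; the real method also receives the backup object. The loop keeps going when one cancellation fails and returns the collected errors, which is only one possible policy for the partial-cancellation case listed under Open Issues.

```go
package main

import (
	"errors"
	"fmt"
)

// operation is a hypothetical, trimmed-down view of a tracked item operation.
type operation struct {
	ID         string
	InProgress bool
}

// canceller is a hypothetical stand-in for the Cancel method of a
// BackupItemAction v2 plugin.
type canceller interface {
	Cancel(operationID string) error
}

// cancelInProgress calls Cancel for every in-progress operation. It does not
// stop at the first failure; it returns the IDs that were cancelled and a
// joined error describing the ones that were not.
func cancelInProgress(ops []operation, c canceller) ([]string, error) {
	var cancelled []string
	var errs []error
	for _, op := range ops {
		if !op.InProgress {
			continue // already finished, nothing to cancel
		}
		if err := c.Cancel(op.ID); err != nil {
			errs = append(errs, fmt.Errorf("cancel %s: %w", op.ID, err))
			continue
		}
		cancelled = append(cancelled, op.ID)
	}
	return cancelled, errors.Join(errs...)
}

// fakeCanceller fails for exactly one operation ID; used for the demo only.
type fakeCanceller struct{ failFor string }

func (f fakeCanceller) Cancel(id string) error {
	if id == f.failFor {
		return errors.New("plugin unavailable")
	}
	return nil
}

func main() {
	ops := []operation{{"op-1", true}, {"op-2", false}, {"op-3", true}}
	done, err := cancelInProgress(ops, fakeCanceller{failFor: "op-3"})
	fmt.Println("cancelled:", done)
	fmt.Println("error:", err)
}
```

Returning both the successes and the aggregated error leaves the caller free to decide whether the backup ends up `Cancelled` or in some other phase.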
| 97 | + |
| 98 | +### Cancellation Flow |
| 99 | + |
| 100 | +#### BackupItemAction Operations |
| 101 | +For operations with BackupItemAction v2 implementations (e.g., CSI PVC actions): |
| 102 | +1. Controller calls `bia.Cancel(operationID, backup)` |
| 103 | +2. CSI PVC action finds associated DataUpload and sets `du.Spec.Cancel = true` |
| 104 | +3. Node-agent DataUpload controller handles actual cancellation |
| 105 | +4. Operation marked as `OperationPhaseCanceled` |
| 106 | + |
| 107 | +#### PodVolumeBackup Operations |
| 108 | +For PodVolumeBackups (which lack BackupItemAction implementations): |
| 109 | +1. Controller directly finds PVBs by backup UID label |
| 110 | +2. Sets `pvb.Spec.Cancel = true` on in-progress PVBs |
| 111 | +3. Node-agent PodVolumeBackup controller handles actual cancellation |
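
The selection logic in steps 1 and 2 might look like the sketch below. `podVolumeBackup` is a flattened, hypothetical stand-in for the real CRD type, the label key `velero.io/backup-uid` is assumed here, and treating only the `InProgress` phase as cancellable is likewise an assumption. A real controller would list PVBs through the Kubernetes client with a label selector and patch each object rather than mutate a slice.

```go
package main

import "fmt"

// backupUIDLabel is the label key assumed to tie a PVB to its backup.
const backupUIDLabel = "velero.io/backup-uid"

// podVolumeBackup is a flattened, hypothetical stand-in for the real CRD type.
type podVolumeBackup struct {
	Name   string
	Labels map[string]string
	Phase  string
	Cancel bool // stands in for pvb.Spec.Cancel
}

// markForCancel sets Cancel on every in-progress PVB that belongs to the
// given backup UID and returns how many PVBs it changed.
func markForCancel(pvbs []podVolumeBackup, backupUID string) int {
	changed := 0
	for i := range pvbs {
		p := &pvbs[i]
		if p.Labels[backupUIDLabel] != backupUID {
			continue // belongs to another backup
		}
		if p.Phase != "InProgress" || p.Cancel {
			continue // finished already, or cancellation already requested
		}
		p.Cancel = true
		changed++
	}
	return changed
}

func main() {
	pvbs := []podVolumeBackup{
		{Name: "pvb-1", Labels: map[string]string{backupUIDLabel: "uid-a"}, Phase: "InProgress"},
		{Name: "pvb-2", Labels: map[string]string{backupUIDLabel: "uid-a"}, Phase: "Completed"},
		{Name: "pvb-3", Labels: map[string]string{backupUIDLabel: "uid-b"}, Phase: "InProgress"},
	}
	fmt.Println("marked:", markForCancel(pvbs, "uid-a")) // marked: 1
}
```

Skipping PVBs whose flag is already set keeps the operation idempotent, so repeated passes of the cancellation controller do not issue redundant updates.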
| 112 | + |
| 113 | +### Timing and Frequency |
| 114 | +- Use 5-second ticker for cancellation controller to prevent API overload |
| 115 | +- Existing controllers check cancellation on every reconcile (event-driven) |
| 116 | +- No timeout for cancellation operations (rely on existing operation timeouts) |
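
A ticker-driven loop of the kind described above can be sketched as follows. `runEvery` and `demoTicks` are hypothetical helpers, and the demo uses a 10 ms interval so it finishes quickly; the proposal's interval is 5 seconds. How the real controller would wire the ticker into controller-runtime is not covered here.

```go
package main

import (
	"context"
	"fmt"
	"time"
)

// runEvery calls fn once per interval until ctx is done and returns the
// number of times fn ran.
func runEvery(ctx context.Context, interval time.Duration, fn func()) int {
	ticker := time.NewTicker(interval)
	defer ticker.Stop() // release ticker resources when the loop exits

	calls := 0
	for {
		select {
		case <-ctx.Done():
			return calls
		case <-ticker.C:
			fn()
			calls++
		}
	}
}

// demoTicks runs the loop for roughly 100 ms with a 10 ms interval.
func demoTicks() int {
	ctx, cancel := context.WithTimeout(context.Background(), 100*time.Millisecond)
	defer cancel()
	return runEvery(ctx, 10*time.Millisecond, func() {
		// A real controller would process backups in the Cancelling phase here.
	})
}

func main() {
	fmt.Println("ticks:", demoTicks())
}
```

Because work happens at most once per tick, the number of API calls is bounded by the interval rather than by how often backup objects change.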
| 117 | + |
| 118 | +## Alternatives Considered |
| 119 | + |
| 120 | +### Alternative 1: Immediate Cancellation in Existing Controllers |
| 121 | +Instead of a dedicated cancellation controller, existing controllers could immediately cancel operations when detecting the cancel flag. |
| 122 | +This was rejected because it would complicate the existing controller logic and make the cancellation process less observable and debuggable. |
| 123 | + |
| 124 | +### Alternative 2: Deletion-Based Cancellation |
| 125 | +Using backup deletion as the cancellation mechanism instead of a cancel field. |
| 126 | +This was rejected because it doesn't allow users to preserve the backup object for inspection after cancellation, and deletion has different semantic meaning. |
| 127 | + |
| 128 | +### Alternative 3: Timeout-Based Automatic Cancellation |
| 129 | +Automatically cancelling backups after a configurable timeout. |
| 130 | +This was considered out of scope for the initial implementation as it addresses a different use case than user-initiated cancellation. |
| 131 | + |
| 132 | +## Security Considerations |
| 133 | +The cancel field requires the same RBAC permissions as updating other backup specification fields. |
| 134 | +No additional security considerations are introduced as the cancellation mechanism reuses existing operation cancellation pathways that are already secured. |
| 135 | + |
| 136 | +## Compatibility |
| 137 | +The new `cancel` field is optional and defaults to nil (treated as false), ensuring backward compatibility with existing backup specifications. |
| 138 | +Existing backups will continue to work without modification. |
| 139 | +The new backup phases (`Cancelling`, `Cancelled`) are additive and don't affect existing phase transitions. |
| 140 | + |
| 141 | +## Implementation |
| 142 | +Implementation will be done incrementally in the following phases: |
| 143 | + |
| 144 | +**Phase 1**: API changes and basic cancellation detection |
| 145 | +- Add `cancel` field to BackupSpec |
| 146 | +- Add new backup phases |
| 147 | +- Update existing controllers to detect cancellation and transition to `Cancelling` phase |
| 148 | + |
| 149 | +**Phase 2**: Cancellation controller implementation |
| 150 | +- Implement backup cancellation controller |
| 151 | +- Add BackupItemAction operation cancellation |
| 152 | +- Add PodVolumeBackup direct cancellation |
| 153 | + |
| 154 | +**Phase 3**: Testing and refinement |
| 155 | +- Comprehensive end-to-end testing |
| 156 | +- Performance testing with cancellation controller ticker |
| 157 | +- Documentation and user guide updates |
| 158 | + |
| 159 | +Target timeline: 1-2 sprints for core implementation, with additional time for testing and documentation. |
| 160 | + |
| 161 | +## Open Issues |
| 162 | +- **PodVolumeBackup operation mapping**: Unlike DataUploads, which are created by BackupItemActions with operationIDs, PodVolumeBackups are created directly and don't have a clear mapping to backup item operations. The current approach of finding PVBs by backup UID label should work but needs validation. |
| 163 | +- **Partial cancellation handling**: Determining the appropriate backup phase when some operations cancel successfully while others fail to cancel requires further investigation. |