Commit ade00a4

Add design doc for backup cancellation
1 parent 4535830 commit ade00a4

1 file changed: +8 -21 lines changed

design/backup_cancellation.md

Lines changed: 8 additions & 21 deletions
@@ -3,20 +3,18 @@
33

44
## Abstract
55
This proposal introduces user-initiated backup cancellation functionality to Velero, allowing users to abort running backups through a new `cancel` field in the backup specification.
6-
The design addresses GitHub issues #9189 and #2098 by providing a mechanism to cleanly cancel async operations and prevent resource leaks when backups need to be terminated.
6+
The design addresses GitHub issues [#9189](https://github.com/vmware-tanzu/velero/issues/9189
7+
) and [#2098](https://github.com/vmware-tanzu/velero/issues/2098) by providing a mechanism to cleanly cancel async operations and prevent resource leaks when backups need to be terminated.
78

89
## Background
910
Currently, Velero lacks the ability to cancel running backups, leading to several critical issues.
1011
When users accidentally submit broad backup jobs (e.g., forgot to narrow resource selectors), the system becomes blocked and scheduled jobs accumulate.
11-
Additionally, the backup deletion controller doesn't prevent running backups from being deleted, causing async operations (DataUpload, PodVolumeBackup, itemBlock processing) to continue running unaware of the backup deletion, resulting in resource contests and incomplete backup data leaks.
12+
Additionally, the backup deletion controller prevents running backups from being deleted.
1213

13-
This problem is particularly acute in environments with frequent scheduled backups or large-scale backup operations that may run for extended periods.
14-
Users currently have no way to abort problematic backups other than restarting the Velero server, which affects all ongoing operations.
1514

1615
## Goals
1716
- Enable users to cancel running backups through a `cancel` field in the backup specification
1817
- Cleanly cancel all associated async operations (BackupItemAction operations, DataUploads, PodVolumeBackups)
19-
- Prevent resource leaks and contests when backups are deleted or cancelled
2018
- Provide clear backup phase transitions (InProgress → Cancelling → Cancelled)
2119

2220
## Non Goals
@@ -30,7 +28,7 @@ The solution introduces a new `cancel` boolean field to the backup specification
3028
Existing controllers (backup_controller, backup_operations_controller, backup_finalizer_controller) will check for this field and transition the backup to a `Cancelling` phase before returning early from their reconcile loops.
3129

3230
A new dedicated backup cancellation controller will watch for backups in the `Cancelling` phase and coordinate the actual cancellation work.
33-
This controller will call `Cancel()` methods on all in-progress BackupItemAction operations (which automatically handle DataUpload cancellation), directly cancel PodVolumeBackups by setting their cancel flags, and finally transition the backup to `Cancelled` phase.
31+
This controller will call `Cancel()` methods on all in-progress BackupItemAction operations (which automatically handles DataUpload cancellation), directly cancel PodVolumeBackups by setting their cancel flags, and finally transition the backup to `Cancelled` phase.
3432
The design uses a 5-second ticker to prevent API overload and ensures clean separation between cancellation detection and execution.
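
The detection step described above can be sketched roughly as follows. This is a minimal illustration, not the proposed implementation: the `Backup`, `BackupSpec`, and `BackupStatus` types are simplified stand-ins rather than the real Velero API types, and `shouldSkipReconcile` is a hypothetical helper name. It assumes that `Spec.Cancel` is a `*bool` and that a controller only needs to flip the phase to `Cancelling` and return early, leaving the actual cancellation work to the dedicated controller.

```go
package main

import "fmt"

// BackupPhase uses the phase names from this design. All types below are
// simplified stand-ins for illustration, not the real Velero API types.
type BackupPhase string

const (
	PhaseInProgress BackupPhase = "InProgress"
	PhaseCancelling BackupPhase = "Cancelling"
	PhaseCancelled  BackupPhase = "Cancelled"
)

type BackupSpec struct {
	Cancel *bool // nil means "not set"
}

type BackupStatus struct {
	Phase BackupPhase
}

type Backup struct {
	Spec   BackupSpec
	Status BackupStatus
}

// shouldSkipReconcile (hypothetical helper) reports whether an existing
// controller should return early from its reconcile loop. When cancellation
// was requested and the backup is not yet Cancelling or Cancelled, it moves
// the backup to Cancelling first. Other terminal phases are not modeled here.
func shouldSkipReconcile(b *Backup) bool {
	if b.Spec.Cancel == nil || !*b.Spec.Cancel {
		return false
	}
	if b.Status.Phase != PhaseCancelling && b.Status.Phase != PhaseCancelled {
		b.Status.Phase = PhaseCancelling
	}
	return true
}

func main() {
	cancel := true
	b := &Backup{
		Spec:   BackupSpec{Cancel: &cancel},
		Status: BackupStatus{Phase: PhaseInProgress},
	}
	fmt.Println(shouldSkipReconcile(b), b.Status.Phase)
}
```

Keeping the existing controllers limited to this check is what gives the separation between detection and execution: they never call `Cancel()` themselves.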
3533

3634
## Detailed Design
@@ -73,6 +71,7 @@ if backup.Spec.Cancel != nil && *backup.Spec.Cancel {
7371
return ctrl.Result{}, nil // Skip processing for cancelling/cancelled backups
7472
}
7573
```
74+
In addition, the `backup_operations_controller.go` will have a periodic check around backup progress updates, rather than running every time progress is updated to reduce API load.
7675

7776
#### New Backup Cancellation Controller
7877
Create `backup_cancellation_controller.go`:
@@ -110,22 +109,15 @@ For PodVolumeBackups (which lack BackupItemAction implementations):
110109
2. Sets `pvb.Spec.Cancel = true` on in-progress PVBs
111110
3. Node-agent PodVolumeBackup controller handles actual cancellation
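
The flag-setting step for PodVolumeBackups might look roughly like the sketch below. It is illustrative only: `PodVolumeBackup` here is a simplified stand-in for the real Velero type, `requestPVBCancel` is a hypothetical helper, the `Cancel` field stands in for `pvb.Spec.Cancel`, and matching PVBs to a backup by its UID is an assumption. The sketch only sets the flag; the node-agent controller remains responsible for the actual cancellation.

```go
package main

import "fmt"

// PVBPhase and PodVolumeBackup are simplified stand-ins for illustration,
// not the real Velero PodVolumeBackup API.
type PVBPhase string

const (
	PVBInProgress PVBPhase = "InProgress"
	PVBCompleted  PVBPhase = "Completed"
)

type PodVolumeBackup struct {
	Name      string
	BackupUID string   // assumed way of tying a PVB to its backup
	Phase     PVBPhase // current PVB phase
	Cancel    bool     // stands in for pvb.Spec.Cancel
}

// requestPVBCancel (hypothetical helper) sets the cancel flag on every
// in-progress PVB that belongs to the given backup and returns the names of
// the PVBs it flagged. PVBs that are finished or already flagged are skipped.
func requestPVBCancel(pvbs []*PodVolumeBackup, backupUID string) []string {
	var flagged []string
	for _, pvb := range pvbs {
		if pvb.BackupUID != backupUID || pvb.Phase != PVBInProgress || pvb.Cancel {
			continue
		}
		pvb.Cancel = true
		flagged = append(flagged, pvb.Name)
	}
	return flagged
}

func main() {
	pvbs := []*PodVolumeBackup{
		{Name: "pvb-a", BackupUID: "uid-1", Phase: PVBInProgress},
		{Name: "pvb-b", BackupUID: "uid-1", Phase: PVBCompleted},
		{Name: "pvb-c", BackupUID: "uid-2", Phase: PVBInProgress},
	}
	fmt.Println(requestPVBCancel(pvbs, "uid-1"))
}
```

Skipping already-flagged PVBs keeps the helper idempotent, so calling it again on a later tick does not issue redundant updates.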
112111

113-
### Timing and Frequency
114-
- Use 5-second ticker for cancellation controller to prevent API overload
115-
- Existing controllers check cancellation on every reconcile (event-driven)
116-
- No timeout for cancellation operations (rely on existing operation timeouts)
117112

118113
## Alternatives Considered
119114

120-
### Alternative 1: Immediate Cancellation in Existing Controllers
121-
Instead of a dedicated cancellation controller, existing controllers could immediately cancel operations when detecting the cancel flag.
122-
This was rejected because it would complicate the existing controller logic and make the cancellation process less observable and debuggable.
123115

124-
### Alternative 2: Deletion-Based Cancellation
116+
### Alternative 1: Deletion-Based Cancellation
125117
Using backup deletion as the cancellation mechanism instead of a cancel field.
126118
This was rejected because it doesn't allow users to preserve the backup object for inspection after cancellation, and deletion has different semantic meaning.
127119

128-
### Alternative 3: Timeout-Based Automatic Cancellation
120+
### Alternative 2: Timeout-Based Automatic Cancellation
129121
Automatically cancelling backups after a configurable timeout.
130122
This was considered out of scope for the initial implementation as it addresses a different use case than user-initiated cancellation.
131123

@@ -153,11 +145,6 @@ Implementation will be done incrementally in the following phases:
153145

154146
**Phase 3**: Testing and refinement
155147
- Comprehensive end-to-end testing
156-
- Performance testing with cancellation controller ticker
148+
- Testing if slowdowns occur due to the frequency of checking `backup.Cancel` spec field
157149
- Documentation and user guide updates
158150

159-
Target timeline: 1-2 sprints for core implementation, with additional time for testing and documentation.
160-
161-
## Open Issues
162-
- **PodVolumeBackup operation mapping**: Unlike DataUploads which are created by BackupItemActions with operationIDs, PodVolumeBackups are created directly and don't have a clear mapping to backup item operations. The current approach of finding PVBs by backup UID label should work but needs validation.
163-
- **Partial cancellation handling**: Determining the appropriate backup phase when some operations cancel successfully while others fail to cancel requires further investigation.
