design/backup_cancellation.md
+8 -21 (8 additions, 21 deletions)
@@ -3,20 +3,18 @@

## Abstract
This proposal introduces user-initiated backup cancellation functionality to Velero, allowing users to abort running backups through a new `cancel` field in the backup specification.
-The design addresses GitHub issues #9189 and #2098 by providing a mechanism to cleanly cancel async operations and prevent resource leaks when backups need to be terminated.
+The design addresses GitHub issues [#9189](https://github.com/vmware-tanzu/velero/issues/9189
+) and [#2098](https://github.com/vmware-tanzu/velero/issues/2098) by providing a mechanism to cleanly cancel async operations and prevent resource leaks when backups need to be terminated.

## Background
Currently, Velero lacks the ability to cancel running backups, leading to several critical issues.
When users accidentally submit broad backup jobs (e.g., forgot to narrow resource selectors), the system becomes blocked and scheduled jobs accumulate.
-Additionally, the backup deletion controller doesn't prevent running backups from being deleted, causing async operations (DataUpload, PodVolumeBackup, itemBlock processing) to continue running unaware of the backup deletion, resulting in resource contests and incomplete backup data leaks.
+Additionally, the backup deletion controller prevents running backups from being deleted.

-This problem is particularly acute in environments with frequent scheduled backups or large-scale backup operations that may run for extended periods.
-Users currently have no way to abort problematic backups other than restarting the Velero server, which affects all ongoing operations.

## Goals
- Enable users to cancel running backups through a `cancel` field in the backup specification
@@ -30,7 +28,7 @@ The solution introduces a new `cancel` boolean field to the backup specification
Existing controllers (backup_controller, backup_operations_controller, backup_finalizer_controller) will check for this field and transition the backup to a `Cancelling` phase before returning early from their reconcile loops.

A new dedicated backup cancellation controller will watch for backups in the `Cancelling` phase and coordinate the actual cancellation work.
-This controller will call `Cancel()` methods on all in-progress BackupItemAction operations (which automatically handle DataUpload cancellation), directly cancel PodVolumeBackups by setting their cancel flags, and finally transition the backup to `Cancelled` phase.
+This controller will call `Cancel()` methods on all in-progress BackupItemAction operations (which automatically handles DataUpload cancellation), directly cancel PodVolumeBackups by setting their cancel flags, and finally transition the backup to `Cancelled` phase.
The design uses a 5-second ticker to prevent API overload and ensures clean separation between cancellation detection and execution.
	return ctrl.Result{}, nil // Skip processing for cancelling/cancelled backups
}
```
+In addition, the `backup_operations_controller.go` will have a periodic check around backup progress updates, rather than running every time progress is updated to reduce API load.

#### New Backup Cancellation Controller
Create `backup_cancellation_controller.go`:
@@ -110,22 +109,15 @@ For PodVolumeBackups (which lack BackupItemAction implementations):
2. Sets `pvb.Spec.Cancel = true` on in-progress PVBs
3. Node-agent PodVolumeBackup controller handles actual cancellation

-### Timing and Frequency
-- Use 5-second ticker for cancellation controller to prevent API overload
-- Existing controllers check cancellation on every reconcile (event-driven)
-- No timeout for cancellation operations (rely on existing operation timeouts)

## Alternatives Considered

-### Alternative 1: Immediate Cancellation in Existing Controllers
-Instead of a dedicated cancellation controller, existing controllers could immediately cancel operations when detecting the cancel flag.
-This was rejected because it would complicate the existing controller logic and make the cancellation process less observable and debuggable.

-### Alternative 2: Deletion-Based Cancellation
+### Alternative 1: Deletion-Based Cancellation
Using backup deletion as the cancellation mechanism instead of a cancel field.
This was rejected because it doesn't allow users to preserve the backup object for inspection after cancellation, and deletion has different semantic meaning.

-### Alternative 3: Timeout-Based Automatic Cancellation
+### Alternative 2: Timeout-Based Automatic Cancellation
Automatically cancelling backups after a configurable timeout.
This was considered out of scope for the initial implementation as it addresses a different use case than user-initiated cancellation.

@@ -153,11 +145,6 @@ Implementation will be done incrementally in the following phases:

**Phase 3**: Testing and refinement
- Comprehensive end-to-end testing
-- Performance testing with cancellation controller ticker
+- Testing if slowdowns occur due to the frequency of checking `backup.Cancel` spec field
- Documentation and user guide updates

-Target timeline: 1-2 sprints for core implementation, with additional time for testing and documentation.
-
-## Open Issues
-- **PodVolumeBackup operation mapping**: Unlike DataUploads which are created by BackupItemActions with operationIDs, PodVolumeBackups are created directly and don't have a clear mapping to backup item operations. The current approach of finding PVBs by backup UID label should work but needs validation.
-- **Partial cancellation handling**: Determining the appropriate backup phase when some operations cancel successfully while others fail to cancel requires further investigation.