Commit 90a3a65

Add design document
Signed-off-by: Joseph <[email protected]>
1 parent 3be76da commit 90a3a65

1 file changed

+150
-0
lines changed

design/backup_cancellation.md

Lines changed: 150 additions & 0 deletions
@@ -0,0 +1,150 @@

# Backup Cancellation Design

## Abstract
This proposal introduces user-initiated backup cancellation to Velero, allowing users to abort running backups through a new `cancel` field in the backup specification.
The design addresses GitHub issues [#9189](https://github.com/vmware-tanzu/velero/issues/9189) and [#2098](https://github.com/vmware-tanzu/velero/issues/2098) by providing a mechanism to cleanly cancel async operations and prevent resource leaks when backups need to be terminated.

## Background
Velero currently cannot cancel a running backup, which causes the following problems.
When users accidentally submit a broad backup job (e.g., by forgetting to narrow resource selectors), the system becomes blocked and scheduled jobs accumulate.
Additionally, the backup deletion controller prevents running backups from being deleted.

## Goals
- Enable users to cancel running backups through a `cancel` field in the backup specification
- Cleanly cancel all associated async operations (BackupItemAction operations, DataUploads, PodVolumeBackups)
- Provide clear backup phase transitions (InProgress → Cancelling → Cancelled)

## Non Goals
- Cancelling backups that have already completed or failed
- Rolling back partially completed backup operations
- Implementing cancellation for restore operations (future work)

## High-Level Design
The solution introduces a new `cancel` boolean field to the backup specification that users can set to `true` to request cancellation.
Existing controllers (backup_controller, backup_operations_controller, backup_finalizer_controller) will check for this field and transition the backup to the `Cancelling` phase before returning early from their reconcile loops.

A new dedicated backup cancellation controller will watch for backups in the `Cancelling` phase and coordinate the actual cancellation work.
This controller will call the `Cancel()` method on all in-progress BackupItemAction operations (which in turn cancels the associated DataUploads), directly cancel PodVolumeBackups by setting their cancel flags, and finally transition the backup to the `Cancelled` phase.
The design uses a 5-second ticker to limit load on the API server and keeps cancellation detection separate from cancellation execution.


## Detailed Design

### API Changes
Add a new field to `BackupSpec`:
```go
type BackupSpec struct {
	// ... existing fields ...

	// Cancel indicates whether the backup should be cancelled.
	// When set to true, Velero will attempt to cancel all ongoing operations
	// and transition the backup to Cancelled phase.
	// +optional
	Cancel *bool `json:"cancel,omitempty"`
}
```

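Because `Cancel` is a pointer, callers have to treat `nil` and `false` alike. The following minimal sketch shows that check together with JSON decoding of a spec that predates the field. `backupSpec`, `cancelRequested`, and `decodeSpec` are hypothetical stand-ins written for this example, not Velero types or functions.

```go
package main

import "encoding/json"

// backupSpec is a trimmed, hypothetical stand-in for Velero's BackupSpec.
type backupSpec struct {
	Cancel *bool `json:"cancel,omitempty"`
}

// cancelRequested treats a nil pointer the same as an explicit false.
func cancelRequested(spec backupSpec) bool {
	return spec.Cancel != nil && *spec.Cancel
}

// decodeSpec parses a JSON spec; fields unknown to backupSpec are ignored.
func decodeSpec(raw []byte) (backupSpec, error) {
	var spec backupSpec
	err := json.Unmarshal(raw, &spec)
	return spec, err
}
```

A spec written before the field existed decodes with `Cancel == nil`, so `cancelRequested` returns false for it.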

Add new backup phases to `BackupPhase`:
```go
const (
	// ... existing phases ...
	BackupPhaseCancelling BackupPhase = "Cancelling"
	BackupPhaseCancelled  BackupPhase = "Cancelled"
)
```

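The transition path stated in the Goals (InProgress → Cancelling → Cancelled) can be expressed as a small guard function. This is a sketch under that assumption only: `canTransition` is a hypothetical helper, and whether other in-flight phases may also move to `Cancelling` is not settled by this document.

```go
package main

// BackupPhase mirrors the string-typed phase used in the design.
type BackupPhase string

const (
	BackupPhaseInProgress BackupPhase = "InProgress"
	BackupPhaseCancelling BackupPhase = "Cancelling"
	BackupPhaseCancelled  BackupPhase = "Cancelled"
)

// canTransition is a hypothetical guard that allows only the
// InProgress -> Cancelling -> Cancelled path from the Goals section.
func canTransition(from, to BackupPhase) bool {
	switch from {
	case BackupPhaseInProgress:
		return to == BackupPhaseCancelling
	case BackupPhaseCancelling:
		return to == BackupPhaseCancelled
	default:
		return false
	}
}
```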

### Controller Changes

#### Existing Controllers
Modify `backup_controller.go`, `backup_operations_controller.go`, and `backup_finalizer_controller.go` to check for cancellation:
```go
// Early in each Reconcile method
if backup.Spec.Cancel != nil && *backup.Spec.Cancel {
	if backup.Status.Phase != BackupPhaseCancelling && backup.Status.Phase != BackupPhaseCancelled {
		// Snapshot the object before mutating it so MergeFrom can compute the patch
		original := backup.DeepCopy()
		backup.Status.Phase = BackupPhaseCancelling
		// Update backup and return
		return ctrl.Result{}, c.Client.Patch(ctx, backup, client.MergeFrom(original))
	}
	return ctrl.Result{}, nil // Skip processing for cancelling/cancelled backups
}
```
In addition, `backup_operations_controller.go` will check for cancellation periodically rather than on every progress update, to reduce API load.


#### New Backup Cancellation Controller
Create `backup_cancellation_controller.go`:
```go
type backupCancellationReconciler struct {
	client.Client
	logger            logrus.FieldLogger
	itemOperationsMap *itemoperationmap.BackupItemOperationsMap
	newPluginManager  func(logger logrus.FieldLogger) clientmgmt.Manager
	backupStoreGetter persistence.ObjectBackupStoreGetter
}
```

The controller will:
1. Watch for backups in `BackupPhaseCancelling`
2. Get operations from `itemOperationsMap.GetOperationsForBackup()`
3. Call `bia.Cancel(operationID, backup)` on all in-progress BackupItemAction operations
4. Find and cancel PodVolumeBackups by setting `pvb.Spec.Cancel = true`
5. Wait for all cancellations to complete
6. Set backup phase to `BackupPhaseCancelled`
7. Update backup metadata in object storage


### Cancellation Flow

#### BackupItemAction Operations
For operations with BackupItemAction v2 implementations (e.g., CSI PVC actions):
1. Controller calls `bia.Cancel(operationID, backup)`
2. CSI PVC action finds associated DataUpload and sets `du.Spec.Cancel = true`
3. Node-agent DataUpload controller handles actual cancellation
4. Operation marked as `OperationPhaseCanceled`

#### PodVolumeBackup Operations
For PodVolumeBackups (which lack BackupItemAction implementations):
1. Controller directly finds PVBs by backup UID label
2. Sets `pvb.Spec.Cancel = true` on in-progress PVBs
3. Node-agent PodVolumeBackup controller handles actual cancellation

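The PodVolumeBackup steps above amount to a filter followed by a flag update. The sketch below works on an in-memory slice to show the selection logic only. `podVolumeBackup` is a hypothetical trimmed type, and the label key `velero.io/backup-uid` and the phase string `InProgress` are assumptions made for this example; a real controller would list and patch objects through the Kubernetes client.

```go
package main

// backupUIDLabel is an assumed label key for this example.
const backupUIDLabel = "velero.io/backup-uid"

// podVolumeBackup is a hypothetical, flattened stand-in for the PVB resource.
type podVolumeBackup struct {
	Labels map[string]string
	Phase  string
	Cancel bool
}

// markPVBsForCancel sets Cancel on in-progress PVBs that belong to the given
// backup and returns how many were newly marked.
func markPVBsForCancel(pvbs []podVolumeBackup, backupUID string) int {
	marked := 0
	for i := range pvbs {
		pvb := &pvbs[i]
		if pvb.Labels[backupUIDLabel] != backupUID || pvb.Phase != "InProgress" || pvb.Cancel {
			continue
		}
		pvb.Cancel = true
		marked++
	}
	return marked
}
```

Skipping PVBs that are already marked keeps the operation idempotent, so repeated reconciles do not issue redundant updates.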

## Alternatives Considered

### Alternative 1: Deletion-Based Cancellation
Using backup deletion as the cancellation mechanism instead of a cancel field.
This was rejected because it doesn't allow users to preserve the backup object for inspection after cancellation, and deletion has a different semantic meaning.

### Alternative 2: Timeout-Based Automatic Cancellation
Automatically cancelling backups after a configurable timeout.
This was considered out of scope for the initial implementation as it addresses a different use case than user-initiated cancellation.

## Security Considerations
The cancel field requires the same RBAC permissions as updating other backup specification fields.
No additional security considerations are introduced as the cancellation mechanism reuses existing operation cancellation pathways that are already secured.

## Compatibility
The new `cancel` field is optional and defaults to nil, which is treated as false, ensuring backward compatibility with existing backup specifications.
Existing backups will continue to work without modification.
The new backup phases (`Cancelling`, `Cancelled`) are additive and don't affect existing phase transitions.

## Implementation
Implementation will be done incrementally in the following phases:

**Phase 1**: API changes and basic cancellation detection
- Add `cancel` field to BackupSpec
- Add new backup phases
- Update existing controllers to detect cancellation and transition to `Cancelling` phase

**Phase 2**: Cancellation controller implementation
- Implement backup cancellation controller
- Add BackupItemAction operation cancellation
- Add PodVolumeBackup direct cancellation

**Phase 3**: Testing and refinement
- Comprehensive end-to-end testing
- Measuring whether the frequency of checking the `backup.Spec.Cancel` field causes slowdowns
- Documentation and user guide updates
