Commit 4535830

Add design document
1 parent 3be76da commit 4535830

1 file changed, +163 -0 lines changed

design/backup_cancellation.md

Lines changed: 163 additions & 0 deletions
@@ -0,0 +1,163 @@
# Backup Cancellation Design

## Abstract
This proposal introduces user-initiated backup cancellation to Velero, allowing users to abort running backups through a new `cancel` field in the backup specification.
The design addresses GitHub issues #9189 and #2098 by providing a mechanism to cleanly cancel async operations and prevent resource leaks when backups need to be terminated.

## Background
Velero currently cannot cancel a running backup, which causes several problems.
When users accidentally submit broad backup jobs (e.g., by forgetting to narrow resource selectors), the system becomes blocked and scheduled jobs accumulate.
Additionally, the backup deletion controller doesn't prevent running backups from being deleted. Async operations (DataUpload, PodVolumeBackup, itemBlock processing) then continue running, unaware that the backup has been deleted, which leads to resource contention and leaked, incomplete backup data.

This problem is particularly acute in environments with frequent scheduled backups or large-scale backup operations that may run for extended periods.
Users currently have no way to abort problematic backups other than restarting the Velero server, which affects all ongoing operations.

## Goals
- Enable users to cancel running backups through a `cancel` field in the backup specification
- Cleanly cancel all associated async operations (BackupItemAction operations, DataUploads, PodVolumeBackups)
- Prevent resource leaks and contention when backups are deleted or cancelled
- Provide clear backup phase transitions (InProgress → Cancelling → Cancelled)

## Non Goals
- Cancelling backups that have already completed or failed
- Rolling back partially completed backup operations
- Implementing cancellation for restore operations (future work)

## High-Level Design
The solution introduces a new `cancel` boolean field to the backup specification that users can set to `true` to request cancellation.
Existing controllers (backup_controller, backup_operations_controller, backup_finalizer_controller) will check for this field and transition the backup to a `Cancelling` phase before returning early from their reconcile loops.

A new dedicated backup cancellation controller will watch for backups in the `Cancelling` phase and coordinate the actual cancellation work.
This controller will call `Cancel()` methods on all in-progress BackupItemAction operations (which automatically handle DataUpload cancellation), directly cancel PodVolumeBackups by setting their cancel flags, and finally transition the backup to the `Cancelled` phase.
The design uses a 5-second ticker to limit load on the API server and cleanly separates cancellation detection from cancellation execution.

## Detailed Design

### API Changes
Add a new field to `BackupSpec`:
```go
type BackupSpec struct {
	// ... existing fields ...

	// Cancel indicates whether the backup should be cancelled.
	// When set to true, Velero will attempt to cancel all ongoing operations
	// and transition the backup to Cancelled phase.
	// +optional
	Cancel *bool `json:"cancel,omitempty"`
}
```
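
To illustrate how a client might request cancellation through this field, the sketch below builds a JSON merge patch body that sets `spec.cancel` to `true`. The `cancelPatch` type and `buildCancelPatch` helper are hypothetical names introduced only for this example; they are not part of the proposed API.

```go
package main

import "encoding/json"

// cancelPatch is a hypothetical type that models only the part of the
// Backup object touched by a cancellation request.
type cancelPatch struct {
	Spec struct {
		Cancel bool `json:"cancel"`
	} `json:"spec"`
}

// buildCancelPatch returns a JSON merge patch body that sets spec.cancel
// to true and leaves every other field of the backup untouched.
func buildCancelPatch() (string, error) {
	var p cancelPatch
	p.Spec.Cancel = true
	b, err := json.Marshal(p)
	if err != nil {
		return "", err
	}
	return string(b), nil
}
```

With kubectl, an equivalent request might look like `kubectl -n velero patch backup <backup-name> --type merge -p '{"spec":{"cancel":true}}'`, assuming Velero is installed in the `velero` namespace.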

Add new backup phases to `BackupPhase`:
```go
const (
	// ... existing phases ...
	BackupPhaseCancelling BackupPhase = "Cancelling"
	BackupPhaseCancelled  BackupPhase = "Cancelled"
)
```
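
The Goals section names a single cancellation path, InProgress → Cancelling → Cancelled. One way to make that path explicit is a small transition table, sketched below. The `allowedNext` map and `canTransition` helper are assumptions made for illustration, and the table deliberately lists only the two transitions the design names; whether other in-flight phases may also enter `Cancelling` is not specified here.

```go
package main

// BackupPhase mirrors the string-typed phase used in the design.
type BackupPhase string

const (
	BackupPhaseInProgress BackupPhase = "InProgress"
	BackupPhaseCancelling BackupPhase = "Cancelling"
	BackupPhaseCancelled  BackupPhase = "Cancelled"
)

// allowedNext is a hypothetical table holding only the cancellation-related
// transitions named in the Goals section.
var allowedNext = map[BackupPhase]BackupPhase{
	BackupPhaseInProgress: BackupPhaseCancelling,
	BackupPhaseCancelling: BackupPhaseCancelled,
}

// canTransition reports whether moving from one phase to another follows
// the InProgress -> Cancelling -> Cancelled path.
func canTransition(from, to BackupPhase) bool {
	next, ok := allowedNext[from]
	return ok && next == to
}
```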

### Controller Changes

#### Existing Controllers
Modify `backup_controller.go`, `backup_operations_controller.go`, and `backup_finalizer_controller.go` to check for cancellation:
```go
// Early in each Reconcile method, after fetching the backup
original := backup.DeepCopy() // snapshot taken before mutation, used as the patch base
if backup.Spec.Cancel != nil && *backup.Spec.Cancel {
	if backup.Status.Phase != BackupPhaseCancelling && backup.Status.Phase != BackupPhaseCancelled {
		backup.Status.Phase = BackupPhaseCancelling
		// Persist the phase change and return
		return ctrl.Result{}, c.Client.Patch(ctx, backup, client.MergeFrom(original))
	}
	return ctrl.Result{}, nil // Skip processing for cancelling/cancelled backups
}
```
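
The check can also be expressed as a pure function, which makes it easy to unit test outside a reconciler. The sketch below is one possible shape, not the proposed implementation: `cancelDecision` and `decide` are hypothetical names, and the guard that leaves `Completed` and `Failed` backups alone is an assumption derived from the Non Goals section (the snippet in this section does not include such a guard).

```go
package main

// BackupPhase mirrors the string-typed phase used in the design.
type BackupPhase string

const (
	BackupPhaseInProgress BackupPhase = "InProgress"
	BackupPhaseCompleted  BackupPhase = "Completed"
	BackupPhaseFailed     BackupPhase = "Failed"
	BackupPhaseCancelling BackupPhase = "Cancelling"
	BackupPhaseCancelled  BackupPhase = "Cancelled"
)

// cancelDecision describes what a reconciler should do with a backup.
type cancelDecision int

const (
	proceed        cancelDecision = iota // reconcile normally
	markCancelling                       // patch the phase to Cancelling, then return
	skip                                 // already cancelling or cancelled; return early
)

// decide maps the cancel flag and the current phase to a decision.
func decide(cancel *bool, phase BackupPhase) cancelDecision {
	if cancel == nil || !*cancel {
		return proceed
	}
	switch phase {
	case BackupPhaseCancelling, BackupPhaseCancelled:
		return skip
	case BackupPhaseCompleted, BackupPhaseFailed:
		// Assumption: finished backups are out of scope for cancellation.
		return proceed
	default:
		return markCancelling
	}
}
```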

#### New Backup Cancellation Controller
Create `backup_cancellation_controller.go`:
```go
type backupCancellationReconciler struct {
	client.Client
	logger            logrus.FieldLogger
	itemOperationsMap *itemoperationmap.BackupItemOperationsMap
	newPluginManager  func(logger logrus.FieldLogger) clientmgmt.Manager
	backupStoreGetter persistence.ObjectBackupStoreGetter
}
```

The controller will:
1. Watch for backups in `BackupPhaseCancelling`
2. Get operations from `itemOperationsMap.GetOperationsForBackup()`
3. Call `bia.Cancel(operationID, backup)` on all in-progress BackupItemAction operations
4. Find and cancel PodVolumeBackups by setting `pvb.Spec.Cancel = true`
5. Wait for all cancellations to complete
6. Set backup phase to `BackupPhaseCancelled`
7. Update backup metadata in object storage
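
Steps 2 through 6 can be sketched with simplified stand-in types. Everything below is hypothetical: `operation`, `canceller`, `cancelFunc`, and `podVolumeBackup` are reduced models of the real Velero types, chosen so the example compiles on its own, and the real controller would work against the Kubernetes API rather than in-memory slices.

```go
package main

import "errors"

// operation is a simplified stand-in for an item operation entry.
type operation struct {
	ID         string
	InProgress bool
}

// canceller abstracts the BackupItemAction Cancel call (hypothetical).
type canceller interface {
	Cancel(operationID string) error
}

// cancelFunc adapts a plain function to the canceller interface.
type cancelFunc func(operationID string) error

func (f cancelFunc) Cancel(operationID string) error { return f(operationID) }

// podVolumeBackup is a simplified stand-in for the PVB resource.
type podVolumeBackup struct {
	Name       string
	InProgress bool
	Cancel     bool
}

// requestCancellation covers steps 2-4: it asks every in-progress operation
// to cancel and flags every in-progress PVB. It returns the number of
// requests issued and any Cancel errors joined into one error.
func requestCancellation(ops []operation, bia canceller, pvbs []*podVolumeBackup) (int, error) {
	var errs []error
	issued := 0
	for _, op := range ops {
		if !op.InProgress {
			continue
		}
		if err := bia.Cancel(op.ID); err != nil {
			errs = append(errs, err)
			continue
		}
		issued++
	}
	for _, pvb := range pvbs {
		if pvb.InProgress && !pvb.Cancel {
			pvb.Cancel = true
			issued++
		}
	}
	return issued, errors.Join(errs...)
}

// readyToFinalize covers step 5: the backup may move to Cancelled (step 6)
// only once nothing is still in progress.
func readyToFinalize(ops []operation, pvbs []*podVolumeBackup) bool {
	for _, op := range ops {
		if op.InProgress {
			return false
		}
	}
	for _, pvb := range pvbs {
		if pvb.InProgress {
			return false
		}
	}
	return true
}
```

Splitting the request step from the readiness check lets each tick re-evaluate `readyToFinalize` without re-issuing cancel calls for work that has already stopped.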

### Cancellation Flow

#### BackupItemAction Operations
For operations with BackupItemAction v2 implementations (e.g., CSI PVC actions):
1. The controller calls `bia.Cancel(operationID, backup)`
2. The CSI PVC action finds the associated DataUpload and sets `du.Spec.Cancel = true`
3. The node-agent DataUpload controller handles the actual cancellation
4. The operation is marked as `OperationPhaseCanceled`

#### PodVolumeBackup Operations
For PodVolumeBackups (which lack BackupItemAction implementations):
1. The controller finds PVBs directly by their backup UID label
2. It sets `pvb.Spec.Cancel = true` on in-progress PVBs
3. The node-agent PodVolumeBackup controller handles the actual cancellation
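
The selection in steps 1 and 2 might look like the sketch below. The `pvb` struct is a simplified stand-in for the real resource, `markForCancel` is a hypothetical helper, and the label key `velero.io/backup-uid` and the phase string `InProgress` are assumptions that should be checked against the Velero API package before use.

```go
package main

// backupUIDLabel is assumed to be the label key that ties a PodVolumeBackup
// to its owning Backup.
const backupUIDLabel = "velero.io/backup-uid"

// pvb is a simplified stand-in for the PodVolumeBackup resource.
type pvb struct {
	Name   string
	Labels map[string]string
	Phase  string
	Cancel bool
}

// markForCancel sets Cancel on every in-progress PVB that belongs to the
// given backup and returns the names of the PVBs it changed.
func markForCancel(pvbs []*pvb, backupUID string) []string {
	var marked []string
	for _, p := range pvbs {
		if p.Labels[backupUIDLabel] != backupUID {
			continue // belongs to a different backup
		}
		if p.Phase != "InProgress" || p.Cancel {
			continue // finished, not started, or already flagged
		}
		p.Cancel = true
		marked = append(marked, p.Name)
	}
	return marked
}
```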

### Timing and Frequency
- Use a 5-second ticker for the cancellation controller to limit load on the API server
- Existing controllers check for cancellation on every reconcile (event-driven)
- No separate timeout for cancellation operations (rely on existing operation timeouts)
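
A ticker-driven loop of this kind can be sketched with the standard library alone. `runEvery` and `countTicks` are hypothetical helpers for illustration; the demo uses a 1 ms interval so it finishes quickly, whereas the design calls for `5 * time.Second`.

```go
package main

import (
	"context"
	"time"
)

// runEvery calls fn on each tick until ctx is cancelled. It rechecks the
// context before each call so no work starts after shutdown was requested.
func runEvery(ctx context.Context, interval time.Duration, fn func()) {
	ticker := time.NewTicker(interval)
	defer ticker.Stop()
	for {
		select {
		case <-ctx.Done():
			return
		case <-ticker.C:
			if ctx.Err() != nil {
				return
			}
			fn()
		}
	}
}

// countTicks is a demo helper: it runs the loop with a short interval,
// stops it after n invocations, and returns how many times fn ran.
func countTicks(n int) int {
	if n <= 0 {
		return 0
	}
	ctx, cancel := context.WithCancel(context.Background())
	defer cancel()
	count := 0
	runEvery(ctx, time.Millisecond, func() {
		count++
		if count == n {
			cancel()
		}
	})
	return count
}
```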

## Alternatives Considered

### Alternative 1: Immediate Cancellation in Existing Controllers
Instead of a dedicated cancellation controller, existing controllers could immediately cancel operations when detecting the cancel flag.
This was rejected because it would complicate the existing controller logic and make the cancellation process harder to observe and debug.

### Alternative 2: Deletion-Based Cancellation
Backup deletion could serve as the cancellation mechanism instead of a cancel field.
This was rejected because it doesn't allow users to preserve the backup object for inspection after cancellation, and deletion has a different semantic meaning.

### Alternative 3: Timeout-Based Automatic Cancellation
Backups could be cancelled automatically after a configurable timeout.
This was considered out of scope for the initial implementation as it addresses a different use case than user-initiated cancellation.

## Security Considerations
Setting the `cancel` field requires the same RBAC permissions as updating any other backup specification field.
No additional security considerations are introduced, as the cancellation mechanism reuses existing operation cancellation pathways that are already secured.

## Compatibility
The new `cancel` field is optional and defaults to nil/false, ensuring backward compatibility with existing backup specifications.
Existing backups will continue to work without modification.
The new backup phases (`Cancelling`, `Cancelled`) are additive and don't affect existing phase transitions.
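
The nil/false default can be checked with a small decoding sketch. `backupSpec` here is a reduced, hypothetical copy of the spec holding only the new field, and `cancelRequested` is an illustrative helper; a spec written before this change simply has no `cancel` key and decodes to a nil pointer.

```go
package main

import "encoding/json"

// backupSpec is a reduced stand-in holding only the new optional field.
type backupSpec struct {
	Cancel *bool `json:"cancel,omitempty"`
}

// cancelRequested decodes a spec and treats a missing (nil) cancel field
// the same as an explicit false.
func cancelRequested(specJSON string) (bool, error) {
	var s backupSpec
	if err := json.Unmarshal([]byte(specJSON), &s); err != nil {
		return false, err
	}
	return s.Cancel != nil && *s.Cancel, nil
}
```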

## Implementation
Implementation will be done incrementally in the following phases:

**Phase 1**: API changes and basic cancellation detection
- Add `cancel` field to BackupSpec
- Add new backup phases
- Update existing controllers to detect cancellation and transition to `Cancelling` phase

**Phase 2**: Cancellation controller implementation
- Implement backup cancellation controller
- Add BackupItemAction operation cancellation
- Add PodVolumeBackup direct cancellation

**Phase 3**: Testing and refinement
- Comprehensive end-to-end testing
- Performance testing with cancellation controller ticker
- Documentation and user guide updates

Target timeline: 1-2 sprints for core implementation, with additional time for testing and documentation.

## Open Issues
- **PodVolumeBackup operation mapping**: Unlike DataUploads, which are created by BackupItemActions with operationIDs, PodVolumeBackups are created directly and don't have a clear mapping to backup item operations. The current approach of finding PVBs by backup UID label should work but needs validation.
- **Partial cancellation handling**: It is not yet decided which backup phase applies when some operations cancel successfully while others fail to cancel; this requires further investigation.
