-
Notifications
You must be signed in to change notification settings - Fork 1.5k
Design: Backup Cancellation #9284
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
ade00a4 to
864c932
Compare
Signed-off-by: Joseph <jvaikath@redhat.com>
864c932 to
90a3a65
Compare
|
/kind changelog-not-required |
Signed-off-by: Joseph <jvaikath@redhat.com>
design/backup_cancellation.md
Outdated
| Existing controllers (backup_controller, backup_operations_controller, backup_finalizer_controller) will check for this field and transition the backup to a `Cancelling` phase before returning early from their reconcile loops. | ||
|
|
||
| A new dedicated backup cancellation controller will watch for backups in the `Cancelling` phase and coordinate the actual cancellation work. | ||
| This controller will call `Cancel()` methods on all in-progress BackupItemAction operations (which automatically handles DataUpload cancellation), directly cancel PodVolumeBackups by setting their cancel flags, and finally transition the backup to `Cancelled` phase. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is specifically related to the asynchronous phase of BIAs, since the Execute() calls will no longer be in progress once the backup reconciler exits. I would replace "on all in-progress BackupItemOperations" with "on all incomplete asynchronous BackupItemAction operations".
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Also, since PodVolumeBackups are handled synchronously, I don't think any of these will still be InProgress by the time the backup gets to the Cancelling" phase. Within the item backupper, upon completing an ItemBlock, we call waitUntilPVBsProcessed, checking PVB status every 5 seconds. You probably need a cancellation check here and if cancelled, set cancelled on PVB and exit the ItemBlock backup. So PVB cancellation will happen in the backup reconciler rather than in the cancellation reconciler.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Within BackupWithResolvers, when we detect cancellation we break out of the Item -> ItemBlock processing loop
I then plan to get and set the cancel spec on PodVolumeBackups associated with the backup within BackupWIthResolvers when cancellation is detected
waitUntilPVBsProcessed checks for either completion or cancellation phase
design/backup_cancellation.md
Outdated
| return ctrl.Result{}, nil // Skip processing for cancelling/cancelled backups | ||
| } | ||
| ``` | ||
| In addition, the `backup_operations_controller.go` will have a periodic check around backup progress updates, rather than running every time progress is updated to reduce API load. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Backup operations controller reconciles once every 10 seconds. It should be fine to check for cancellation here on every reconcile.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, an API call to get the Backup resource every 10 seconds should not be an issue
If true set phase to BackupPhaseCancelling and backup_cancellation_controller takes it up
design/backup_cancellation.md
Outdated
| 1. Watch for backups in `BackupPhaseCancelling` | ||
| 2. Get operations from `itemOperationsMap.GetOperationsForBackup()` | ||
| 3. Call `bia.Cancel(operationID, backup)` on all in-progress BackupItemAction operations | ||
| 4. Find and cancel PodVolumeBackups by setting `pvb.Spec.Cancel = true` |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We will probably skip 4, since those run synchronously, so we can cancel those during the wait cycle in the item backupper.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Understood, post hook handling means that if cancellation isn't done before the BackupWithResolvers call, then all ItemBlocks and Pod volume backups will be processed
all or nothing
cancellation after BackupWithResolvers means that we only care about async BIAs
design/backup_cancellation.md
Outdated
| 2. Get operations from `itemOperationsMap.GetOperationsForBackup()` | ||
| 3. Call `bia.Cancel(operationID, backup)` on all in-progress BackupItemAction operations | ||
| 4. Find and cancel PodVolumeBackups by setting `pvb.Spec.Cancel = true` | ||
| 5. Wait for all cancellations to complete |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not sure how long DataUpload cancel normally takes. We can't guarantee that third-party plugins will handle Cancel gracefully, though, so we may not want to wait after cancel until the operation registers completion.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If that is the case, then you will need some way to tell the user that the backup is still cancelling and point to the DataUpload that is taking time IMO
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@shawn-hurley what about other async operations from external plugins? There's no guarantee that they even support cancellation. From the backup's point of view, we've called cancel, we can move on. I guess we have a few options:
- fire and forget. Call cancel, move on and cancel the backup, and maybe it's OK that the backup is cancelled while things like datauploads or external 3rd party actions are still cancelling.
- call cancel and block until the Progress call confirms that the operation has completed. This means that, for example, an InProgress backup would remain in progress until every cancelled data upload stops running, and for 3rd party plugins which can't actually force a cancel, waiting until they complete normally
- Similar to 2, but with a short-ish timeout. And yes, we could probably add an error to the backup for cancellation timeout. How long should we wait, though? 1 minute? 5?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm leaning towards 1
It's about best effort
Trying to figure out what cleaning up completed DataUploads looks like
design/backup_cancellation.md
Outdated
|
|
||
| ### Alternative 1: Deletion-Based Cancellation | ||
| Using backup deletion as the cancellation mechanism instead of a cancel field. | ||
| This was rejected because it doesn't allow users to preserve the backup object for inspection after cancellation, and deletion has different semantic meaning. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The other reason to reject this is that deleting an object while another controller is still operating on it could lead to nondeterministic behavior.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think that is just apart of Kubernetes, if the controller had to process something before deletion, then they need to add a finalizer.
What information will be on the cancelled backup that should be inspected, what is the user going to get from the canceled object that is useful.
If we are keeping cancelled backups, do we need some sort of pruning from object storage, I could imagine this list of cancelled backups could grow pretty fast.
If we do keep the backup, what other checks need to be put in place to make sure that you can't use a cancelled backup in restore?
In a similar vein, is there something that needs to be done differently on a delete of a cancelled backup?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Cancelled backups may or may not have anything in object storage, but pruning object storage is already part of backup deletion. One option may be to have a separate (shorter) expiration cycle for cancelled backups. i.e. "completed backups have a ttl of 30 days, cancelled backups have a ttl of 5 days" -- then regular gc would clean up cancelled backups, they'll just be cleaned up sooner than completed ones.
Deleting a cancelled backup should be identical to deleting a completed backup, except that we may not have all of the components in object storage. i.e. if we cancel before uploading anything, there's nothing in object storage. If we cancel after the tarball and some data uploads have completed, then we have most of the s3 contents to delete. But basically, deleting a cancelled backup is just like deleting a regular backup as long as we don't error out if some expected object storage piece isn't there.
design/backup_cancellation.md
Outdated
|
|
||
| ### Alternative 2: Timeout-Based Automatic Cancellation | ||
| Automatically cancelling backups after a configurable timeout. | ||
| This was considered out of scope for the initial implementation as it addresses a different use case than user-initiated cancellation. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Also, we already have timeout-based backup failure (both PVB and async BIAs have timeouts that trigger failure. That should be sufficient for timeout-based behavior.
design/backup_cancellation.md
Outdated
| - Add new backup phases | ||
| - Update existing controllers to detect cancellation and transition to `Cancelling` phase | ||
|
|
||
| **Phase 2**: Cancellation controller implementation |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Note that between Phase 1 and phase 2, "Cancelling" is essentially a terminal phase. Since we don't plan on having a release with only phase 1 in it, this should be fine.
Codecov Report✅ All modified and coverable lines are covered by tests. Additional details and impacted files@@ Coverage Diff @@
## main #9284 +/- ##
==========================================
+ Coverage 59.64% 59.67% +0.03%
==========================================
Files 382 383 +1
Lines 43960 44050 +90
==========================================
+ Hits 26218 26289 +71
- Misses 16195 16215 +20
+ Partials 1547 1546 -1 ☔ View full report in Codecov by Sentry. 🚀 New features to boost your workflow:
|
design/backup_cancellation.md
Outdated
| The solution introduces a new `cancel` boolean field to the backup specification that users can set to `true` to request cancellation. | ||
| Existing controllers (backup_controller, backup_operations_controller, backup_finalizer_controller) will check for this field and transition the backup to a `Cancelling` phase before returning early from their reconcile loops. | ||
|
|
||
| A new dedicated backup cancellation controller will watch for backups in the `Cancelling` phase and coordinate the actual cancellation work. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why do we need a dedicated controller?
Logically, cancellation are phases of the backup CR as same as other phases, e.g., new, failed, inProgress, etc. It is more reasonable to process them all together.
Practically, if we allow multiple controllers to process the same backup CR, we must pay attention on the contest resolving, which is not necessary.
As existing examples, the controllers for PVB, PVR, DataUpload, DataDownload, handle cancellation in the same controller and reconciler.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We already have more than one controller for the backup. There's the backup reconciler and the backup operations reconciler. But I think you're right -- we don't necessarily need to create a completely new one. We may not even need both "Canceling" and "Cancelled" phases -- basically the existing controllers see "cancel=true", stop work, cancel datauploads, set phase="cancelled"
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The dedicated controller could be avoid if the following are handled within the same controllers:
- async op cancellation:
backup_operations_controllercould do this - backup data deletion:
backup_deletion_controllerwould need to be invoked by creating a specialBackupDeleteRequestthat doesn't delete the logs and the backup resource itself
Would mean that multiple points in other controllers would need to do step 2 if cancel is true
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Cancel action and cancelled status is more suitable to be part of the primitive state machine for the backup CR, so the processing is better to be together with other primitive states, which is helpful for code understanding and code management.
Async operation is not with a action-state pattern and backup deletion controller is not for backup CR but for deleteBackupRequest CR.
design/backup_cancellation.md
Outdated
| - Implementing cancellation for restore operations (future work) | ||
|
|
||
|
|
||
| ## High-Level Design |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
- Please add a diagram to describe the phase transition of backup CR after cancellation is added.
- Please add more details to elaborate on how to deal with the data that have already been uploaded to the BSL when a backup is cancelled.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we have 2 choices for BSL metadata:
- Rely on normal gc for expired backups, and perhaps facilitate this by having a shorter ttl for cancelled backups. For example, 30 day ttl for completed backups, 5 day (or shorter) ttl for cancelled backups.
- Clean up within the backup controller.
The second option has a couple problems. First, we need this logic both in the backup controller and in the backup operations controller. Second, we have to duplicate logic from the deletion controller. I think the shorter ttl may be the best approach.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Using backup ttl will involve UX level concept into Velero's internal implementation logics. E.g., though unlikely, literally, users could change the ttl after Velero's cancellation logics set one.
Alternatively and similarly, how about we have the backupDeletion controller to directly check backups in Cancelled phase, once found, run the same backup deletion workflow?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@Lyndon-Li If cancellation is initiated by the deletion controller because the user ran velero backup delete on a running backup, then yes, the deletion workflow will automatically remove the backup upon cancel. However, if the user is just trying to cancel (and not delete) the backup, then we do not want to run the deletion workflow immediately. We should either wait for normal expiration (or a subsequent user-initiated deletion request), or we could add a new cancellation ttl that's shorter.
We can't assume that it's safe to delete the backup immediately. One cancellation use case is that a large backup is already generating numerous errors -- for example, DataUpload failures or something similar -- so the user does not want to wait for the backup to complete, but they want to be able to diagnose the failures. In that case, they'd want to cancel the backup, then look into the failures, etc. (possibly generate a debug bundle or something similar), and then delete the backup. If velero always deletes the backup immediately after cancellation, they won't be able to do that.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
they'd want to cancel the backup, then look into the failures, etc. (possibly generate a debug bundle or something similar), and then delete the backup. If velero always deletes the backup immediately after cancellation, they won't be able to do that.
I agree for the backup itself (backup CR and other CRs it associates to, e.g., DataUpload, PodVolumeBackup), but it is not true for the backup data.
After backup is cancelled, the backup data is meaningless so it is not reasonable to leave the data to take the storage space, especially for some large backups which would possibly bring noticeable costs.
So I am wondering if it is a good idea to let the backupDeleteionController to delete the backup data in the storage/repository immediately after the backup gets to Cancelled.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@Lyndon-Li hmm. That makes some sense, but I'm not sure how we'd do that. Currently the BackupDeletionController only reconciles if there's a DeleteBackupRequest. We'd need to modify it to reconcile Backups too (i.e. the object type passed in would be one or the other, so the controller would need to deal with both) -- if a deletion request, follow current logic, if a backup, then ignore unless cancelled.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, maybe we need to add another controller, so as to make the backupDeletionController clean, but we for sure need to share the common functions between the two.
Adding another controller is fine for this purpose because it doesn't change the state of the backup CRs, it just monitor and react.
Anyway, I think it is not a good idea to use the UX level interfaces to complete the internal workflows, e.g., use the TTL mechanism or creating a deleteBackupRequest.
design/backup_cancellation.md
Outdated
| ### Cancellation Flow | ||
|
|
||
| #### BackupItemAction Operations | ||
| For operations with BackupItemAction v2 implementations (e.g., CSI PVC actions): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What about BIA V1?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
They don't have any operations to cancel. Once the item backupper completes, it's done. We should check status.cancel before backing up an item (but only make apiserver calls to refresh the backup periodically, not on every item) -- and if cancel is true, then we return immediately and don't call the BIA at all. For an item that is currently being backed up when we refresh to get cancel=true, we don't need to do any action. Wait for it to complete (it's just one item)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
OK. So let's add some workflow in the design to elaborate on how we add this checkpoints.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Added how this would be done within backup_controller -> BackupWithResolver which is where most synchronous Item Itemblock processing occurs
design/backup_cancellation.md
Outdated
| 1. Controller directly finds PVBs by backup UID label | ||
| 2. Sets `pvb.Spec.Cancel = true` on in-progress PVBs | ||
| 3. Node-agent PodVolumeBackup controller handles actual cancellation | ||
|
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What about other backup operations, e.g, hooks, native snapshot?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We probably need to check status before invoking the native snapshot, and in any polling code waiting for completion. But there is no cancel call in the native snapshot plugin API, so we may need to wait for any already-started snapshots to complete and document this behavior.
For hooks, as with item backup, we need to check for cancellation before invoking a hook. but we don't need to deal with in-progress hooks since those aren't long-running operations and they're done synchronously.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
OK. So let's add some workflow in the design to elaborate on how we add this checkpoints.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The way I see it cancellation just stops sending ItemBlocks to the worker pool i.e. it's more of an early exit, and then setting PVB spec cancel fields to true
BackupWithResolvers would return and so the hooks would run
Native snapshots are synchronous, so we can't cancel them, only avoid running if we see cancel is true by early exit
design/backup_cancellation.md
Outdated
|
|
||
|
|
||
| ## High-Level Design | ||
| The solution introduces a new `cancel` boolean field to the backup specification that users can set to `true` to request cancellation. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
why a spec value and not use a the finalizer and a delete on the backup?
If a back up is canceled are we considering that this backup should not be removed?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't necessarily think we want immediate deletion. Just like we don't immediately delete failed backups or failed validation. One valid use case is that I ran a big backup and I'm seeing my datauploads all failing consistently, so I cancel it, but I want to debug it before deleting the backup. If canceling nukes the backup immediately, then I lose that ability. What about just setting a shorter ttl upon cancellation? Canceled backups are deleted in a few days by default, users can make them delete in an hour, for example, if they want to via config (shorter than that won't make sense, since I think the GC controller only runs hourly).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I also have some doubt whether we wanna expose the "cancel" option to user and allow the user to cancel the backup, and keeps it in the object storage as "cancelled".
When reviewing issue #9189, I used to think the cancellation should only be triggered when the user deletes a backup.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@reasonerjt One reason we want cancellation outside of delete is for debugging. If an in-progress backup is already accumulating errors, a user may want to cancel the backup (since they need to fix something), but they may not want to delete it right away, as they'll lose the ability to inspect the CRs for the failed backup.
Since cancel will need to be implemented to allow deleting running backups anyway, and the cancel functionality will be largely separated from delete, I don't see any reason not to expose it to the user. I know we've had requests for cancellation in the past. If we don't expose it now, we'll continue to get requests for it. I don't see any downside here. We have the functionality anyway, why not let the user cancel? It seems to be an expected operation.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It seems like we're balancing two important, competing user needs:
- The ability to cancel a backup but keep the CR for post-mortem debugging
- The need to avoid orphaned data and unnecessary costs in object storage from the incomplete backup
What if we separate the lifecycle of the backup CR from the lifecycle of its data in object storage? A possible flow could be:
- A user initiates a cancellation by setting
spec.cancel: true. - The backup controller transitions the backup to a
Cancellingphase and gracefully stops all ongoing operations. - Once all operations are terminated, the controller immediately triggers a cleanup of the associated data in the backup storage location. This could be handled by a dedicated cleanup controller or by extending the
backupDeletionControllerto watch for backups in theCancelledphase. - After the storage cleanup is complete, the backup CR itself is moved to the final
Cancelledphase. The CR remains in the cluster for inspection, but its associated storage artifacts are gone. We could even add a status condition likeBackupDataDeletedto make this explicit.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
we should keep the backup logs
Agree, we should keep the backup associated data generated by Velero itself (non backup data). Their lifecycle should be the same as the the backup CRs in the cluster.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
To conclude:
- For backup data, they should be deleted as the process of Cancel operation. In another word, once the backup is set as Cancelled, the backup data should have been deleted. We need a new functionality to make it, the existing TTL should not and can not be used.
- For backup associated data, they should be deleted along with the backup CRs. So the existing TTL mechanism could cover it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So we need a phase marking cancel is starting but not done deleting data? that way we end up with Cancelled=deleted.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@kaovilai That's the "cancelling" phase. i.e. backup controller marks backup "Cancelling", exits, then cancel controller does its thing and marks it "Cancelled"
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Documented the phases the backup_cancellation_controller would go through
So starts at BackupPhaseCancelling
When async cancels and deletion is done, sets to terminal phase BackupPhaseCancelled
design/backup_cancellation.md
Outdated
|
|
||
| **Future Work**: | ||
|
|
||
| - Replacing logic that blocks deletion of in-progress backups with the cancellation flow, followed by usual deletion for a terminal phase (Cancelled) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you expand on this further?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Backup deletion controller currently just bails if you try to delete an InProgress backup. The goal here is if a user creates a DeleteBackupRequest for a running backup, the backup deletion controller would make the cancel call on the backup and requeue with backoff. Once the backup has moved to the terminal state (based on all of the above here), then the deletion controller can perform its usual delete action.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This provides 2 ways of using cancellation, depending on what the user needs:
- cancel and immediate deletion: No need to set the spec field manually, just do the normal delete operation -- i.e. running
velero delete backupwhich creates a DeleteBackupRequest, which then cancels the backup, and then deletes once cancellation has happened - Cancel without immediate delete. User just sets the spec field to cancel
Use case for 1) is "I didn't mean to back up that namespace with 1000 big volumes in it"
Use case for 2) is "I'm seeing lots of errors, there's no point in letting this go any further, but I want to examine the errors before deletion" -- basically the same use case as today's "wait for the backup to fail and then check out the errors, etc."
design/backup_cancellation.md
Outdated
|
|
||
| #### BackupItemAction Operations | ||
| For operations with BackupItemAction v2 implementations (e.g., CSI PVC actions): | ||
| 1. Controller calls `bia.Cancel(operationID, backup)` |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
bia.Cancel may return an error which indicates the BIA doesn't support cancel, in this case, we still not some way to wait for its completion.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We do call this in a best-effort way when the ops timeout within backup_operations_controller for timeouts
velero/pkg/controller/backup_operations_controller.go
Lines 363 to 364 in 99f12b8
| if operation.Status.Created.Time.Add(backup.Spec.ItemOperationTimeout.Duration).Before(time.Now()) { | |
| _ = bia.Cancel(operation.Spec.OperationID, backup) |
Could fire-and-forget
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, in this case, we need to rely on the existing logics to wait the action complete or timeout. So we need to cover the case in the design.
Signed-off-by: Joseph <jvaikath@redhat.com>
0e9fa67 to
359b293
Compare
Signed-off-by: Joseph <jvaikath@redhat.com>
354c782 to
4afa1ae
Compare
Signed-off-by: Joseph <jvaikath@redhat.com>
Signed-off-by: Joseph <jvaikath@redhat.com>
Thank you for contributing to Velero!
Please add a summary of your change
Design document for backup cancellation feature
Does your change fix a particular issue?
Works towards adding cancellation support, would work towards
#9189
#2098
Please indicate you've done the following:
make new-changelog) or comment/kind changelog-not-requiredon this PR.site/content/docs/main.