Fix BackupRepositories becoming stale when BSL config changes while Velero is not running #9236
Codecov Report
❌ Patch coverage is
Additional details and impacted files

@@           Coverage Diff           @@
##             main    #9236   +/-   ##
=======================================
+ Coverage   60.58%   60.64%   +0.05%
=======================================
  Files         386      386
  Lines       36331    36480    +149
=======================================
+ Hits        22012    22123    +111
- Misses      12738    12764     +26
- Partials     1581     1593     +12
kaovilai
left a comment
Community meeting: check whether the existing invalidation logic can be reused instead of adding more code.
a6b170b to 5a40e17
196bc7b to 4ad9578
06bc0c6 to 64a8e97
…elero is not running

This change validates BackupRepository configurations against their associated BackupStorageLocation on controller startup. If BSL configuration (bucket, prefix, CACert, or config) has changed while Velero was not running, the affected repositories are invalidated and will be re-established.

Key changes:
- Add startup validation that checks all BackupRepositories against current BSL configs
- Store BSL configuration in BackupRepository annotations for comparison on startup
- Add shared compareBSLConfigs function to eliminate code duplication
- Move BSL annotation constants to labels_annotations.go for consistency
- Add comprehensive test coverage for startup validation logic

Fixes vmware-tanzu#8279

🤖 Generated with [Claude Code](https://claude.ai/code)

Co-Authored-By: Claude <[email protected]>
Signed-off-by: Tiger Kaovilai <[email protected]>
This commit adds an E2E test to verify that backup repositories are properly validated against BSL configuration changes when Velero restarts. The test simulates a scenario where BSL configuration changes while Velero is not running and verifies that repositories are invalidated on startup with the correct error message.

Test scenario:
1. Creates a backup to establish a BackupRepository
2. Scales down Velero deployment (simulating shutdown)
3. Modifies BSL configuration (changes prefix)
4. Scales up Velero deployment (simulating startup)
5. Verifies repository is invalidated with correct message
6. Restores original BSL configuration
7. Verifies repository recovers to Ready state

Changes:
- Added new E2E test file: test/e2e/bsl-mgmt/startup_validation.go
- Registered test in test/e2e/e2e_suite_test.go
- Added test label to GitHub workflow matrix for CI execution

🤖 Generated with [Claude Code](https://claude.ai/code)

Co-Authored-By: Claude <[email protected]>
Signed-off-by: Tiger Kaovilai <[email protected]>
Use pod status checks instead of fixed delays Signed-off-by: Tiger Kaovilai <[email protected]>
- Introduced error messages for repository invalidation due to BSL changes.
- Updated logic to prevent reconnection of repositories invalidated by BSL changes during startup and runtime.
- Modified DeleteOldJobs function to include namespace parameter for better job management.
- Added tests to verify behavior of repositories invalidated by BSL changes.

Signed-off-by: Tiger Kaovilai <[email protected]>
… allow new repository creation Signed-off-by: Tiger Kaovilai <[email protected]>
… Changes Signed-off-by: Tiger Kaovilai <[email protected]>
When BSL configuration changes at runtime, invalidateBackupReposForBSL() patches the BackupRepository status to NotReady and returns a reconcile request. However, the reconcile could run before the informer cache was updated, causing it to read the old Ready state. This resulted in a 5-minute delay until the periodic sync triggered correct NotReady handling.

The fix adds BSLLastInvalidatedAnnotation along with the status change. This ensures:
1. The SpecChangePredicate passes the event (annotations changed)
2. The informer cache is properly updated before reconciliation
3. Immediate recovery occurs (< 1 second instead of 5 minutes)

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <[email protected]>
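The effect of touching an annotation alongside the status can be pictured with a simplified update filter. The Go sketch below is not Velero's SpecChangePredicate; it assumes a filter that drops status-only updates and passes spec or annotation changes, which matches the behavior the commit message describes. The `repo` type, `passesUpdate`, and the annotation key are hypothetical.

```go
package main

import (
	"fmt"
	"reflect"
)

// repo is a hypothetical, flattened stand-in for a BackupRepository object.
type repo struct {
	Annotations map[string]string
	Spec        string // stand-in for the whole spec
	Phase       string // stand-in for status.phase
}

// passesUpdate models an assumed update filter: an event passes only when
// the spec or the annotations differ; status-only changes are dropped.
func passesUpdate(oldObj, newObj repo) bool {
	return oldObj.Spec != newObj.Spec ||
		!reflect.DeepEqual(oldObj.Annotations, newObj.Annotations)
}

func main() {
	before := repo{Spec: "bsl=default", Phase: "Ready"}

	// Status-only patch: filtered out under the assumed rule.
	statusOnly := before
	statusOnly.Phase = "NotReady"
	fmt.Println("status only passes:", passesUpdate(before, statusOnly))

	// Status patch plus an annotation bump: the event passes.
	// The key below is a placeholder, not the value of BSLLastInvalidatedAnnotation.
	withAnnotation := statusOnly
	withAnnotation.Annotations = map[string]string{
		"example.com/last-invalidated": "2025-01-01T00:00:00Z",
	}
	fmt.Println("status + annotation passes:", passesUpdate(before, withAnnotation))
}
```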
Summary
This PR fixes issue #8279 where BackupRepositories become stale when BackupStorageLocation (BSL) configuration is updated or created while Velero is not running.
Problem
When a BSL is updated/created while Velero is not running, existing BackupRepositories that reference the BSL become stale and continue using the old configuration. This prevents successful backups/restores until the repositories are manually deleted.
Solution
The implementation validates BackupRepository configurations against their associated BackupStorageLocation on controller startup. If BSL configuration (bucket, prefix, CACert, or config) has changed while Velero was not running, the affected repositories are invalidated and will be re-established.
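As a rough illustration of the comparison idea, the Go sketch below records a BSL snapshot as annotations and reports which tracked values differ. It is not the PR's compareBSLConfigs: the `bslSnapshot` type, the helper names, the SHA-256 hex encoding of the CA certificate, and the JSON encoding of the config map are all assumptions. Only the four annotation keys come from this PR description.

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"encoding/json"
	"fmt"
)

// Annotation keys as listed in the PR description.
const (
	annBucket     = "velero.io/bsl-bucket"
	annPrefix     = "velero.io/bsl-prefix"
	annCACertHash = "velero.io/bsl-cacert-hash"
	annConfig     = "velero.io/bsl-config"
)

// bslSnapshot is a hypothetical, simplified view of the tracked BSL fields.
type bslSnapshot struct {
	Bucket string
	Prefix string
	CACert []byte
	Config map[string]string
}

// toAnnotations encodes a snapshot as annotation values.
// Assumptions: SHA-256 hex for the CA cert, JSON for the config map.
func toAnnotations(s bslSnapshot) (map[string]string, error) {
	cfg, err := json.Marshal(s.Config) // map keys are emitted in sorted order
	if err != nil {
		return nil, err
	}
	sum := sha256.Sum256(s.CACert)
	return map[string]string{
		annBucket:     s.Bucket,
		annPrefix:     s.Prefix,
		annCACertHash: hex.EncodeToString(sum[:]),
		annConfig:     string(cfg),
	}, nil
}

// changedKeys lists the annotation keys whose stored value no longer
// matches the current BSL snapshot.
func changedKeys(stored map[string]string, current bslSnapshot) ([]string, error) {
	want, err := toAnnotations(current)
	if err != nil {
		return nil, err
	}
	var diff []string
	for _, k := range []string{annBucket, annPrefix, annCACertHash, annConfig} {
		if stored[k] != want[k] {
			diff = append(diff, k)
		}
	}
	return diff, nil
}

func main() {
	old := bslSnapshot{Bucket: "backups", Prefix: "prod", Config: map[string]string{"region": "us-east-1"}}
	stored, err := toAnnotations(old)
	if err != nil {
		panic(err)
	}

	// Simulate a prefix change made while the controller was down.
	updated := old
	updated.Prefix = "staging"

	diff, err := changedKeys(stored, updated)
	if err != nil {
		panic(err)
	}
	fmt.Println("changed:", diff)
}
```

Storing a hash rather than the certificate itself keeps the annotation small; a non-empty result from `changedKeys` would be the signal to invalidate the repository.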
Implementation Details
Core Changes:
- Startup Validation (validateBackupRepositoriesOnStartup):
- Configuration Tracking: velero.io/bsl-bucket, velero.io/bsl-prefix, velero.io/bsl-cacert-hash, velero.io/bsl-config
- Shared Comparison Logic (compareBSLConfigs): needInvalidBackupRepoOnStartup and needInvalidBackupRepo
- Thread Safety:
Testing
Unit Tests Added:
- TestValidateBackupRepositoriesOnStartup: Tests the startup validation logic with various scenarios
- TestNeedInvalidBackupRepoOnStartup: Tests the comparison logic for startup validation
E2E Test Added:
- test/e2e/bsl-mgmt/startup_validation.go
Files Changed:
- pkg/controller/backup_repository_controller.go: Core implementation
- pkg/controller/backup_repository_controller_test.go: Unit tests
- pkg/apis/velero/v1/labels_annotations.go: BSL annotation constants
- test/e2e/bsl-mgmt/startup_validation.go: E2E test
- test/e2e/e2e_suite_test.go: Test registration
- .github/workflows/e2e-test-kind.yaml: CI test matrix
- changelogs/unreleased/8279-kaovilai: Changelog entry
Fixes
Fixes #8279
Test plan
Note
Responses generated with Claude
Scenario 1: BSL Changes While Velero IS Running (Runtime)
BSL Updated → Watcher triggered → invalidateBackupReposForBSL()
→ Patch repo to NotReady with msgBSLChanged
→ Reconcile request queued → checkNotReadyRepo() → DELETE
Code flow:
Scenario 2: BSL Changes While Velero Is NOT Running (Startup), addressed by this PR
Velero starts → validateBackupRepositoriesOnStartup()
→ Compare stored annotations with current BSL config
→ Patch repo to NotReady with msgBSLChangedOnStartup
→ Reconcile triggered → checkNotReadyRepo() → DELETE
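Both flows end in the same decision: a NotReady repository whose message marks a BSL change is deleted rather than reconnected. The Go sketch below models only that decision. It is not the controller code: the types and `nextAction` are hypothetical, and the message texts are placeholders because the actual values of msgBSLChanged and msgBSLChangedOnStartup are not shown here.

```go
package main

import "fmt"

// Placeholder message texts; the real constant values are not shown in this PR page.
const (
	msgBSLChanged          = "placeholder: BSL changed at runtime"
	msgBSLChangedOnStartup = "placeholder: BSL changed while Velero was not running"
)

// backupRepo is a hypothetical, flattened view of a BackupRepository status.
type backupRepo struct {
	Name    string
	Phase   string // "Ready" or "NotReady"
	Message string
}

type action string

const (
	actionNone      action = "none"
	actionDelete    action = "delete"
	actionReconnect action = "reconnect"
)

// nextAction sketches the branch described above: repositories invalidated
// by a BSL change are deleted so that a new one can be created, while other
// NotReady repositories are assumed to go through a reconnect attempt.
func nextAction(r backupRepo) action {
	if r.Phase != "NotReady" {
		return actionNone
	}
	switch r.Message {
	case msgBSLChanged, msgBSLChangedOnStartup:
		return actionDelete
	default:
		return actionReconnect
	}
}

func main() {
	repos := []backupRepo{
		{Name: "runtime-change", Phase: "NotReady", Message: msgBSLChanged},
		{Name: "startup-change", Phase: "NotReady", Message: msgBSLChangedOnStartup},
		{Name: "other-failure", Phase: "NotReady", Message: "connection refused"},
		{Name: "healthy", Phase: "Ready"},
	}
	for _, r := range repos {
		fmt.Printf("%s -> %s\n", r.Name, nextAction(r))
	}
}
```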
Code flow:
Summary