Skip to content

Conversation

@kaovilai
Copy link
Collaborator

@kaovilai kaovilai commented Sep 9, 2025

Summary

This PR fixes issue #8279 where BackupRepositories become stale when BackupStorageLocation (BSL) configuration is updated or created while Velero is not running.

Problem

When a BSL is updated/created while Velero is not running, existing BackupRepositories that reference the BSL become stale and continue using the old configuration. This prevents successful backups/restores until the repositories are manually deleted.

Solution

The implementation validates BackupRepository configurations against their associated BackupStorageLocation on controller startup. If BSL configuration (bucket, prefix, CACert, or config) has changed while Velero was not running, the affected repositories are invalidated and will be re-established.

Implementation Details

Core Changes:

  1. Startup Validation (validateBackupRepositoriesOnStartup):

    • Runs once on first reconciliation after Velero starts
    • Compares stored BSL configuration in repository annotations with current BSL configuration
    • Invalidates repositories where configuration has changed
  2. Configuration Tracking:

    • Stores BSL configuration (bucket, prefix, CACert hash, config) as annotations on BackupRepository resources
    • Uses annotations: velero.io/bsl-bucket, velero.io/bsl-prefix, velero.io/bsl-cacert-hash, velero.io/bsl-config
  3. Shared Comparison Logic (compareBSLConfigs):

    • Centralized function to compare BSL configurations
    • Used by both startup validation and runtime update detection
    • Eliminates code duplication between needInvalidBackupRepoOnStartup and needInvalidBackupRepo
  4. Thread Safety:

    • Uses mutex to ensure startup validation runs only once even with concurrent reconciliations
    • Validation runs asynchronously to avoid blocking reconciliation

Testing

Unit Tests Added:

  • TestValidateBackupRepositoriesOnStartup: Tests the startup validation logic with various scenarios
  • TestNeedInvalidBackupRepoOnStartup: Tests the comparison logic for startup validation

E2E Test Added:

  • New test file: test/e2e/bsl-mgmt/startup_validation.go
  • Test scenario simulates BSL configuration change while Velero is stopped:
    1. Creates a backup to establish a BackupRepository
    2. Scales down Velero deployment (simulating shutdown)
    3. Modifies BSL configuration (changes prefix)
    4. Scales up Velero deployment (simulating startup)
    5. Verifies repository is invalidated with correct error message
    6. Restores original BSL configuration
    7. Verifies repository recovers to Ready state

Files Changed:

  • pkg/controller/backup_repository_controller.go: Core implementation
  • pkg/controller/backup_repository_controller_test.go: Unit tests
  • pkg/apis/velero/v1/labels_annotations.go: BSL annotation constants
  • test/e2e/bsl-mgmt/startup_validation.go: E2E test
  • test/e2e/e2e_suite_test.go: Test registration
  • .github/workflows/e2e-test-kind.yaml: CI test matrix
  • changelogs/unreleased/8279-kaovilai: Changelog entry

Fixes

Fixes #8279

Test plan

  • Unit tests added and passing
  • E2E test added for startup validation scenario
  • Verified repositories are invalidated on startup when configs have changed
  • Verified repositories remain valid when configs haven't changed
  • All existing tests pass

Note

Responses generated with Claude

Scenario 1: BSL Changes While Velero IS Running (Runtime)

BSL Updated → Watcher triggered → invalidateBackupReposForBSL()
→ Patch repo to NotReady with msgBSLChanged
→ Reconcile request queued → checkNotReadyRepo() → DELETE

Code flow:

  1. Line 125-147: Controller watches BackupStorageLocation for updates
  2. Line 136-138: needInvalidBackupRepo predicate checks if bucket/prefix/config changed
  3. Line 150-175: invalidateBackupReposForBSL() patches repos to NotReady with msgBSLChanged
  4. Line 171: Returns reconcile requests for the affected repos
  5. Lines 786-793: checkNotReadyRepo() detects the message and deletes the repo

Scenario 2: BSL Changes While Velero is NOT Running (Startup), this PR.

Velero starts → validateBackupRepositoriesOnStartup()
→ Compare stored annotations with current BSL config
→ Patch repo to NotReady with msgBSLChangedOnStartup
→ Reconcile triggered → checkNotReadyRepo() → DELETE

Code flow:

  1. Lines 177-237: validateBackupRepositoriesOnStartup() runs on controller start
  2. Line 234: Patches mismatched repos to NotReady with msgBSLChangedOnStartup
  3. Lines 786-793: checkNotReadyRepo() detects the message and deletes the repo

Summary

Scenario Detection Message Final Action
Runtime (Velero running) BSL watcher + predicate msgBSLChanged DELETE
Startup (Velero was stopped) Startup validation msgBSLChangedOnStartup DELETE

@codecov
Copy link

codecov bot commented Sep 9, 2025

Codecov Report

❌ Patch coverage is 73.37278% with 45 lines in your changes missing coverage. Please review.
✅ Project coverage is 60.64%. Comparing base (327ea3e) to head (5a362b7).
⚠️ Report is 2 commits behind head on main.

Files with missing lines Patch % Lines
pkg/controller/backup_repository_controller.go 73.37% 33 Missing and 12 partials ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main    #9236      +/-   ##
==========================================
+ Coverage   60.58%   60.64%   +0.05%     
==========================================
  Files         386      386              
  Lines       36331    36480     +149     
==========================================
+ Hits        22012    22123     +111     
- Misses      12738    12764      +26     
- Partials     1581     1593      +12     

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.
  • 📦 JS Bundle Analysis: Save yourself from yourself by tracking and limiting bundle sizes in JS merges.

Copy link
Collaborator Author

@kaovilai kaovilai left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Community Meeting: check existing invalidation logic if it can be reused instead of adding more code.

@kaovilai kaovilai force-pushed the 8279 branch 3 times, most recently from a6b170b to 5a40e17 Compare September 10, 2025 01:22
@kaovilai kaovilai force-pushed the 8279 branch 7 times, most recently from 196bc7b to 4ad9578 Compare September 10, 2025 20:55
@kaovilai kaovilai force-pushed the 8279 branch 3 times, most recently from 06bc0c6 to 64a8e97 Compare November 5, 2025 14:43
kaovilai and others added 5 commits December 23, 2025 11:43
…elero is not running

This change validates BackupRepository configurations against their associated
BackupStorageLocation on controller startup. If BSL configuration (bucket, prefix,
CACert, or config) has changed while Velero was not running, the affected repositories
are invalidated and will be re-established.

Key changes:
- Add startup validation that checks all BackupRepositories against current BSL configs
- Store BSL configuration in BackupRepository annotations for comparison on startup
- Add shared compareBSLConfigs function to eliminate code duplication
- Move BSL annotation constants to labels_annotations.go for consistency
- Add comprehensive test coverage for startup validation logic

Fixes vmware-tanzu#8279

🤖 Generated with [Claude Code](https://claude.ai/code)

Co-Authored-By: Claude <[email protected]>
Signed-off-by: Tiger Kaovilai <[email protected]>
…elero is not running

This change validates BackupRepository configurations against their associated
BackupStorageLocation on controller startup. If BSL configuration (bucket, prefix,
CACert, or config) has changed while Velero was not running, the affected repositories
are invalidated and will be re-established.

Key changes:
- Add startup validation that checks all BackupRepositories against current BSL configs
- Store BSL configuration in BackupRepository annotations for comparison on startup
- Add shared compareBSLConfigs function to eliminate code duplication
- Move BSL annotation constants to labels_annotations.go for consistency
- Add comprehensive test coverage for startup validation logic

Fixes vmware-tanzu#8279

🤖 Generated with [Claude Code](https://claude.ai/code)

Co-Authored-By: Claude <[email protected]>
Signed-off-by: Tiger Kaovilai <[email protected]>
This commit adds an E2E test to verify that backup repositories are
properly validated against BSL configuration changes when Velero
restarts. The test simulates a scenario where BSL configuration changes
while Velero is not running and verifies that repositories are
invalidated on startup with the correct error message.

Test scenario:
1. Creates a backup to establish a BackupRepository
2. Scales down Velero deployment (simulating shutdown)
3. Modifies BSL configuration (changes prefix)
4. Scales up Velero deployment (simulating startup)
5. Verifies repository is invalidated with correct message
6. Restores original BSL configuration
7. Verifies repository recovers to Ready state

Changes:
- Added new E2E test file: test/e2e/bsl-mgmt/startup_validation.go
- Registered test in test/e2e/e2e_suite_test.go
- Added test label to GitHub workflow matrix for CI execution

🤖 Generated with [Claude Code](https://claude.ai/code)

Co-Authored-By: Claude <[email protected]>
Signed-off-by: Tiger Kaovilai <[email protected]>
Use pod status checks instead of fixed delays

Signed-off-by: Tiger Kaovilai <[email protected]>
- Introduced error messages for repository invalidation due to BSL changes.
- Updated logic to prevent reconnection of repositories invalidated by BSL changes during startup and runtime.
- Modified DeleteOldJobs function to include namespace parameter for better job management.
- Added tests to verify behavior of repositories invalidated by BSL changes.

Signed-off-by: Tiger Kaovilai <[email protected]>
kaovilai and others added 3 commits December 23, 2025 20:55
… allow new repository creation

Signed-off-by: Tiger Kaovilai <[email protected]>
When BSL configuration changes at runtime, invalidateBackupReposForBSL()
patches the BackupRepository status to NotReady and returns a reconcile
request. However, the reconcile could run before the informer cache was
updated, causing it to read the old Ready state. This resulted in a
5-minute delay until the periodic sync triggered correct NotReady handling.

The fix adds BSLLastInvalidatedAnnotation along with the status change.
This ensures:
1. The SpecChangePredicate passes the event (annotations changed)
2. The informer cache is properly updated before reconciliation
3. Immediate recovery occurs (< 1 second instead of 5 minutes)

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

backupRepository can become stale if velero deployment is not running to observe bsl update/create

1 participant