@snowmead snowmead commented Dec 4, 2025

Summary

Adds a versioned migration framework for managing RocksDB column family schema changes. It enables safe removal of deprecated column families when upgrading databases created by older node installations.

Design

RocksDB requires all existing column families to be specified when opening a database. The migration system discovers existing CFs, opens the database, then drops deprecated ones via versioned migrations.

How it works:

  1. Discover existing column families via DB::list_cf()
  2. Open database with existing + current CFs
  3. Run pending migrations to drop deprecated CFs
  4. Update schema version (tracked in __schema_version__ CF)

Safety features:

  • Downgrade prevention (rejects newer schema versions)
  • Validates migration sequence (no gaps/duplicates)
  • Cleanup pass on startup handles partial failures
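The sequence validation and downgrade prevention described above can be sketched in plain Rust. This is an illustrative sketch, not the PR's actual implementation; the function and error names here are assumptions chosen to mirror the description (the PR's real entry points are `MigrationRunner::validate_order()` and `run_pending()`):

```rust
// Illustrative sketch of the migration-order and downgrade checks;
// names and bodies are assumptions, not the PR's actual code.

#[derive(Debug, PartialEq)]
enum MigrationError {
    DuplicateVersion(u32),
    GapInSequence { expected: u32, found: u32 },
    Downgrade { db_version: u32, code_version: u32 },
}

/// Versions must start at 1 and increase by exactly 1 (no gaps, no duplicates).
fn validate_order(versions: &[u32]) -> Result<(), MigrationError> {
    let mut expected = 1;
    for &v in versions {
        if v < expected {
            return Err(MigrationError::DuplicateVersion(v));
        }
        if v > expected {
            return Err(MigrationError::GapInSequence { expected, found: v });
        }
        expected += 1;
    }
    Ok(())
}

/// Reject opening a database written by a newer schema than the code supports.
fn check_downgrade(db_version: u32, latest_known: u32) -> Result<(), MigrationError> {
    if db_version > latest_known {
        return Err(MigrationError::Downgrade {
            db_version,
            code_version: latest_known,
        });
    }
    Ok(())
}
```

Running validation before any migration executes means a misconfigured migration list fails fast on startup rather than partway through dropping column families.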

New Module Structure

client/common/src/rocksdb/
├── mod.rs          # Module exports and documentation
├── database.rs     # Database opening functions (open_db, open_db_with_migrations)
├── migrations.rs   # Migration trait, MigrationRunner, schema version tracking
└── tests.rs        # Comprehensive test coverage for migration scenarios

Notable Changes

  • New Migration trait and MigrationRunner in client/common/src/rocksdb/
  • TypedRocksDB::open_with_migrations() for stores needing migrations
  • TypedRocksDB::open() for stores without migrations
  • Updated BlockchainServiceStateStore, DownloadStateStore, BspPeerManagerStore

Test Coverage

Tests use temporary RocksDB instances to verify migration functionality including schema version tracking, CF dropping, and error handling.

Adding New Migrations

// 1. Implement the Migration trait
pub struct MyStoreV2Migration;

impl Migration for MyStoreV2Migration {
    fn version(&self) -> u32 { 2 }
    fn deprecated_column_families(&self) -> &'static [&'static str] {
        &["old_cf_to_remove"]
    }
    fn description(&self) -> &'static str { "Remove old_cf_to_remove" }
}

// 2. Register in your store's migrations function
pub fn my_store_migrations() -> Vec<Box<dyn Migration>> {
    vec![Box::new(MyStoreV1Migration), Box::new(MyStoreV2Migration)]
}

// 3. Open with migrations
let db = TypedRocksDB::open_with_migrations(&path, &CURRENT_CFS, my_store_migrations())?;

⚠️ Breaking Changes ⚠️

  • Short description
    A new v1 migration will be applied to the blockchain service state store to drop deprecated MSP respond storage request column families:
    • pending_msp_respond_storage_request
    • pending_msp_respond_storage_request_left_index
    • pending_msp_respond_storage_request_right_index
  • Who is affected
    • MSP node runners.
  • Suggested code changes
    No code changes required. The migration runs automatically on startup.

snowmead and others added 30 commits November 18, 2025 15:19
- Refactored the storage request submission process in `move-bucket.test.ts` to use a batch processing helper
…Postgres DB (#563)

* fix: 🩹 Remove and update old comments

* feat: 🚧 Add `blockchain-service-db` crate and postgres schema

* feat: 🚧 Wire pending tx DB updating to `send_extrinsic`

* feat: 🚧 Add CLI param for db URL, initialise DB on BS startup and update status with watcher

* fix: 🐛 Use i64 for nonce

* feat: ✨ Add pending tx postgres to integration test suite

* fix: 🐛 Wire CLI pending db param to blockchain service initialisation

* test: ✅ Fix passing CLI pending db param in test suites

* feat: ✨ Clear pending txs from DB in finality

* docs: 📝 Document functions in `store.rs` for pending DB

* feat: ✨ Log when a pending tx has a state update but for a different tx hash

* fix: 🐛 Initialise Blockchain Service last processed blocks with genesis

* test: ✅ Fix tests using old indexer db container name

* test: ✅ Add back backend container initialisation

* fix: 🐛 Remove duplicate indexer nodes

* test: ✅ Fix mistaken name change

* test: ✨ Add new pending DB testing utilities

* fix: 🗑️ Remove deprecated `createApiObject`

* test: ✅ Add persistent pending Tx DB integration tests

* feat: 🚧 Add `load_account_with_states` query to enable re-subscription at startup

* feat: ✨ Re-watch transactions pending after restarting MSP

* test: ✅ Add test for not re-watching extrinsic with nonce below on-chain nonce

* feat: ✨ Add `watched` boolean field to pending db

* feat: ✨ Persist gap filling remark transactions

* fix: ✅ Fix race condition where container wasn't fully paused

* feat: ✨ Add pendingDbUrl option to `addBspContainer` as well

* refactor: 🚚 Rename `insert_sent` to `upsert_sent`

* feat: 🔥 Remove unused `load_active` function from pending db interface

* refactor: ♻️ Use `Vec<String>` directly in `load_resubscribe_rows` params

* feat: 🩹 Remove usage of `sent` state in pending DB

* refactor: ♻️ Use `TransactionStatus` in db interface functions

* fix: 🐛 Track transaction in `transaction_manager` even with no `call_scale`

* refactor: ♻️ Use constants for container names of DBs and backend

* test: ✅ Add check after sealing block of pending tx not updating

* feat: ✨ Add message in remark fake tx

* fix: ✅ Use new constant instead of hardcoded postgres container name

* fix: 🐛 Resubscribe to pending txs in initial sync handling instead of startup

* fix: 🐛 Set all txs pending to `watched=false` then only those we re-watch back to `true`

* Revert "fix: 🐛 Resubscribe to pending txs in initial sync handling istead of startup"

This reverts commit df6af95.

* fix: 🐛 Try to watch in_block pending transactions too

* fix: 🐛 Don't filter by on-chain nonce when re-subscribing to pending txs

* test: ✅ Improve test error logging

* feat: ✨ Add custom error to submit_and_watch_extrinsic

* fix: 🩹 Log and skip when error in re-subscribing is old nonce

* fix: ✅ Consider node race condition in test
- Consolidate capacity management with single increase per batch
- Add batch trimming to fit within capacity limits
- Implement batch rejection with single extrinsic for efficiency
- Extract helper methods for file metadata construction
- Improve logging for batch processing visibility
- Clean up imports and remove unused file_key_cleanup field
- retry block sealing for checking msp acceptance
Add `msp: Option<(ProviderIdFor<T>, bool)>` field to the NewStorageRequest
event to propagate MSP assignment information through events. This allows
MSP clients to determine if a storage request was created for them and
whether they have already accepted it, without needing to query storage
request metadata from the chain.

Prevents the MSP from re-accepting storage requests
The MSP could queue the same file key multiple times for acceptance,
causing MspAlreadyConfirmed errors when batches processed duplicate
entries. This occurred when multiple code paths (BatchProcessStorageRequests
and RemoteUploadRequest handlers) both called on_file_complete for the
same file.

Add a persistent HashSet (CFHashSetAPI) alongside the existing deque to
track pending file keys. Before queueing, check if the file key exists
in the set - skip if present, insert and queue if not. When popping from
the deque for batch processing, remove the file key from the set.
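The dedup logic above can be sketched with in-memory std collections standing in for the persistent deque and the `CFHashSetAPI` set (a sketch under those assumptions, not the PR's actual persistent implementation):

```rust
use std::collections::{HashSet, VecDeque};

/// Sketch of the pending-file-key dedup: a queue for ordering plus a set
/// for membership checks. In the PR both structures are persistent; here
/// they are in-memory std collections for illustration.
struct PendingFileKeys {
    queue: VecDeque<[u8; 32]>,
    pending: HashSet<[u8; 32]>,
}

impl PendingFileKeys {
    fn new() -> Self {
        Self {
            queue: VecDeque::new(),
            pending: HashSet::new(),
        }
    }

    /// Queue a file key unless it is already pending; returns true if queued.
    fn try_queue(&mut self, key: [u8; 32]) -> bool {
        if self.pending.contains(&key) {
            return false; // already queued: skip to avoid duplicate confirmation
        }
        self.pending.insert(key);
        self.queue.push_back(key);
        true
    }

    /// Pop the next key for batch processing, removing it from the set so
    /// the key can be re-queued later if needed.
    fn pop(&mut self) -> Option<[u8; 32]> {
        let key = self.queue.pop_front()?;
        self.pending.remove(&key);
        Some(key)
    }
}
```

The set answers "is this key already queued?" in O(1), while the deque preserves the processing order that a set alone cannot.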
…status tracking

Replace batch-capacity-centric approach with an event-driven per-file processing model:

- Add FileKeyStatus enum (Processing, Accepted, Rejected, Failed, Abandoned) to track
  file key processing state across concurrent event handlers using Arc<RwLock<HashMap>>

- BatchProcessStorageRequests now emits NewStorageRequest events for each pending request
  via new PreprocessStorageRequest command, skipping already-processed/accepted/rejected
  keys and automatically retrying Failed ones

- NewStorageRequest handler performs per-file capacity management, storage creation, and
  P2P upload registration. If file already complete, immediately queues accept response

- ProcessMspRespondStoringRequest uses type-safe pallet_proofs_dealer::Error decoding to
  distinguish proof errors (mark Failed for retry) from non-proof errors (mark Abandoned)

- Move pending_respond_storage_requests queue from RocksDB to in-memory MspHandler struct,
  removing 4 column families (14 -> 10) since this state doesn't need persistence

- Remove batch_reject_storage_requests, ensure_batch_capacity, and trim_batch_to_fit_capacity
  methods as capacity is now managed per-file in NewStorageRequest handler
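The skip/retry decision described in the bullets above can be sketched as follows. The `FileKeyStatus` variants come from the PR summary; the `should_process` helper and its exact transition rules are illustrative assumptions:

```rust
use std::collections::HashMap;
use std::sync::{Arc, RwLock};

/// Variant names follow the PR summary; the logic below is a sketch.
#[derive(Clone, Copy, PartialEq, Debug)]
enum FileKeyStatus {
    Processing,
    Accepted,
    Rejected,
    Failed,
    Abandoned,
}

/// Shared across concurrent event handlers, as in the PR description.
type StatusMap = Arc<RwLock<HashMap<[u8; 32], FileKeyStatus>>>;

/// Decide whether BatchProcessStorageRequests should (re-)emit a
/// NewStorageRequest for this key: process unseen keys, retry Failed
/// ones, and skip keys that are in flight or in a terminal state.
fn should_process(statuses: &StatusMap, key: &[u8; 32]) -> bool {
    let map = statuses.read().expect("status lock poisoned");
    match map.get(key) {
        None | Some(FileKeyStatus::Failed) => true,
        Some(FileKeyStatus::Processing)
        | Some(FileKeyStatus::Accepted)
        | Some(FileKeyStatus::Rejected)
        | Some(FileKeyStatus::Abandoned) => false,
    }
}
```

Keeping this map in memory (rather than RocksDB) is what allows the PR to drop the four persistence column families for this state.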
…grations

- Add downgrade prevention in `MigrationRunner::run_pending()` to reject
  databases created with newer schema versions than the current code supports
- Add `MigrationRunner::validate_migration_order()` to check for duplicate
  versions, proper sequencing starting from 1, and gaps
- Add `Clone` derive to `MigrationDescriptor` for flexibility
- Add TypedRocksDB migration integration tests covering fresh database
  creation, deprecated CF handling, downgrade prevention, and context usage
- Move migration tests from mod.rs to dedicated tests.rs file
@snowmead snowmead added D4-nicetohaveaudit⚠️ PR contains trivial changes to logic that should be properly reviewed. rocksdb-migrations Changes include migrations for RocksDB and removed D3-trivial👶 PR contains trivial changes that do not require an audit labels Dec 15, 2025
Make the Migration trait object-safe by using instance methods instead of
static methods and const VERSION. This eliminates the need for the
MigrationDescriptor type-erasing wrapper, simplifying the codebase.

Changes:
- Migration trait now uses fn version(&self), fn deprecated_column_families(&self),
  and fn description(&self) instead of const/static methods
- MigrationRunner::all_migrations() returns Vec<Box<dyn Migration>>
- Remove MigrationDescriptor struct entirely
- Update V1Migration and all tests to use the new trait signature
- Remove unused `all_deprecated_cfs` and `deprecated_cfs_up_to_version`
  functions that were only used in tests
- Add `validate_migration_order()` call at start of `run_pending()` to
  ensure migrations are validated before execution
- Fix `open_db_with_migrations` to use CURRENT file detection instead of
  swallowing all list_cf errors - properly distinguishes between new
  databases and corrupted existing databases
- Add test for RocksDB error propagation when CURRENT file exists but
  database is corrupted
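The new-vs-corrupted distinction rests on a RocksDB invariant: an initialized database directory always contains a `CURRENT` file. A minimal sketch of the check (the helper name and the commented match arm are illustrative, not the PR's exact code):

```rust
use std::path::Path;

/// A RocksDB directory always contains a CURRENT file pointing at the
/// active MANIFEST, so its absence means no database exists yet.
/// Helper name is an assumption for illustration.
fn is_existing_rocksdb(db_path: &Path) -> bool {
    db_path.join("CURRENT").exists()
}

// In open_db_with_migrations the error handling then looks roughly like:
//
//   match DB::list_cf(&opts, &path) {
//       Ok(cfs) => cfs,
//       Err(_) if !is_existing_rocksdb(&path) => vec![], // fresh database
//       Err(e) => return Err(e.into()),                  // corrupted: propagate
//   }
```

Swallowing `list_cf` errors only when no `CURRENT` file exists means corruption in an existing database surfaces as an error instead of being silently treated as a fresh install.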
… migrations

Add safety guardrails and improve the robustness of the RocksDB column
family migration system:

Guardrails:
- Reject `current_schema_cfs` containing deprecated CF names (permanently
  reserved after deprecation to prevent data confusion)
- Reject `current_schema_cfs` containing reserved `__schema_version__` CF
- Add `InvalidColumnFamilyConfig` error variant with actionable messages

Resilient cleanup:
- Refactor cleanup pass to only process migrations <= current schema version
- Always drop straggler deprecated CFs from already-applied migrations on
  startup (handles partial failures from crashes)
- Document why idempotent cleanup is used instead of transactional drops
  (RocksDB does not support batching multiple `drop_cf()` atomically)
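Because RocksDB cannot drop several column families atomically, a crash mid-migration can leave some deprecated CFs dropped and others still present. The idempotent cleanup pass re-derives what should be gone and drops only the survivors. A sketch under those assumptions (function and parameter names are illustrative):

```rust
use std::collections::HashSet;

/// Given the deprecated CF names of every migration at or below the
/// recorded schema version, return those that still exist in the
/// database and therefore need dropping. Running this on every startup
/// is safe: once a straggler is gone, the function stops returning it.
fn stragglers_to_drop<'a>(
    applied_deprecated_cfs: &[&'a str],
    existing_cfs: &HashSet<&str>,
) -> Vec<&'a str> {
    applied_deprecated_cfs
        .iter()
        .copied()
        .filter(|cf| existing_cfs.contains(cf)) // drop only what survived a crash
        .collect()
}
```

Idempotence substitutes for the atomicity RocksDB lacks: repeating the pass converges on the same end state no matter where a previous run stopped.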

New helper:
- Add `MigrationRunner::all_deprecated_column_families()` to aggregate
  deprecated CF names across all registered migrations

Tests:
- Add `cf_guardrail_tests` module for validation behavior
- Add `cleanup_resilience_tests` module for straggler cleanup scenarios
Remove 11 redundant tests that duplicated coverage already provided
by other tests in the migration test suite:

- 2 downgrade prevention tests (duplicated `prevents_downgrade`)
- 3 store simulation tests (covered by other migration tests)
- 2 cleanup resilience tests (covered by existing tests)
- 4 TypedRocksDB migration tests (duplicated tests.rs tests)

Kept `used_with_context` test as it uniquely validates TypedRocksDB
with TypedDbContext integration.
@snowmead snowmead added breaking Needs to be mentioned in breaking changes D5-needsaudit👮 PR contains changes to logic that should be properly reviewed and externally audited and removed not-breaking Does not need to be mentioned in breaking changes D4-nicetohaveaudit⚠️ PR contains trivial changes to logic that should be properly reviewed. labels Dec 15, 2025
snowmead and others added 8 commits December 19, 2025 08:35
- Move V1 migration from shc_common to blockchain-service crate
- Change MigrationRunner from static methods to instance-based
- Add TypedRocksDB::open() for stores without migrations
- Add TypedRocksDB::open_with_migrations() for stores with migrations
- Auto-sort migrations by version in MigrationRunner constructors
- Write version 0 explicitly in open_db() for consistency
- Use distinct test-only CF names to decouple tests from production
- Keep &'static str lifetime in all_deprecated_column_families()

Each store now defines and owns its own migrations, following the
principle of locality. Stores without migrations use open() which
writes version 0, ensuring all databases have consistent schema
version tracking and clean upgrade paths.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <[email protected]>
- Rename `typed_rocks_db_migration_tests` to `typed_rocks_db_open_tests`
  to accurately reflect test scope (tests open methods, not migrations)
- Add `open_with_migrations_drops_deprecated_cfs_and_works_with_context`
  test that verifies TypedRocksDB::open_with_migrations() correctly
  drops deprecated column families and works with TypedDbContext
- Extract shared TestDataCf definition to module level

…ementations

- Remove `validate_custom_migrations` helper and update validation_tests
  to use `MigrationRunner::validate_order()` directly
- Remove `run_migrations_with_list` helper and update multi_version_tests
  to use `MigrationRunner::run_pending()` directly

This ensures tests verify the actual implementation behavior rather than
duplicated logic that could diverge from the real code.

# Conflicts:
#	client/blockchain-service/src/state.rs
@snowmead snowmead marked this pull request as draft December 19, 2025 18:14
snowmead and others added 2 commits December 19, 2025 13:23
…ard compatibility

Include the deprecated `last_processed_block_number` column family in
CURRENT_COLUMN_FAMILIES to maintain backward compatibility with existing
RocksDB databases until the V2 migration is added in a separate PR.

@snowmead snowmead marked this pull request as ready for review December 19, 2025 18:37
@snowmead snowmead changed the title feat: rocksdb migrations feat: RocksDB Column Family Migration Framework Dec 19, 2025
@snowmead snowmead marked this pull request as draft December 19, 2025 19:28
Split the migrations module into a more logical rocksdb module structure:
- rocksdb/database.rs: Database opening functions and DatabaseError
- rocksdb/migrations.rs: Migration trait, runner, and error types
- rocksdb/mod.rs: Public exports

This separates concerns by domain - database opening logic is now in
database.rs while migration-specific code stays in migrations.rs.
@snowmead snowmead marked this pull request as ready for review December 19, 2025 20:23
Labels

B5-clientnoteworthy Changes should be mentioned client-related release notes breaking Needs to be mentioned in breaking changes D5-needsaudit👮 PR contains changes to logic that should be properly reviewed and externally audited rocksdb-migrations Changes include migrations for RocksDB
