
feat: checkpoint Walrus RocksDB database #2020

Merged: 22 commits merged into main on May 30, 2025

Conversation

@liquid-helium (Contributor) commented Apr 25, 2025

Description

  • Added a CheckpointManager component, which provides
    • Periodic checkpointing of the main database
    • An interface for manual checkpoints
  • Introduced a node admin socket so that operators can interact directly with a running node (an illustrative sketch follows at the end of this description)
    • Added checkpoint commands to the walrus-node command-line tool

Contributes to WAL-842
Contributes to #1920
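
Purely as an illustration of the admin-socket idea above, here is what a tokio-based UNIX admin socket could roughly look like. The socket path, command name, and handling logic are hypothetical and are not the actual walrus-node protocol or code.

```rust
use tokio::io::{AsyncBufReadExt, AsyncWriteExt, BufReader};
use tokio::net::UnixListener;

// Hypothetical admin socket: accepts line-based commands from a local
// operator tool. Neither the socket path nor the "checkpoint" command is
// the actual walrus-node interface; this only illustrates the mechanism.
#[tokio::main]
async fn main() -> std::io::Result<()> {
    let listener = UnixListener::bind("/tmp/walrus-admin.sock")?;
    loop {
        let (stream, _addr) = listener.accept().await?;
        tokio::spawn(async move {
            let (read_half, mut write_half) = stream.into_split();
            let mut lines = BufReader::new(read_half).lines();
            while let Ok(Some(line)) = lines.next_line().await {
                let reply = match line.trim() {
                    // In the real node this would trigger the CheckpointManager.
                    "checkpoint" => "checkpoint requested\n",
                    _ => "unknown command\n",
                };
                if write_half.write_all(reply.as_bytes()).await.is_err() {
                    break;
                }
            }
        });
    }
}
```

An operator could then poke such a socket with something like `echo checkpoint | nc -U /tmp/walrus-admin.sock` (again, illustrative only).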

Test plan

How did you test the new or updated feature?


Release notes

Check each box that your changes affect. If none of the boxes relate to your changes, release notes aren't required.
For each box you check, include information after the relevant heading that describes the impact of your changes that
a user might notice and any actions they must take to implement updates. (Add release notes after the colon for each item.)

  • Storage node:
    • Added a CheckpointManager component, which provides
      • Periodic checkpointing of the main database
      • An interface for manual checkpoints
    • Introduced a node admin socket so that operators can interact directly with a running node
      • Added checkpoint commands to the walrus-node command-line tool
  • Aggregator:
  • Publisher:
  • CLI:


Warning: This PR modifies one of the example config files. Please consider the
following:

  • Make sure the changes are backwards compatible with the current configuration.
  • Make sure any added parameters follow the conventions of the existing parameters; in
    particular, durations should take seconds or milliseconds using the naming convention
    _secs or _millis, respectively.
  • If there are added optional parameter sections, it should be possible to specify them
    partially. A useful pattern there is to implement Default for the struct and derive
    #[serde(default)] on it; see BlobRecoveryConfig as an example (a minimal sketch follows after this list).
  • You may need to update the documentation to reflect the changes.
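
A minimal sketch of that pattern, assuming serde and serde_yaml; the struct and field names below are illustrative and not taken from the Walrus config (BlobRecoveryConfig in the codebase is the real example):

```rust
use serde::Deserialize;

// Illustrative optional config section that can be specified partially:
// missing fields fall back to the struct's Default implementation because
// of #[serde(default)]. Names and default values are hypothetical.
#[derive(Debug, Deserialize)]
#[serde(default)]
struct ExampleCheckpointConfig {
    /// Checkpoint interval in seconds (follows the `_secs` naming convention).
    interval_secs: u64,
    /// Number of checkpoints to retain.
    max_checkpoints: usize,
}

impl Default for ExampleCheckpointConfig {
    fn default() -> Self {
        Self {
            interval_secs: 3600,
            max_checkpoints: 3,
        }
    }
}

fn main() {
    // Only one field is given; the other is filled in from Default.
    let partial = "interval_secs: 600";
    let config: ExampleCheckpointConfig = serde_yaml::from_str(partial).unwrap();
    assert_eq!(config.max_checkpoints, 3);
    println!("{config:?}");
}
```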

@liquid-helium changed the title from "Backup restore" to "feat: checkpoint Walrus RocksDB database" on Apr 26, 2025
@mlegner (Collaborator) left a comment

Thanks for implementing this, @liquid-helium! I have two initial high-level comments. As I'm not too familiar with RocksDB checkpoints, nor with UNIX sockets in Rust, I'd prefer if somebody else would take a closer look at the details there.

@sadhansood (Contributor)

Thanks @liquid-helium for this change! I have one question: what happens to the checkpoints that we create periodically or on demand? Wouldn't they increase disk usage over time, and add operational burden, if they aren't copied out to some archive/cold-storage solution? One way we dealt with this in Sui is in db_checkpoint_handler.rs, where we also manage the second part of moving the local DB into a remote store once it is backed up.

@liquid-helium (Contributor, Author)

Thanks @sadhansood for raising this; that's a good example to follow if we decide to upload the backups :)
The disk usage is definitely something we need to consider, and storing checkpoints on the same storage is an anti-pattern.
In short, I'm currently thinking that operators add a local disk or some other external storage that can be accessed directly via a path, and the checkpoint is created directly in that path. Only a limited number of checkpoints are kept (currently defaulting to 3).
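
As a rough illustration of that retention scheme (not the actual Walrus implementation), a backup pass into an operator-provided path using rust-rocksdb's BackupEngine could look like the sketch below. The paths, the interval, and the retention count of 3 are placeholders taken from the discussion, and the exact API may differ between rust-rocksdb versions.

```rust
use rocksdb::backup::{BackupEngine, BackupEngineOptions};
use rocksdb::{DB, Env};
use std::time::Duration;

// Hedged sketch: back up the node database into an operator-provided path
// and keep only the three most recent backups. Paths are illustrative and
// error handling is simplified.
fn backup_once(db: &DB) -> Result<(), rocksdb::Error> {
    let env = Env::new()?;
    let opts = BackupEngineOptions::new("/mnt/external/walrus-backups")?;
    let mut engine = BackupEngine::open(&opts, &env)?;
    // Flush memtables first so the backup captures recent writes.
    engine.create_new_backup_flush(db, true)?;
    // Retain only the most recent backups (mirroring the "default to 3" above).
    engine.purge_old_backups(3)?;
    Ok(())
}

fn main() -> Result<(), rocksdb::Error> {
    let db = DB::open_default("/var/lib/walrus/db")?;
    loop {
        backup_once(&db)?;
        // Hypothetical period; the real interval would come from configuration.
        std::thread::sleep(Duration::from_secs(6 * 60 * 60));
    }
}
```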

@sadhansood (Contributor)

Makes sense to do it on a different storage locally. I think we should add documentation on how to set this up. But otherwise, the PR looks good and I will try to review in more detail next (I do have some questions on incremental backup as well as restores). It might be nice to set up a meeting to discuss this with some of the team members since it is a bigger change overall. Thanks @liquid-helium!

@mario4tier

| ... The disk usage is definitely something we need to consider, and storing checkpoints on the same storage is an anti-pattern.

In our case, creating fast and cheap hard-link checkpoints on the same storage would be preferred. A DB checkpoint would be followed by a ZFS snapshot of the whole storage (data + checkpoints).

Backing up to separate storage would then be the node operator's responsibility (ZFS send/receive in our case).

My point being: consider keeping the feature of creating/recovering from hard-link checkpoints, and develop remote backup as a separate, optional step/feature.

(It seems you are already "there"; just sharing my use case.)
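
For reference, the kind of hard-link checkpoint described here can be created with RocksDB's checkpoint API. A minimal sketch, assuming rust-rocksdb and illustrative paths (this is not part of the PR):

```rust
use rocksdb::checkpoint::Checkpoint;
use rocksdb::DB;

// Hedged sketch: a RocksDB checkpoint hard-links SST files on the same
// filesystem, so it is fast and initially cheap in space. Paths are
// illustrative; the checkpoint directory must not exist beforehand.
fn create_hard_link_checkpoint() -> Result<(), rocksdb::Error> {
    let db = DB::open_default("/var/lib/walrus/db")?;
    let checkpoint = Checkpoint::new(&db)?;
    checkpoint.create_checkpoint("/var/lib/walrus/checkpoints/2025-05-07")?;
    // A ZFS snapshot of the whole storage could then capture data + checkpoints together.
    Ok(())
}
```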

@liquid-helium (Contributor, Author) commented May 7, 2025

Hi @mario4tier, thank you for your comments—it's a great suggestion. Feel free to create a separate issue to track this and assign it to me. The current implementation focuses on scenarios where the disk is damaged, using BackupEngine for recovery. However, I agree a lightweight checkpoint using hard links could address other scenarios. I'll look into that based on priority. Thanks!
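
As a rough sketch of that recovery path (assuming rust-rocksdb's backup module; the paths below are placeholders, not the actual Walrus layout):

```rust
use rocksdb::backup::{BackupEngine, BackupEngineOptions, RestoreOptions};
use rocksdb::Env;

// Hedged sketch: restore the database from the most recent backup after a
// disk failure. Backup and database paths are illustrative.
fn restore_latest() -> Result<(), rocksdb::Error> {
    let env = Env::new()?;
    let opts = BackupEngineOptions::new("/mnt/external/walrus-backups")?;
    let mut engine = BackupEngine::open(&opts, &env)?;
    let restore_opts = RestoreOptions::default();
    // Restore both the data files and the WAL into a fresh database directory.
    engine.restore_from_latest_backup(
        "/var/lib/walrus/db",
        "/var/lib/walrus/db",
        &restore_opts,
    )?;
    Ok(())
}
```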

@mario4tier commented May 8, 2025

| Feel free to create a separate issue to track this and assign it to me.

I feel that DB checkpoint management (e.g., triggering, recovery, ...) is a common layer (for all use cases) upon which the additional remote backup features would be built.

So I was hopeful that "relatively simpler" hard-link checkpoint management could be exposed in the CLI as part of (or even before) the "more complicated" remote backup 😄

Not sure I am the right person to create another PR (in addition to #1920). I'd prefer to leave this to you, as you have more insight into how it should be implemented.

@mario4tier

| The current implementation focuses on scenarios where the disk is damaged, using BackupEngine for recovery.

Good, can't dispute the value of backups 😄

The rest is my opinion about the priority of implementing hard-link checkpoints versus remote backup:

Worst-case scenario

  • DB corruptions can have a bigger impact than drive failures. The most dramatic would be a walrus-node bug/upgrade that affects multiple operators "at once".

  • There have been multiple instances of DB corruption, likely due to walrus-node crashes/kills, but not fully understood (AFAIK).

  • Drive failures are already mitigated (to varying degrees) by many operators. It sucks when it happens, but it has a low probability of global impact.

A hard-link checkpoint is the "cheapest" protection against the "worst" scenario, the one we do not fully control/understand.

"Cheap" also means a better chance of actually being enabled by most operators in the short term (if an operator hasn't already bothered with RAID-like protection on the live network, they are less likely to invest in costly remote backups soon).

===

Just to be clear, I expect most responsible operators to want, and eventually use, remote backups.

@liquid-helium (Contributor, Author)

| DB corruptions can have a bigger impact than drive failures. ...

I agree that a checkpoint is a cost-effective and efficient way to mitigate many corruption issues. I considered both the Checkpoint and Backup options, aiming to use only one to protect against corruption instead of maintaining two separate features. (We can implement remote backups on top of checkpoints later, but I prefer using RocksDB's native backup engine to manage 'shared files' when multiple backups are involved.)

The reason I chose BackupEngine is, first, that it provides a comprehensive, all-in-one solution, despite the cost of copying data to an external disk; and second, concerns about space usage from checkpoints, especially in worst-case scenarios where many SST files are compacted after a checkpoint is created.

That said, I definitely see the value of a checkpoint from an operator's perspective. A CLI to create checkpoints before upgrades is a particularly good idea, and I appreciate your suggestion.

I will prioritize the checkpoint feature after completing this one.
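
To make the 'shared files' point concrete: RocksDB's BackupEngine stores unchanged SST files once and shares them across backups, so successive backups are effectively incremental. A hedged sketch of inspecting the resulting backup set, with an illustrative path (field names follow rust-rocksdb's BackupEngineInfo and may vary by version):

```rust
use rocksdb::backup::{BackupEngine, BackupEngineOptions};
use rocksdb::Env;

// Hedged sketch: list the backups managed by a BackupEngine directory.
// The path is illustrative, not the actual Walrus layout.
fn list_backups() -> Result<(), rocksdb::Error> {
    let env = Env::new()?;
    let opts = BackupEngineOptions::new("/mnt/external/walrus-backups")?;
    let engine = BackupEngine::open(&opts, &env)?;
    for info in engine.get_backup_info() {
        println!(
            "backup {} at {}: {} bytes in {} files",
            info.backup_id, info.timestamp, info.size, info.num_files
        );
    }
    Ok(())
}
```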

@mario4tier commented May 8, 2025

Thanks for doing all this!

In time, as you see fit, I can share measurements of daily data "turnover" (by design, this is easy to monitor with daily ZFS snapshots). That includes the cost of keeping a reference to "older versions" of the SST files.

In short, for our 21 mainnet shards, it currently costs 100 GB to 300 GB per day to keep holding a snapshot of the whole walrus/db directory, with the highest numbers when a compaction occurs (as you expected).

@halfprice (Contributor) left a comment

Thanks a lot @liquid-helium, this looks amazing!

I primarily looked at the execution logic. I only have some nits, plus one important comment: we should put all RocksDB operations on a blocking thread. Other than that, LGTM!

I don't really know much detail about RocksDB backup creation, so I'll rely on @sadhansood to review that part.
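
A minimal sketch of the pattern the reviewer is asking for, assuming a tokio runtime; the flush call stands in for whatever synchronous RocksDB work the CheckpointManager performs and is not the actual Walrus code:

```rust
use std::sync::Arc;

// Hedged sketch: RocksDB calls are synchronous and can block, so from async
// code they should run on tokio's blocking pool rather than the async executor.
async fn checkpoint_on_blocking_thread(db: Arc<rocksdb::DB>) -> anyhow::Result<()> {
    tokio::task::spawn_blocking(move || -> anyhow::Result<()> {
        // Synchronous RocksDB work happens here, off the async worker threads.
        db.flush()?;
        Ok(())
    })
    .await??; // first `?` handles the JoinError, second the closure's Result
    Ok(())
}
```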

@sadhansood (Contributor) left a comment

Thanks @liquid-helium, this is a solid PR for backing up walrus data!

@sadhansood (Contributor) left a comment

lgtm @liquid-helium, thanks for your amazing work on this!

@halfprice (Contributor) left a comment

LGTM 🚀🚀🚀

@liquid-helium merged commit b6f6d6c into main on May 30, 2025
26 checks passed
@liquid-helium deleted the backup-restore branch on May 30, 2025, 16:05