
feat: checkpoint Walrus RocksDB database #2020

Merged: 22 commits merged into main on May 30, 2025

Conversation

@liquid-helium (Contributor) commented Apr 25, 2025

Description

  • Added a CheckpointManager component, which provides
    • Periodic checkpointing of the main database
    • An interface for manual checkpoints
  • Introduced a node admin socket so that operators can interact directly with a running node (an illustrative sketch follows at the end of this description)
    • Added checkpoint commands to the walrus-node command-line tool

Contributes to WAL-842
Contributes to #1920
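
Purely as an illustration of the admin-socket idea above, here is what a tokio-based UNIX admin socket could roughly look like. The socket path, command name, and handling logic are hypothetical and are not the actual walrus-node protocol or code.

```rust
use tokio::io::{AsyncBufReadExt, AsyncWriteExt, BufReader};
use tokio::net::UnixListener;

// Hypothetical admin socket: accepts line-based commands from a local
// operator tool. Neither the socket path nor the "checkpoint" command is
// the actual walrus-node interface; this only illustrates the mechanism.
#[tokio::main]
async fn main() -> std::io::Result<()> {
    let listener = UnixListener::bind("/tmp/walrus-admin.sock")?;
    loop {
        let (stream, _addr) = listener.accept().await?;
        tokio::spawn(async move {
            let (read_half, mut write_half) = stream.into_split();
            let mut lines = BufReader::new(read_half).lines();
            while let Ok(Some(line)) = lines.next_line().await {
                let reply = match line.trim() {
                    // In the real node this would trigger the CheckpointManager.
                    "checkpoint" => "checkpoint requested\n",
                    _ => "unknown command\n",
                };
                if write_half.write_all(reply.as_bytes()).await.is_err() {
                    break;
                }
            }
        });
    }
}
```

An operator could then poke such a socket with something like `echo checkpoint | nc -U /tmp/walrus-admin.sock` (again, illustrative only).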

Test plan

How did you test the new or updated feature?


Release notes

Check each box that your changes affect. If none of the boxes relate to your changes, release notes aren't required.
For each box you check, include information after the relevant heading that describes the impact of your changes that
a user might notice and any actions they must take to implement updates. (Add release notes after the colon for each item.)

  • Storage node:
    • Added a CheckpointManager component, which provides
      • Periodic checkpointing of the main database
      • An interface for manual checkpoints
    • Introduced a node admin socket so that operators can interact directly with a running node
      • Added checkpoint commands to the walrus-node command-line tool
  • Aggregator:
  • Publisher:
  • CLI:


Warning: This PR modifies one of the example config files. Please consider the
following:

  • Make sure the changes are backwards compatible with the current configuration.
  • Make sure any added parameters follow the conventions of the existing parameters; in
    particular, durations should take seconds or milliseconds using the naming convention
    _secs or _millis, respectively.
  • If there are added optional parameter sections, it should be possible to specify them
    partially. A useful pattern there is to implement Default for the struct and derive
    #[serde(default)] on it; see BlobRecoveryConfig as an example (a minimal sketch follows after this list).
  • You may need to update the documentation to reflect the changes.
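
A minimal sketch of that pattern, assuming serde and serde_yaml; the struct and field names below are illustrative and not taken from the Walrus config (BlobRecoveryConfig in the codebase is the real example):

```rust
use serde::Deserialize;

// Illustrative optional config section that can be specified partially:
// missing fields fall back to the struct's Default implementation because
// of #[serde(default)]. Names and default values are hypothetical.
#[derive(Debug, Deserialize)]
#[serde(default)]
struct ExampleCheckpointConfig {
    /// Checkpoint interval in seconds (follows the `_secs` naming convention).
    interval_secs: u64,
    /// Number of checkpoints to retain.
    max_checkpoints: usize,
}

impl Default for ExampleCheckpointConfig {
    fn default() -> Self {
        Self {
            interval_secs: 3600,
            max_checkpoints: 3,
        }
    }
}

fn main() {
    // Only one field is given; the other is filled in from Default.
    let partial = "interval_secs: 600";
    let config: ExampleCheckpointConfig = serde_yaml::from_str(partial).unwrap();
    assert_eq!(config.max_checkpoints, 3);
    println!("{config:?}");
}
```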

@liquid-helium changed the title from "Backup restore" to "feat: checkpoint Walrus RocksDB database" on Apr 26, 2025
@mlegner (Collaborator) left a comment

Thanks for implementing this, @liquid-helium! I have two initial high-level comments. As I'm not too familiar with RocksDB checkpoints, nor with UNIX sockets in Rust, I'd prefer if somebody else would take a closer look at the details there.

@sadhansood (Contributor)

Thanks @liquid-helium for this change! I have one question: what happens to the checkpoints that we create periodically or on demand? Wouldn't they increase disk usage over time, and add operational burden, if they aren't copied out to some archive/cold-storage solution? One way we dealt with this in Sui is in db_checkpoint_handler.rs, where we also manage the second part of moving the local DB into a remote store once it is backed up.

@liquid-helium (Contributor, Author)

Thanks @sadhansood for raising this; that's a good example to follow if we decide to upload the backups :)
The disk usage is definitely something we need to consider, and storing checkpoints on the same storage is an anti-pattern.
In short, I'm currently thinking that operators add a local disk or some other external storage that can be accessed directly via a path, and the checkpoint is created directly in that path. Only a limited number of checkpoints are kept (currently defaulting to 3).
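
As a rough illustration of that retention scheme (not the actual Walrus implementation), a backup pass into an operator-provided path using rust-rocksdb's BackupEngine could look like the sketch below. The paths, the interval, and the retention count of 3 are placeholders taken from the discussion, and the exact API may differ between rust-rocksdb versions.

```rust
use rocksdb::backup::{BackupEngine, BackupEngineOptions};
use rocksdb::{DB, Env};
use std::time::Duration;

// Hedged sketch: back up the node database into an operator-provided path
// and keep only the three most recent backups. Paths are illustrative and
// error handling is simplified.
fn backup_once(db: &DB) -> Result<(), rocksdb::Error> {
    let env = Env::new()?;
    let opts = BackupEngineOptions::new("/mnt/external/walrus-backups")?;
    let mut engine = BackupEngine::open(&opts, &env)?;
    // Flush memtables first so the backup captures recent writes.
    engine.create_new_backup_flush(db, true)?;
    // Retain only the most recent backups (mirroring the "default to 3" above).
    engine.purge_old_backups(3)?;
    Ok(())
}

fn main() -> Result<(), rocksdb::Error> {
    let db = DB::open_default("/var/lib/walrus/db")?;
    loop {
        backup_once(&db)?;
        // Hypothetical period; the real interval would come from configuration.
        std::thread::sleep(Duration::from_secs(6 * 60 * 60));
    }
}
```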

@sadhansood (Contributor)

Makes sense to do it on a different storage locally. I think we should add documentation on how to set this up. But otherwise, the PR looks good and I will try to review in more detail next (I do have some questions on incremental backup as well as restores). It might be nice to set up a meeting to discuss this with some of the team members since it is a bigger change overall. Thanks @liquid-helium!

@mario4tier

| ... The disk usage is definitely something we need to consider, and storing checkpoints on the same storage is an anti-pattern.

In our case, creating fast and cheap hard-link checkpoints on the same storage would be preferred. A DB checkpoint would be followed by a ZFS snapshot of the whole storage (data + checkpoints).

Backing up to separate storage would then be the node operator's responsibility (ZFS send/receive in our case).

My point being: consider keeping the feature of creating/recovering from hard-link checkpoints, and develop remote backup as a separate, optional step/feature.

(It seems you are already "there"; just sharing my use case.)
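
For reference, the kind of hard-link checkpoint described here can be created with RocksDB's checkpoint API. A minimal sketch, assuming rust-rocksdb and illustrative paths (this is not part of the PR):

```rust
use rocksdb::checkpoint::Checkpoint;
use rocksdb::DB;

// Hedged sketch: a RocksDB checkpoint hard-links SST files on the same
// filesystem, so it is fast and initially cheap in space. Paths are
// illustrative; the checkpoint directory must not exist beforehand.
fn create_hard_link_checkpoint() -> Result<(), rocksdb::Error> {
    let db = DB::open_default("/var/lib/walrus/db")?;
    let checkpoint = Checkpoint::new(&db)?;
    checkpoint.create_checkpoint("/var/lib/walrus/checkpoints/2025-05-07")?;
    // A ZFS snapshot of the whole storage could then capture data + checkpoints together.
    Ok(())
}
```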

@liquid-helium (Contributor, Author) commented May 7, 2025

Hi @mario4tier, thank you for your comments—it's a great suggestion. Feel free to create a separate issue to track this and assign it to me. The current implementation focuses on scenarios where the disk is damaged, using BackupEngine for recovery. However, I agree a lightweight checkpoint using hard links could address other scenarios. I'll look into that based on priority. Thanks!
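
As a rough sketch of that recovery path (assuming rust-rocksdb's backup module; the paths below are placeholders, not the actual Walrus layout):

```rust
use rocksdb::backup::{BackupEngine, BackupEngineOptions, RestoreOptions};
use rocksdb::Env;

// Hedged sketch: restore the database from the most recent backup after a
// disk failure. Backup and database paths are illustrative.
fn restore_latest() -> Result<(), rocksdb::Error> {
    let env = Env::new()?;
    let opts = BackupEngineOptions::new("/mnt/external/walrus-backups")?;
    let mut engine = BackupEngine::open(&opts, &env)?;
    let restore_opts = RestoreOptions::default();
    // Restore both the data files and the WAL into a fresh database directory.
    engine.restore_from_latest_backup(
        "/var/lib/walrus/db",
        "/var/lib/walrus/db",
        &restore_opts,
    )?;
    Ok(())
}
```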

@mario4tier commented May 8, 2025

| Feel free to create a separate issue to track this and assign it to me.

I feel that DB checkpoint management (e.g., triggering, recovery, ...) is a common layer (for all use cases) upon which the additional remote backup features would be built.

So I was hopeful that "relatively simpler" hard-link checkpoint management could be exposed in the CLI as part of (or even before) the "more complicated" remote backup 😄

Not sure I am the right person to create another PR (in addition to #1920). I'd prefer to leave this to you, as you have more insight into how it should be implemented.

@mario4tier

| The current implementation focuses on scenarios where the disk is damaged, using BackupEngine for recovery.

Good, can't dispute the value of backups 😄

The rest is my opinion about the priority of implementing hard-link checkpoints versus remote backup:

Worst-case scenario

  • DB corruptions can have a bigger impact than drive failures. The most dramatic would be a walrus-node bug/upgrade that affects multiple operators "at once".

  • There have been multiple instances of DB corruption, likely due to walrus-node crashes/kills, but not fully understood (AFAIK).

  • Drive failures are already mitigated (to varying degrees) by many operators. It sucks when it happens, but it has a low probability of global impact.

A hard-link checkpoint is the "cheapest" protection against the "worst" scenario, the one we do not fully control/understand.

"Cheap" also means a better chance of actually being enabled by most operators in the short term (if an operator hasn't already bothered with RAID-like protection on the live network, they are less likely to invest in costly remote backups soon).

===

Just to be clear, I expect most responsible operators to want, and eventually use, remote backups.

@liquid-helium (Contributor, Author)

| DB corruptions can have a bigger impact than drive failures. ...

I agree that a checkpoint is a cost-effective and efficient way to mitigate many corruption issues. I considered both the Checkpoint and Backup options, aiming to use only one to protect against corruption instead of maintaining two separate features. (We can implement remote backups on top of checkpoints later, but I prefer using RocksDB's native backup engine to manage 'shared files' when multiple backups are involved.)

The reason I chose BackupEngine is, first, that it provides a comprehensive, all-in-one solution, despite the cost of copying data to an external disk; and second, concerns about space usage from checkpoints, especially in worst-case scenarios where many SST files are compacted after a checkpoint is created.

That said, I definitely see the value of a checkpoint from an operator's perspective. A CLI to create checkpoints before upgrades is a particularly good idea, and I appreciate your suggestion.

I will prioritize the checkpoint feature after completing this one.
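
To make the 'shared files' point concrete: RocksDB's BackupEngine stores unchanged SST files once and shares them across backups, so successive backups are effectively incremental. A hedged sketch of inspecting the resulting backup set, with an illustrative path (field names follow rust-rocksdb's BackupEngineInfo and may vary by version):

```rust
use rocksdb::backup::{BackupEngine, BackupEngineOptions};
use rocksdb::Env;

// Hedged sketch: list the backups managed by a BackupEngine directory.
// The path is illustrative, not the actual Walrus layout.
fn list_backups() -> Result<(), rocksdb::Error> {
    let env = Env::new()?;
    let opts = BackupEngineOptions::new("/mnt/external/walrus-backups")?;
    let engine = BackupEngine::open(&opts, &env)?;
    for info in engine.get_backup_info() {
        println!(
            "backup {} at {}: {} bytes in {} files",
            info.backup_id, info.timestamp, info.size, info.num_files
        );
    }
    Ok(())
}
```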

@mario4tier commented May 8, 2025

Thanks for doing all this!

In time, as you see fit, I can share measurements of daily data "turnover" (by design, this is easy to monitor with daily ZFS snapshots). That includes the cost of keeping a reference to "older versions" of the SST files.

In short, for our 21 mainnet shards, it currently costs 100 GB to 300 GB per day to keep holding a snapshot of the whole walrus/db directory, with the highest numbers when a compaction occurs (as you expected).

@halfprice (Contributor) left a comment

Thanks a lot @liquid-helium, this looks amazing!

I primarily looked at the execution logic. I only have some nits, plus one important comment: we should put all RocksDB operations on a blocking thread. Other than that, LGTM!

I don't really know much detail about RocksDB backup creation, so I'll rely on @sadhansood to review that part.
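
A minimal sketch of the pattern the reviewer is asking for, assuming a tokio runtime; the flush call stands in for whatever synchronous RocksDB work the CheckpointManager performs and is not the actual Walrus code:

```rust
use std::sync::Arc;

// Hedged sketch: RocksDB calls are synchronous and can block, so from async
// code they should run on tokio's blocking pool rather than the async executor.
async fn checkpoint_on_blocking_thread(db: Arc<rocksdb::DB>) -> anyhow::Result<()> {
    tokio::task::spawn_blocking(move || -> anyhow::Result<()> {
        // Synchronous RocksDB work happens here, off the async worker threads.
        db.flush()?;
        Ok(())
    })
    .await??; // first `?` handles the JoinError, second the closure's Result
    Ok(())
}
```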

@sadhansood (Contributor) left a comment

Thanks @liquid-helium, this is a solid PR for backing up walrus data!

@sadhansood (Contributor) left a comment

lgtm @liquid-helium, thanks for your amazing work on this!

@halfprice (Contributor) left a comment

LGTM 🚀🚀🚀

@liquid-helium merged commit b6f6d6c into main on May 30, 2025
26 checks passed
@liquid-helium deleted the backup-restore branch on May 30, 2025, 16:05