Impossible to request Raft snapshot if term of last entry in WAL is older than current term #551

ffuugoo opened this issue Oct 8, 2024 · 7 comments



ffuugoo commented Oct 8, 2024

I've been trying to use Raft snapshots to quickly synchronize Raft state when adding new nodes to the cluster or restarting existing nodes, but I've discovered that there is this check in Raft::request_snapshot.

Due to this check, it's impossible to request a Raft snapshot if the term of the last entry in the WAL (of the node that tries to request a snapshot) is older than the current term of the cluster. So, if there was even a single election since the node was last online, you can't request snapshots until you've appended all entries from previous terms and reached an entry from the current term.

E.g.:

Imagine we restarted a follower node after a long downtime. It currently has this WAL:

...
term: 42, index: 1000

And this is WAL on leader node:

...
term: 42, index: 1000,
...
term: 42, index: 100000,
term: 43, index: 100001, // election happened here
...
term: 43, index: 200000,

This means that the follower node has to synchronize ~99k entries (indices 1001 to 100001) through regular MsgAppend messages before it can request a Raft snapshot to "recover" the remaining ~100k entries.

So, my question is: is there an explanation why this limitation is required, or is this check overly restrictive/incorrect?

It also seems like the RawNode::request_snapshot documentation hints that the check should have been self.term >= request_index_term (instead of self.term == request_index_term).
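
To make the two variants concrete, here is a minimal sketch (hypothetical helper functions, not the actual raft-rs source; request_index_term stands for the term of the last entry in the requesting node's WAL):

```rust
// Hypothetical illustration of the check being discussed, NOT the raft-rs source.
// `term` is the node's current term, `request_index_term` is the term of the
// last entry in the WAL of the node that wants to request a snapshot.
fn snapshot_request_allowed_current(term: u64, request_index_term: u64) -> bool {
    // Behavior as I understand it today: the request is only accepted when the
    // last WAL entry is from the current term, so any election that happened
    // while the node was offline blocks the request.
    term == request_index_term
}

fn snapshot_request_allowed_relaxed(term: u64, request_index_term: u64) -> bool {
    // What the RawNode::request_snapshot docs seem to hint at: accept the
    // request as long as the node's last entry is not from a *newer* term.
    term >= request_index_term
}
```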


ffuugoo commented Oct 8, 2024

It also seems like this check might make it impossible to use WAL truncation/compaction?

Imagine you have a Raft cluster and you truncated the WAL on all nodes. Now you try to add a new node to the cluster. You can't synchronize WAL entries with MsgAppend messages, because the WAL is truncated. And you can't request a Raft snapshot on the new node, because its WAL is empty, so the term of the last entry in the WAL is 0/error.

We are not truncating/compacting the WAL currently, so I don't really know how WAL truncation would be handled between the Storage trait impl and the raft-rs crate. Maybe the Raft leader just sends a snapshot automatically if its WAL is truncated, without an explicit request from the follower? Anyway, I think this is worth noting here.

UPD: We've run a small test, and it seems like WAL compaction is handled automatically. Nice.


BusyJay commented Oct 9, 2024

You should not truncate the WAL, as correctness relies on persistent logs. Raft can't work with unreliable storage.

Maybe we can relax the check in request_snapshot and allow the caller to choose whether to request a snapshot of the current term. /cc @Connor1996


timvisee commented Oct 9, 2024

Maybe we can relax the check in request_snapshot and allow the caller to choose whether to request a snapshot of the current term.

That would be fantastic, assuming we can do this from a lower term peer! We're trying to use snapshots to quickly set up new nodes we add to a cluster, without replaying the whole raft log entry by entry.


BusyJay commented Oct 9, 2024

This should be possible already if you compact leader logs from time to time.

Also, you can consider making logs start from non-zero. For example, during initialization there is only one peer in the cluster. You can mark the first log index as 2. When adding new peers, they are initialized with first log index 0. Therefore, they must accept a snapshot instead of logs from the leader.
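
A rough sketch of that bootstrap with raft-rs's MemStorage (untested; the single-voter configuration and index/term values are just an example):

```rust
use raft::prelude::*;
use raft::storage::MemStorage;

// Untested sketch of the idea above: give the first peer a non-zero first log
// index by bootstrapping it from an initial snapshot. Peers added later start
// with an empty log (first index 0), so the leader has to send them a snapshot
// instead of replicating entries that start below the bootstrap index.
fn bootstrap_first_peer() -> MemStorage {
    let mut snapshot = Snapshot::default();
    snapshot.mut_metadata().index = 1; // the first appended entry will get index 2
    snapshot.mut_metadata().term = 1;
    snapshot.mut_metadata().mut_conf_state().voters = vec![1]; // single initial voter

    let storage = MemStorage::new();
    storage
        .wl()
        .apply_snapshot(snapshot)
        .expect("bootstrap snapshot should apply cleanly");
    storage
}
```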

The request_snapshot API is mainly aimed at repairing unexpected local data corruption.


ffuugoo commented Oct 9, 2024

This should be possible already if you compact leader logs from time to time.

Sorry, can you please clarify what the difference between truncation and compaction is? I thought they were pretty much the same thing?

Also, you can consider making logs start from non-zero. For example, during initialization there is only one peer in the cluster. You can mark the first log index as 2. When adding new peers, they are initialized with first log index 0. Therefore, they must accept a snapshot instead of logs from the leader.

🤯🤯🤯

The request_snapshot API is mainly aimed at repairing unexpected local data corruption.

Starting from a different index is a cool idea, but if we also want to sync existing nodes (after a restart), we still need request_snapshot, right?


BusyJay commented Oct 9, 2024

what's the difference between truncation and compaction

Truncation means deleting the tail of the log, compaction means deleting the head of the log. When the logs can be deleted depends highly on the application. In general, once a log entry has been committed and applied, it should be OK to delete it. Just make sure Storage::first_index returns the correct index.
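
For example, with raft-rs's MemStorage, compacting applied entries looks roughly like this (sketch; applied_index is whatever index your application has already applied):

```rust
use raft::{storage::MemStorage, Storage};

// Sketch: compaction discards the head of the log (entries before the compact
// index). Afterwards Storage::first_index reflects the new head, which is what
// the leader consults when deciding between sending entries and a snapshot.
fn compact_applied_logs(storage: &MemStorage, applied_index: u64) -> raft::Result<()> {
    storage.wl().compact(applied_index)?;
    // Entries below `applied_index` are gone; first_index() now starts there.
    let first = storage.first_index()?;
    debug_assert!(first >= applied_index);
    Ok(())
}
```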

but if we also want to sync existing nodes (after restart)

If an existing node is restarted quickly, then it should still catch up by logs. If it's been lost for quite a long time, it may catch up by logs or by snapshot, depending on whether the leader has compacted old logs. For example, if a peer is lost for a day but the leader doesn't accept any requests the whole time, then it doesn't make sense to force the follower to catch up by snapshot.

The behavior is controlled by how you manage the raft logs. It's a trade-off between disk space and replay cost. In TiKV, we prefer catching up by logs, as logs are usually small, so logs are kept as long as there is any peer that has not replicated them. However, if the logs exceed about 1/3 of the space of a snapshot, then applied logs can be deleted. You may develop your own strategy that works best for your scenario.
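
As a sketch of one such strategy along those lines (illustrative only, not TiKV's actual implementation; all parameters are made up for the example):

```rust
// Illustrative compaction policy in the spirit described above, not TiKV's code.
// Keep logs that a lagging peer still needs, unless the retained log has grown
// past a fraction of the snapshot size, in which case slow peers will have to
// catch up via snapshot instead.
fn choose_compact_index(
    applied_index: u64,
    slowest_peer_matched: u64, // smallest replicated (matched) index across peers
    log_size_bytes: u64,
    snapshot_size_bytes: u64,
) -> u64 {
    if log_size_bytes > snapshot_size_bytes / 3 {
        // The log is getting expensive to keep: drop everything already applied.
        applied_index
    } else {
        // Otherwise keep entries that some peer has not replicated yet, so it
        // can still catch up via MsgAppend instead of a full snapshot.
        applied_index.min(slowest_peer_matched)
    }
}
```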


ffuugoo commented Oct 9, 2024

Truncation means deleting the tail of the log, compaction means deleting the head of the log...

Got it, thanks! We just use "truncation" to mean "deletion of the head of the log" in Qdrant. 😅

If an existing node is restarted quickly, then it should still catch up by logs...

True. I guess we somehow did not realize that we should just use compaction and that it would solve most of our problems here. I'll experiment with doing just that then, thanks!
