Impossible to request Raft snapshot if term of last entry in WAL is older than current term #551

ffuugoo opened this issue Oct 8, 2024 · 7 comments



ffuugoo commented Oct 8, 2024

I've been trying to use Raft snapshots to quickly synchronize Raft state when adding new nodes to the cluster or restarting existing nodes, but I've discovered that there is this check in Raft::request_snapshot.

Due to this check, it's impossible to request a Raft snapshot if the term of the last entry in the WAL (of the node that tries to request a snapshot) is older than the current term of the cluster. So, if there was even a single election since the node was last online, you can't request snapshots until you've appended all entries from previous terms and reached an entry from the current term.

E.g.:

Imagine we restarted a follower node after a long downtime. It currently has this WAL:

...
term: 42, index: 1000

And this is WAL on leader node:

...
term: 42, index: 1000,
...
term: 42, index: 100000,
term: 43, index: 100001, // election happened here
...
term: 43, index: 200000,

This means that the follower node has to synchronize ~99k entries (indices 1001 to 100001) through regular MsgAppend messages before it can request a Raft snapshot to "recover" the remaining ~100k entries.

So, my question is: is there an explanation why this limitation is required, or is this check overly restrictive/incorrect?

It also seems like the RawNode::request_snapshot documentation hints that the check should have been self.term >= request_index_term (instead of self.term == request_index_term).
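
To make the two variants concrete, here is a minimal sketch (hypothetical helper functions, not the actual raft-rs source; request_index_term stands for the term of the last entry in the requesting node's WAL):

```rust
// Hypothetical illustration of the check being discussed, NOT the raft-rs source.
// `term` is the node's current term, `request_index_term` is the term of the
// last entry in the WAL of the node that wants to request a snapshot.
fn snapshot_request_allowed_current(term: u64, request_index_term: u64) -> bool {
    // Behavior as I understand it today: the request is only accepted when the
    // last WAL entry is from the current term, so any election that happened
    // while the node was offline blocks the request.
    term == request_index_term
}

fn snapshot_request_allowed_relaxed(term: u64, request_index_term: u64) -> bool {
    // What the RawNode::request_snapshot docs seem to hint at: accept the
    // request as long as the node's last entry is not from a *newer* term.
    term >= request_index_term
}
```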


ffuugoo commented Oct 8, 2024

It also seems like this check might make it impossible to use WAL truncation/compaction?

Imagine you have a Raft cluster and you truncated the WAL on all nodes. Now you try to add a new node to the cluster. You can't synchronize WAL entries with MsgAppend messages, because the WAL is truncated. And you can't request a Raft snapshot on the new node, because its WAL is empty, so the term of the last entry in the WAL is 0/error.

We are not truncating/compacting the WAL currently, so I don't really know how WAL truncation would be handled between the Storage trait impl and the raft-rs crate. Maybe the Raft leader just sends a snapshot automatically if its WAL is truncated, without an explicit request from the follower? Anyway, I think this is worth noting here.

UPD: We've run a small test, and it seems like WAL compaction is handled automatically. Nice.


BusyJay commented Oct 9, 2024

You should not truncate the WAL, as correctness relies on persistent logs. Raft can't work with unreliable storage.

Maybe we can relax the check in request_snapshot and allow the caller to choose whether to request a snapshot of the current term. /cc @Connor1996


timvisee commented Oct 9, 2024

Maybe we can relax the check in request_snapshot and allow the caller to choose whether to request a snapshot of the current term.

That would be fantastic, assuming we can do this from a lower term peer! We're trying to use snapshots to quickly set up new nodes we add to a cluster, without replaying the whole raft log entry by entry.


BusyJay commented Oct 9, 2024

This should be possible already if you compact leader logs from time to time.

Also, you can consider making logs start from non-zero. For example, during initialization there is only one peer in the cluster. You can mark the first log index as 2. When adding new peers, they are initialized with first log index 0. Therefore, they must accept a snapshot instead of logs from the leader.
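
A rough sketch of that bootstrap with raft-rs's MemStorage (untested; the single-voter configuration and index/term values are just an example):

```rust
use raft::prelude::*;
use raft::storage::MemStorage;

// Untested sketch of the idea above: give the first peer a non-zero first log
// index by bootstrapping it from an initial snapshot. Peers added later start
// with an empty log (first index 0), so the leader has to send them a snapshot
// instead of replicating entries that start below the bootstrap index.
fn bootstrap_first_peer() -> MemStorage {
    let mut snapshot = Snapshot::default();
    snapshot.mut_metadata().index = 1; // the first appended entry will get index 2
    snapshot.mut_metadata().term = 1;
    snapshot.mut_metadata().mut_conf_state().voters = vec![1]; // single initial voter

    let storage = MemStorage::new();
    storage
        .wl()
        .apply_snapshot(snapshot)
        .expect("bootstrap snapshot should apply cleanly");
    storage
}
```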

The request_snapshot API is mainly aimed at repairing unexpected local data corruption.


ffuugoo commented Oct 9, 2024

This should be possible already if you compact leader logs from time to time.

Sorry, can you please clarify what the difference between truncation and compaction is? I thought they were pretty much the same thing?

Also, you can consider making logs start from non-zero. For example, during initialization there is only one peer in the cluster. You can mark the first log index as 2. When adding new peers, they are initialized with first log index 0. Therefore, they must accept a snapshot instead of logs from the leader.

🤯🤯🤯

The request_snapshot API is mainly aimed at repairing unexpected local data corruption.

Starting from a different index is a cool idea, but if we also want to sync existing nodes (after a restart), we still need request_snapshot, right?


BusyJay commented Oct 9, 2024

what's the difference between truncation and compaction

Truncation means deleting the tail of the log, compaction means deleting the head of the log. When the logs can be deleted depends highly on the application. In general, once a log entry has been committed and applied, it should be OK to delete it. Just make sure Storage::first_index returns the correct index.
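
For example, with raft-rs's MemStorage, compacting applied entries looks roughly like this (sketch; applied_index is whatever index your application has already applied):

```rust
use raft::{storage::MemStorage, Storage};

// Sketch: compaction discards the head of the log (entries before the compact
// index). Afterwards Storage::first_index reflects the new head, which is what
// the leader consults when deciding between sending entries and a snapshot.
fn compact_applied_logs(storage: &MemStorage, applied_index: u64) -> raft::Result<()> {
    storage.wl().compact(applied_index)?;
    // Entries below `applied_index` are gone; first_index() now starts there.
    let first = storage.first_index()?;
    debug_assert!(first >= applied_index);
    Ok(())
}
```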

but if we also want to sync existing nodes (after restart)

If an existing node is restarted quickly, then it should still catch up by logs. If it's been lost for quite a long time, it may catch up by logs or by snapshot, depending on whether the leader has compacted old logs. For example, if a peer is lost for a day but the leader doesn't accept any requests the whole time, then it doesn't make sense to force the follower to catch up by snapshot.

The behavior is controlled by how you manage the raft logs. It's a trade-off between disk space and replay cost. In TiKV, we prefer catching up by logs, as logs are usually small, so logs are kept as long as there is any peer that has not replicated them. However, if the logs exceed about 1/3 of the space of a snapshot, then applied logs can be deleted. You may develop your own strategy that works best for your scenario.
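
As a sketch of one such strategy along those lines (illustrative only, not TiKV's actual implementation; all parameters are made up for the example):

```rust
// Illustrative compaction policy in the spirit described above, not TiKV's code.
// Keep logs that a lagging peer still needs, unless the retained log has grown
// past a fraction of the snapshot size, in which case slow peers will have to
// catch up via snapshot instead.
fn choose_compact_index(
    applied_index: u64,
    slowest_peer_matched: u64, // smallest replicated (matched) index across peers
    log_size_bytes: u64,
    snapshot_size_bytes: u64,
) -> u64 {
    if log_size_bytes > snapshot_size_bytes / 3 {
        // The log is getting expensive to keep: drop everything already applied.
        applied_index
    } else {
        // Otherwise keep entries that some peer has not replicated yet, so it
        // can still catch up via MsgAppend instead of a full snapshot.
        applied_index.min(slowest_peer_matched)
    }
}
```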


ffuugoo commented Oct 9, 2024

Truncation means deleting the tail of the log, compaction means deleting the head of the log...

Got it, thanks! We just use "truncation" to mean "deletion of the head of the log" in Qdrant. 😅

If an existing node is restarted quickly, then it should still catch up by logs...

True. I guess we somehow did not realize that we should just use compaction and that it would solve most of our problems here. I'll experiment with doing just that then, thanks!
