Impossible to request Raft snapshot if term of last entry in WAL is older than current term #551
Comments
It also seems like this check might make it impossible to use WAL truncation/compaction? Imagine you have a Raft cluster, and you truncated the WAL on all nodes. Now you try to add a new node to the cluster. You can't synchronize the truncated WAL entries with the new node. We are not truncating/compacting the WAL currently, so I don't really know how WAL truncation would be handled between nodes.

UPD: We've run a small test, and it seems like WAL compaction is handled automatically. Nice.
You should not truncate the WAL, as correctness relies on persistent logs; Raft can't work with unreliable storage. Maybe we can relax the check in `Raft::request_snapshot`.
That would be fantastic, assuming we can do this from a lower-term peer! We're trying to use snapshots to quickly set up new nodes we add to a cluster, without replaying the whole Raft log entry by entry.
This should be possible already if you compact the leader's logs from time to time. Also, you can consider making logs start from a non-zero index. For example, during initialization there is only one peer in the cluster; you can mark the first log index as 2. When adding new peers, they are initialized with first log index 0, so they must accept a snapshot instead of logs from the leader. The `request_snapshot` API is mainly aimed at repairing unexpected local data corruption.
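For concreteness, here is a minimal sketch of the non-zero-first-index idea using raft-rs's `MemStorage`. Installing an initial snapshot at index 1 is one way (an assumption, not necessarily the commenter's exact recipe) to make the log start at index 2:

```rust
use raft::eraftpb::Snapshot;
use raft::storage::MemStorage;

// Minimal sketch: give the first peer a log that starts at index 2 by
// installing an initial snapshot whose last index is 1. A fresh peer
// whose log starts at index 0 can then only catch up via snapshot,
// never via log entries. Indices and conf state here are illustrative.
fn init_first_peer_storage() -> MemStorage {
    let storage = MemStorage::new();

    let mut snap = Snapshot::default();
    snap.mut_metadata().index = 1; // the log will start at index 2
    snap.mut_metadata().term = 1;
    snap.mut_metadata().mut_conf_state().voters = vec![1]; // single-peer cluster

    storage
        .wl()
        .apply_snapshot(snap)
        .expect("applying an initial snapshot to empty storage");
    storage
}
```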
Sorry, can you please clarify: what's the difference between truncation and compaction? I thought they were pretty much the same thing?
🤯🤯🤯
Starting from a different index is a cool idea, but if we also want to sync existing nodes (after a restart), we still need `request_snapshot`.
Truncation means deleting the tail of the log; compaction means deleting the head. When logs can be deleted depends highly on the application. In general, once a log entry has been committed and applied, it should be OK to delete it. Just make sure a snapshot covering the deleted entries is still available, so lagging peers can catch up.

If an existing node is restarted quickly, it should still catch up by logs. If it has been lost for quite a long time, it may catch up by logs or by snapshot, depending on whether the leader has compacted old logs. For example, if a peer is lost for a day but the leader doesn't accept any requests the whole time, then it doesn't make sense to force the follower to catch up by snapshot. The behavior is controlled by how you manage the Raft logs; it's a trade-off between disk space and replay cost. In TiKV, we prefer catching up by logs, as logs are usually small, so logs are kept as long as there is any peer that has not replicated them. However, if the logs exceed about 1/3 of the snapshot size, then applied logs can be deleted. You may develop your own strategy that works best for your scenario.
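As a sketch of that trade-off (the function name, inputs, and the placement of the 1/3 threshold are illustrative assumptions, not TiKV's actual code):

```rust
/// Choose the index up to which applied Raft log entries may be deleted.
/// Illustrative sketch of the strategy described above, not TiKV's code.
fn compact_point(applied: u64, peer_matched: &[u64], log_bytes: u64, snapshot_bytes: u64) -> u64 {
    // Never delete entries that have not been applied locally.
    let mut point = applied;

    // Prefer catching peers up by logs: keep every entry that some
    // peer has not replicated yet...
    if let Some(&slowest) = peer_matched.iter().min() {
        point = point.min(slowest);
    }

    // ...unless the log has grown past ~1/3 of the snapshot size, in
    // which case reclaim the space and let slow peers use a snapshot.
    if log_bytes > snapshot_bytes / 3 {
        point = applied;
    }

    point
}
```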
Got it, thanks! We just use "truncation" to mean "deletion of the head of the log" in Qdrant. 😅
True. I guess we somehow did not realize that we should just use compaction, and that it would solve most of our problems here. I'll experiment with doing just that then, thanks!
Original issue
I've been trying to use Raft snapshots to quickly synchronize Raft state when adding new nodes to the cluster or restarting existing nodes, but I've discovered that there's this check in `Raft::request_snapshot`.

Due to this check it's impossible to request a Raft snapshot if the `term` of the last entry in the WAL (of the node that tries to request a snapshot) is older than the current `term` of the cluster. So, if there has been even a single election since the node was last online, you can't request a snapshot until you have appended all entries from previous terms and reached an entry from the current term.

E.g.:
Imagine we restarted a follower node after a long downtime; compare its WAL with the WAL on the leader node, sketched below.
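The concrete indices and terms here are assumptions, chosen only to be consistent with the entry counts quoted in the next paragraph:

```text
follower WAL:  indices 1..=1000        term 1   (last entry is from an old term)

leader WAL:    indices 1..=100000      term 1
               indices 100001..=200001 term 2   (the current term)
```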
This means that the follower node has to synchronize ~99k entries (indices 1001 to 100001) through regular `MsgAppend` messages before it can request a Raft snapshot to "recover" the remaining ~100k entries.

So, my question would be: is there an explanation for why this limitation is required, or is this check overly restrictive/incorrect?
Seems like the `RawNode::request_snapshot` documentation also hints that the check should have been `self.term >= request_index_term` (instead of `self.term == request_index_term`).
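For reference, a paraphrase of the condition being discussed; this is a hedged sketch of the two variants, not the actual raft-rs source:

```rust
/// Paraphrase of the guard discussed in this issue (not the raft-rs
/// source). `term` is the node's current term; `request_index_term` is
/// the term of the node's last WAL entry at the point of the request.
fn may_request_snapshot(term: u64, request_index_term: u64) -> bool {
    // Current behavior: strict equality, so a follower whose last WAL
    // entry predates the latest election cannot request a snapshot.
    term == request_index_term
}

/// The relaxation hinted at by the `RawNode::request_snapshot` docs.
fn may_request_snapshot_relaxed(term: u64, request_index_term: u64) -> bool {
    term >= request_index_term
}
```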