WIP: Add mechanism that allows writing to the leader's disk in parallel with replication. #579
base: main
Conversation
We found at least 3 bugs in this implementation during performance testing this week!
I briefly tried this branch again after the last commit but still experienced an explosion in disk IO compared with sync (async first, same workload). So it looks like something is still not quite right here - the next step would be to write tests that actually validate the right set of logs being flushed each time, as I suspect that is still where the issue lies.
TODO:
Another extremely subtle issue that we'd have to resolve before merging this (if we choose to) lies with the assumptions made during LeadershipTransfer. Right now there is an implicit assumption that by setting the
At first, I thought this PR would violate that assumption because the leader loop might move on before logs were persisted and start a transfer assuming that no concurrent log appends are happening. After careful thought, I think this is actually not an issue. Here's my reasoning:
So the transfer will behave the same way as it does now, even though it's possible that the follower will end up with a more up-to-date log (at least durably on disk) than the current leader by the time we issue the
I think I found the bug! Although we were now correctly updating the persistent index after my last fix, we still weren't clearing the
I've updated the tests to prove this:
I think this is OK for perf testing. One TODO if we choose to take it forward:
This PR implements an experimental, work-in-progress version of the "Write leader's log in parallel" optimization.
I've previously discussed this optimization in #507 and #578.
In this library, a write operation must first be committed to the LogStore on the leader (including an fsync) before replication threads are notified and begin to replicate the data to followers. The write can't be applied to the FSM until at least quorum-1 followers (since the leader has already committed) have acknowledged that they have also committed it to their log stores.
This means that every write necessarily has to wait for:
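To make the ordering concrete, here is a simplified sketch of the current synchronous path. The helpers `notifyReplication`, `waitForQuorumAcks`, and `applyToFSM` are illustrative stand-ins, not this library's actual internals:

```go
package sketch

import "github.com/hashicorp/raft"

// Illustrative stand-ins for the library's internal machinery.
func notifyReplication(logs []*raft.Log) {}
func waitForQuorumAcks(logs []*raft.Log) {}
func applyToFSM(logs []*raft.Log) error  { return nil }

// leaderWrite sketches today's fully synchronous ordering on the leader.
func leaderWrite(store raft.LogStore, logs []*raft.Log) error {
	// 1. Persist (and fsync) on the leader; StoreLogs must not return
	//    until the entries are durable.
	if err := store.StoreLogs(logs); err != nil {
		return err
	}
	// 2. Only now are replication threads notified, so followers start
	//    receiving (and fsync-ing) the entries strictly after the leader.
	notifyReplication(logs)
	// 3. Wait for quorum-1 follower acks; the leader's own durable copy
	//    already counts towards the quorum.
	waitForQuorumAcks(logs)
	// 4. Only then is the write applied to the FSM.
	return applyToFSM(logs)
}
```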
In Diego's full PhD thesis on Raft, he points out (section 10.2.1) that it's safe to parallelize the write to the leader's disk with the replication to followers, provided that the leader doesn't mark itself as committed until it actually completes the fsync. He claims it's even safe to commit and apply to the FSM before the leader has completed the fsync of its own log, provided a full quorum of followers have confirmed they did!
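Stated as a rule, that commit condition might look something like this hypothetical helper (not code from this PR or the library):

```go
package sketch

// canCommit expresses the commit rule from section 10.2.1: an index is
// committed once it is durable on a quorum of servers, but the leader may
// only count itself after its own fsync has completed. With a full quorum
// of follower acks, the entry can commit even before the leader's fsync.
func canCommit(index, leaderDurableIndex uint64, followerAcks, clusterSize int) bool {
	quorum := clusterSize/2 + 1
	if leaderDurableIndex >= index {
		return followerAcks+1 >= quorum // leader counts itself
	}
	return followerAcks >= quorum // leader's own write still in flight
}
```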
I've thought about how to do this in our raft library for years. It's tricky because the library's concurrency design currently relies on the logs being readable from the leader's `LogStore` by replication threads in order to replicate to followers. Meanwhile, the current contract of `LogStore.StoreLogs` is that the logs must be durable on disk before it returns.

I considered ways to make the new `raft-wal` work in an async flushing mode, which is not hard but is significant work, and also has the downside of limiting us to only the newer WAL storage.

Recently I had the insight that we can achieve this much more simply (and therefore, I think, with lower effort and risk) in the way implemented here, using a single `LogCacheAsync` implementation that will work with any underlying synchronous `LogStore`.

Like the existing `LogCache`, it maintains a circular buffer of recent log appends and serves those to reads. Unlike the existing cache though, when it is put into async mode (i.e. when the node becomes the leader), writes are not proxied immediately to the underlying store but instead only added to the circular buffer, and the background flusher is triggered.

There is a background goroutine that flushes all the writes in the buffer that aren't yet in storage to the underlying store every time it's triggered (i.e. there is no added delay, but when writes are coming in faster than the disk can flush individually it provides another level of batching to improve throughput). After each flush, the flusher goroutine delivers a `WriteCompletion`, including the index that is now stored safely, to the provided channel.
The raft leader loop needs only a few small modifications to work with this scheme:
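As a rough illustration of the kind of change involved, here is a hypothetical fragment that consumes the `WriteCompletion` values from the sketch above; `commitTracker` is a stand-in for the library's internal commitment bookkeeping, not its real API:

```go
package sketch

// commitTracker stands in for the library's internal commitment bookkeeping,
// which records the highest durable index per server and derives the commit
// index from the quorum.
type commitTracker struct{}

func (c *commitTracker) match(serverID string, index uint64) { /* elided */ }

// consumeCompletions shows the central idea: the leader no longer counts its
// own appends towards the quorum immediately; it waits for the flusher's
// WriteCompletion and then treats its own log like a follower ack.
func consumeCompletions(completeCh <-chan WriteCompletion, commitment *commitTracker) {
	for wc := range completeCh {
		if wc.Error != nil {
			// Failing to persist the local log is fatal for leadership;
			// real code would step down rather than panic.
			panic(wc.Error)
		}
		// Count the leader's own durable index, exactly as if a follower
		// had acknowledged replication up to this point.
		commitment.match("leader-local", wc.PersistentIndex)
	}
	// In the real leader loop this would be one case in a select alongside
	// RPC responses, timers, and shutdown.
}
```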
Status and Testing
This is WIP while we do further testing to verify it actually improves performance in a meaningful way.
I've included some basic unit tests and a `fuzzy` test that simulates partitions and shook out a few basic bugs in the implementation, giving some confidence that we are not trivially losing data during partitions due to the async code path.

I've not yet included tests that simulate crashes and recoveries, which might be more interesting in terms of validating that we don't consider things committed that were not yet persisted to disk. That's a TODO.
I'm reasonably confident, though, based on Diego's thesis, the fact that etcd does something similar, and my reasoning about this library's commit behaviour, that this at least can be correct, even if there may be bugs to iron out.
If we find the performance of this branch warrants the additional work, we can do some more thorough crash fault injection work.