WIP: Add mechanism that allows writing to the leader's disk in parallel with replication. #579

Draft
wants to merge 5 commits into main

Conversation

@banks (Member) commented Oct 30, 2023

This PR implements an experimental, work-in-progress version of the "write leader's log in parallel" optimization.

I've previously discussed this optimization in #507 and #578.

In this library, a write operation must currently be committed to the LogStore on the leader (including an fsync) before replication threads are notified and begin to replicate the data to followers. The write can't be applied to the FSM until at least quorum - 1 followers (the leader already counts, having committed locally) have acknowledged that they have also committed it to their log stores.

This means that every write necessarily has to wait for:

  • Disk write/fsync on the leader's log
  • One round trip time (RTT) to a follower
  • Disk write/fsync on the follower
    • Note: followers replicate in parallel, so this is effectively just one RTT and one disk commit overall
  • FSM.Apply time on the leader

In Diego's full PhD thesis on Raft, he points out (section 10.2.1) that it's safe to parallelize the write to the leader's disk with the replication to followers, provided that the leader doesn't count its own entry towards commitment until it has actually completed the fsync. He claims it's even safe to commit and apply an entry to the FSM before the leader has completed the fsync of its own log, provided a full quorum of followers have confirmed they have persisted it!
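For a rough, purely illustrative sense of the win: if the leader's fsync takes 2 ms, the RTT to a follower is 1 ms, and the follower's fsync takes 2 ms, the current path spends about 2 + 1 + 2 = 5 ms before the FSM apply, while writing the leader's log in parallel with replication spends about max(2, 1 + 2) = 3 ms, because the leader's fsync overlaps with the round trip and the follower's fsync.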

I've thought about how to do this in our raft library for years. It's tricky because the library's concurrency design currently relies on the logs being readable from the leader's LogStore by replication threads in order to replicate to followers. Meanwhile the current contract of LogStore.StoreLogs is that the logs must be durable on disk before it returns.

I considered ways to make the new raft-wal work in an async-flushing mode, which is not hard but is significant work, and it has the downside of limiting this optimization to the newer WAL storage only.

Recently I had the insight that we can achieve this much more simply (and therefore, I think, with lower effort and risk) in the way implemented here: a single LogCacheAsync implementation that works with any underlying synchronous LogStore.

Like the existing LogCache, it maintains a circular buffer of recent log appends and serves those to reads. Unlike the existing cache, though, when it is put into async mode (i.e. when the node becomes the leader), writes are not proxied immediately to the underlying store but are instead only added to the circular buffer, and the background flusher is triggered.

A background goroutine flushes all the writes in the buffer that aren't yet in storage to the underlying store every time it's triggered (i.e. there is no added delay, but when writes arrive faster than the disk can flush them individually, this provides another level of batching that improves throughput). After each flush, the flusher delivers a WriteCompletion, including the index that is now safely stored, to the provided channel.
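To make the shape concrete, here is a minimal sketch of that idea in Go. StoreLogsAsync and WriteCompletion are names from this PR; everything else (the package name, the field names, and the use of a plain slice instead of the real circular buffer) is illustrative rather than the actual implementation.

```go
package logcacheasync

import (
	"sync"

	"github.com/hashicorp/raft"
)

// WriteCompletion reports the highest log index that is now durable in the
// underlying store (illustrative shape; the real fields may differ).
type WriteCompletion struct {
	PersistentIndex uint64
	Error           error
}

// asyncCache stands in for LogCacheAsync: writes land in an in-memory buffer
// and a background goroutine flushes them to any synchronous raft.LogStore.
type asyncCache struct {
	mu           sync.Mutex
	store        raft.LogStore
	pending      []*raft.Log            // accepted but not yet durable
	triggerCh    chan struct{}          // wakes the flusher; buffered, capacity 1
	completionCh chan<- WriteCompletion // the channel passed via EnableAsync
}

// StoreLogsAsync appends to the buffer and nudges the flusher. It returns
// before the logs are durable; durability is reported via completionCh.
func (c *asyncCache) StoreLogsAsync(logs []*raft.Log) error {
	c.mu.Lock()
	c.pending = append(c.pending, logs...)
	c.mu.Unlock()
	select {
	case c.triggerCh <- struct{}{}:
	default: // a flush is already pending; it will pick these logs up too
	}
	return nil
}

// runFlusher is the background goroutine: each wake-up flushes everything
// accumulated so far in a single StoreLogs call (one fsync per batch).
func (c *asyncCache) runFlusher() {
	for range c.triggerCh {
		c.mu.Lock()
		batch := c.pending
		c.pending = nil // reset so the same logs are never flushed twice
		c.mu.Unlock()
		if len(batch) == 0 {
			continue
		}
		err := c.store.StoreLogs(batch)
		c.completionCh <- WriteCompletion{
			PersistentIndex: batch[len(batch)-1].Index,
			Error:           err,
		}
	}
}
```

The real implementation also has to bound the buffer and keep serving reads (e.g. GetLog calls from replication threads) out of it, which this sketch omits.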

The raft leader loop needs only a few small modifications to work with this scheme (a rough sketch follows this list):

  1. When entering the leader state, we check whether the log store implements this new interface. If so, we create a chan for notifications of write completions and call EnableAsync on the log store, passing the chan.
  2. dispatchLogs now checks whether the log store is async. If it is, it uses StoreLogsAsync and does not update commitment information after it returns.
  3. The leader loop also waits for notifications of write completions on the created chan. When it gets one, it updates the commitment information just as we do when replication confirms a remote log was persisted (i.e. this is already safe across multiple threads). From here everything else just works: once that commitment hits a quorum, whether or not it includes the leader, the index is pushed to the FSM apply queue, etc.
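For illustration, here's roughly what that wiring could look like, reusing the WriteCompletion shape from the sketch above. LogStoreAsync, node, and commitmentTracker are hypothetical stand-ins for the library's actual internals, and the EnableAsync/StoreLogsAsync signatures are assumptions based on the description above.

```go
// LogStoreAsync sketches the optional interface the leader checks for.
type LogStoreAsync interface {
	raft.LogStore
	EnableAsync(completionCh chan<- WriteCompletion)
	StoreLogsAsync(logs []*raft.Log) error
}

// commitmentTracker stands in for the existing commitment bookkeeping that
// replication acknowledgements already feed into.
type commitmentTracker interface {
	Match(server raft.ServerID, index uint64)
}

type node struct {
	localID    raft.ServerID
	logs       raft.LogStore
	commitment commitmentTracker
}

func (n *node) runLeader() {
	completionCh := make(chan WriteCompletion, 64)
	if async, ok := n.logs.(LogStoreAsync); ok {
		async.EnableAsync(completionCh) // step 1
	}
	for {
		select {
		case wc := <-completionCh: // step 3: the leader's own disk ack
			if wc.Error != nil {
				return // treat like a failed local write today: step down
			}
			// Identical to handling a follower's replication ack.
			n.commitment.Match(n.localID, wc.PersistentIndex)
			// ... existing leader loop cases continue unchanged
		}
	}
}

func (n *node) dispatchLogs(logs []*raft.Log) error {
	if async, ok := n.logs.(LogStoreAsync); ok { // step 2
		// Returns before the fsync; commitment is updated via completionCh.
		return async.StoreLogsAsync(logs)
	}
	if err := n.logs.StoreLogs(logs); err != nil {
		return err
	}
	// Synchronous path: the leader counts itself immediately, as today.
	n.commitment.Match(n.localID, logs[len(logs)-1].Index)
	return nil
}
```

Because completions flow through the same commitment path as replication acks, a quorum of followers can even commit an entry before the leader's own fsync finishes, which is the behaviour the thesis argues is safe.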

Status and Testing

This is WIP while we do further testing to verify it actually improves performance in a meaningful way.

I've included some basic unit tests and a fuzzy test that simulates partitions; it shook out a few basic bugs in the implementation, giving some confidence that we are not trivially losing data during partitions due to the async code path.

I've not yet included tests that simulate crashes and recoveries, which would be more interesting for validating that we never consider something committed that wasn't yet persisted to disk. That's a TODO.

I'm reasonably confident, though, based on Diego's thesis, the fact that etcd does something similar, and my reasoning about this library's commit behaviour, that this at least can be correct, even if there may be bugs to iron out.

If we find the performance of this branch warrants the additional work, we can do some more thorough crash fault injection work.

@banks banks requested a review from a team as a code owner October 30, 2023 21:55
@banks banks requested review from loshz and removed request for a team October 30, 2023 21:55
@banks banks marked this pull request as draft October 30, 2023 21:56
@banks banks changed the title X/async log WIP: Add mechanism that allows writing to the leader's disk in parallel with replication. Oct 30, 2023
log_cache_async.go (outdated review thread, resolved)
@banks (Member, Author) commented Nov 10, 2023

We found at least 3 bugs in this implementation during performance testing this week!

  1. DeleteRange doesn't work while in async mode. I thought this wasn't needed for some reason, but it is, and the result is that the leader is unable to truncate logs after a snapshot.
  2. Deadlock! If the cache fills up there is a deadlock. I've commented above and it's easy to fix (a sketch of one possible approach follows this list). I didn't expect to hit that code path, though; it's possible I only did because of issue 3 here.
  3. The graphs suggest something else is very wrong: we saw disk throughput skyrocket, far more than without this change, until it hits the limit of the disk and everything falls apart. It seems like the logic that updates the cursors after a flush may be wrong, so the cache never "empties" and the next flush writes the same data again, etc. That should be easy enough to verify!
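For issue 2, here is the general shape of one way to apply backpressure without deadlocking, in the same illustrative package as the earlier sketches. This is a sketch only, not necessarily the fix referenced above, and all names are made up: a writer that finds the buffer full waits on a sync.Cond, which releases the mutex so the flusher can still drain.

```go
// boundedBuffer illustrates deadlock-free backpressure: a writer that would
// overflow the buffer waits on a condition variable (releasing the mutex)
// until the flusher drains entries and broadcasts.
type boundedBuffer struct {
	mu         sync.Mutex
	spaceFreed *sync.Cond // constructed with sync.NewCond(&b.mu)
	pending    []*raft.Log
	maxPending int
}

func newBoundedBuffer(max int) *boundedBuffer {
	b := &boundedBuffer{maxPending: max}
	b.spaceFreed = sync.NewCond(&b.mu)
	return b
}

// append blocks while the buffer is full, but Wait drops the mutex so the
// flusher can still make progress.
func (b *boundedBuffer) append(logs []*raft.Log) {
	b.mu.Lock()
	defer b.mu.Unlock()
	for len(b.pending)+len(logs) > b.maxPending {
		b.spaceFreed.Wait()
	}
	b.pending = append(b.pending, logs...)
}

// drainForFlush is called by the flusher goroutine; it empties the buffer
// and wakes any writer blocked waiting for space.
func (b *boundedBuffer) drainForFlush() []*raft.Log {
	b.mu.Lock()
	defer b.mu.Unlock()
	batch := b.pending
	b.pending = nil
	b.spaceFreed.Broadcast()
	return batch
}
```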

@banks (Member, Author) commented Nov 15, 2023

I briefly tried this branch again after the last commit but still saw an explosion in disk IO compared with sync (async first, same workload):

[Disk IO graphs: the async run first, then the same workload with synchronous writes.]

So it looks like something is still not quite right here. The next step is to write tests that actually validate that the right set of logs is flushed each time, as I suspect that is still where the issue lies.

@banks (Member, Author) commented Nov 20, 2023

TODO:

  • Test/verify that this works correctly in all cases now that the leader's FSM snapshot could possibly be at a later index than its logs after a restart. Do we still restore correctly in that case?

@banks (Member, Author) commented Nov 24, 2023

Another extremely subtle issue, which we'd have to resolve before merging this if we choose to, lies in the assumptions made during LeadershipTransfer.

Right now there is an implicit assumption that by setting the leadershipTransferInProgress flag during the leader loop before initiating a transfer, we guarantee:

  1. All logs the leader has accepted up to this point are already written to its own log (and therefore will be accounted for when "catching up" the target new leader just by reading getLastLog).
  2. All logs accepted by the leader loop after this point (up until the transfer completes and we either step down or resume normal operation) will be immediately failed with ErrLeadershipTransferInProgress.

At first, I thought this PR would violate that assumption because the leader loop might move on before logs were persisted and start a transfer assuming that no concurrent log appends are happening.

After careful thought, I think this is actually not an issue. Here's my reasoning:

  1. Regardless of whether they have been fsynced, we still call setLastLog with the highest log index before we return from dispatchLogs.
  2. So a leadership transfer might be concurrent with the actual disk sync (i.e. the leader might not have finished committing the logs yet), but it can't start persisting more logs because of the existing check that causes further log appends to error with ErrLeadershipTransferInProgress.
  3. leadershipTransfer will replicate up to the last log index that was set, regardless of whether that log is durable on disk or not; the async log store guarantees to return it to a Get even before it's durable.

So the transfer will behave the same way it does now. It's possible that the follower will end up with a more up-to-date log (at least durably on disk) than the current leader by the time we issue the TimeoutNow RPC to force it to hold an election, but it can't be less up-to-date.

@banks (Member, Author) commented Oct 7, 2024

I think I found the bug! Although we were correctly updating the persistent index after my last fix, we still weren't clearing the batch slice, so we were still re-writing the same logs over and over. 🤦

A monotonic log store would have errored, but I was testing in Vault with BoltDB, which is not monotonic and so will happily store the same logs over and over again with each write.

I've updated the tests to prove this:

  1. The in-mem store is now (optionally) monotonic to catch this (a sketch of such a guard follows this list)
  2. I realised the test was not correctly exercising completions to detect errors!
  3. I realised the test is non-deterministic and timing-sensitive as to how many flushes occur, so it now allows for any valid combination of flushes
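As an illustration of point 1, a guard along these lines (shown here layered over the library's InmemStore purely for the sketch, with "fmt" and "sync" imported; the PR's actual option and wiring may differ) turns a repeated or out-of-order append into a hard error:

```go
// monotonicStore wraps an in-memory store and rejects any append that isn't
// exactly lastIndex+1, which is the property that would have caught the
// double-flush bug immediately. Illustrative only.
type monotonicStore struct {
	*raft.InmemStore
	mu        sync.Mutex
	lastIndex uint64
}

func (s *monotonicStore) StoreLogs(logs []*raft.Log) error {
	s.mu.Lock()
	defer s.mu.Unlock()
	for _, l := range logs {
		if s.lastIndex != 0 && l.Index != s.lastIndex+1 {
			return fmt.Errorf("non-monotonic append: got index %d, want %d",
				l.Index, s.lastIndex+1)
		}
		s.lastIndex = l.Index
	}
	return s.InmemStore.StoreLogs(logs)
}
```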

@banks (Member, Author) commented Oct 7, 2024

I think this is OK for perf testing. Some TODOs if we choose to take it forward:

  • Properly test the buffer overflow logic and that we don't lose unpersisted writes if we wrap around.
  • Test that behaviour with non-monotonic log writes is correct, since we support non-monotonic log stores. In practice we only support there being a "hole" in the logs after a snapshot restore, but make sure we don't end up corrupting the ring buffer state by leaving an old log in the hole, or by not leaving a hole and getting the buffer index and log index out of whack.
