Currently, the CRC for each WAL chunk (a.k.a fragment) is computed as the chunk is emitted to a WAL block. This CRC computation is done while holding commitPipeline.mu which periodically shows up in mutex profiles. Prior profiling indicates that the CRC computation is ~1/3 of the CPU during commitPipeline.prepare. We could move the CRC computation out of commitPipeline.mu by not performing it during LogWriter.emitFragment* and instead perform it somewhere within LogWriter.flushLoop. Before writing a WAL block, or partial WAL block, to the WAL file we'd iterate over the fragments being written and populate the CRC. This iteration is straightforward as the fragments are tightly packed in the WAL block and are self-describing:
It isn't clear that this refactoring will be a win as performing the CRC computation on the flush goroutine will compete with time spent performing I/O. A quick experiment to see if this may be worthwhile would be to benchmark disabling the CRC computation in LogWriter.emitFragment*.
Quite a bit more radically than the above idea, we could move the memcpy of the batch representation out from under commitPipeline.mu. Once we know the offset within the WAL where a batch will be written, the number of fragments the batch will be broken into is deterministic based on the batch size. And the location of the next batch in the WAL can also be calculated deterministically. The sketch of what we could do is to have db.commitWrite return the offset of where the batch will be written to the WAL, release commitPipeline.mu, and then call back into LogWriter to actually stage the batch into fragments. If I squint this also seems possible, though quite complicated. We'd need to have the flush loop wait until the prefix of the WAL block it is trying to flush has been fully staged. The synchronization to make this work could be a non-starter.
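To illustrate why the layout is deterministic, here is a rough sketch that computes the fragment count and the offset of the next batch from a starting WAL offset and a batch size. The 32 KiB block size, the 11-byte header, and the rule that a block tail too small for a header is padded are assumptions borrowed from the LevelDB-style record format; `layout` is a hypothetical helper, not an existing Pebble function.

```go
package main

const (
	blockSize  = 32 << 10 // assumed WAL block size
	headerSize = 11       // assumed per-fragment header size
)

// layout returns how many fragments a batch of the given size is split
// into when it starts at the given WAL offset, and the offset at which
// the next batch would start.
func layout(offset, size int64) (fragments int, next int64) {
	for {
		avail := blockSize - offset%blockSize
		if avail < headerSize {
			// Not enough room for a header: the tail of the block is
			// padding and the fragment starts in the next block.
			offset += avail
			continue
		}
		payload := avail - headerSize
		if payload > size {
			payload = size
		}
		offset += headerSize + payload
		size -= payload
		fragments++
		if size == 0 {
			return fragments, offset
		}
	}
}
```

With a function like this, the commit path would only need to advance the WAL offset under `commitPipeline.mu`; the fragment headers and the memcpy could then be staged after the mutex is released.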
I attempted the quick experiment to disable the CRC computation and couldn't measure a perf difference on the kv0/enc=false/nodes=1/cpu=32 roachtest across 10 runs with and without this change. I did verify that the disabling was done correctly by looking at a CPU profile. I chose this roachtest out of a guess that it would be the most sensitive to a change in this area. I also tried completely disabling the memcpy of the batch into the WAL (so effectively we're writing zeroes for the WAL). Again, no measurable perf difference.
Certainly possible I did something wrong in this testing, but for now the TLDR is "nothing to see here".
Scalability of write-ahead logging on multicore/multisocket hardware. Section 5 (Scalable log buffer design for multicore) is interesting as the "baseline" design is more or less what Pebble/RocksDB are doing: a single mutex protects the entirety of the addition of a log record to the WAL. The solution is something I was wondering about above: reserving space in the WAL buffer, allowing writers to memcpy to the buffer in parallel, and then releasing the buffer reservations in the same order they were acquired.
Jira issue: PEBBLE-368