
record: experiment with moving WAL chunk CRC computation to the flush goroutine #4431

Open
petermattis opened this issue Mar 27, 2025 · 3 comments

@petermattis (Collaborator) commented Mar 27, 2025

Currently, the CRC for each WAL chunk (a.k.a. fragment) is computed as the chunk is emitted to a WAL block. This CRC computation is done while holding commitPipeline.mu, which periodically shows up in mutex profiles. Prior profiling indicates that the CRC computation accounts for ~1/3 of the CPU time in commitPipeline.prepare. We could move the CRC computation out from under commitPipeline.mu by not performing it in LogWriter.emitFragment* and instead performing it somewhere within LogWriter.flushLoop. Before writing a WAL block, or partial WAL block, to the WAL file, we'd iterate over the fragments being written and populate each CRC. This iteration is straightforward because the fragments are tightly packed in the WAL block and are self-describing:

+----------+-----------+-----------+----------------+--- ... ---+----------+-----+
| CRC (4B) | Size (2B) | Type (1B) | Log number (4B)| Payload   | CRC (4B) | ... |
+----------+-----------+-----------+----------------+--- ... ---+----------+-----+
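A minimal sketch of that scan, assuming the 11-byte header layout in the diagram and that the checksum covers the type byte, log number, and payload (the real implementation also masks the CRC; the function name and padding handling here are hypothetical):

```go
package wal

import (
	"encoding/binary"
	"hash/crc32"
)

var crcTable = crc32.MakeTable(crc32.Castagnoli)

// fillBlockCRCs walks the tightly packed chunks in a (possibly partial) WAL
// block and populates the CRC field of each chunk header. Sketch only: the
// header layout follows the diagram above, the checksum is assumed to cover
// the type byte, log number, and payload, and CRC masking is omitted.
func fillBlockCRCs(block []byte) {
	const headerSize = 11 // 4B CRC + 2B size + 1B type + 4B log number
	for off := 0; off+headerSize <= len(block); {
		if block[off+6] == 0 {
			break // assume a zero chunk type means trailing padding
		}
		size := int(binary.LittleEndian.Uint16(block[off+4 : off+6]))
		end := off + headerSize + size
		if end > len(block) {
			break // incomplete chunk; nothing more to checksum
		}
		crc := crc32.Checksum(block[off+6:end], crcTable)
		binary.LittleEndian.PutUint32(block[off:off+4], crc)
		off = end
	}
}
```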

It isn't clear that this refactoring would be a win, as performing the CRC computation on the flush goroutine would compete with the time that goroutine spends performing I/O. A quick experiment to gauge whether it is worthwhile would be to benchmark with the CRC computation in LogWriter.emitFragment* disabled.
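As a rough point of reference for the cost being moved, here is a standalone microbenchmark of the Castagnoli checksum over a few illustrative payload sizes; this is only a proxy for the work that would leave commitPipeline.prepare, not the roachtest-level experiment above, and the sizes are made up:

```go
package wal

import (
	"fmt"
	"hash/crc32"
	"testing"
)

// BenchmarkChunkCRC measures the raw cost of the Castagnoli checksum over a
// few illustrative fragment payload sizes. It says nothing about how this
// cost interacts with commitPipeline.mu or the flush goroutine's I/O.
func BenchmarkChunkCRC(b *testing.B) {
	table := crc32.MakeTable(crc32.Castagnoli)
	for _, size := range []int{64, 1 << 10, 32 << 10} {
		payload := make([]byte, size)
		b.Run(fmt.Sprintf("size=%d", size), func(b *testing.B) {
			b.SetBytes(int64(size))
			for i := 0; i < b.N; i++ {
				_ = crc32.Checksum(payload, table)
			}
		})
	}
}
```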

Jira issue: PEBBLE-368

@petermattis (Collaborator, Author) commented

Quite a bit more radical than the above idea: we could move the memcpy of the batch representation out from under commitPipeline.mu. Once we know the offset within the WAL where a batch will be written, the number of fragments the batch will be broken into is deterministic based on the batch size, and the location of the next batch in the WAL can also be calculated deterministically. The sketch of what we could do is to have db.commitWrite return the offset at which the batch will be written to the WAL, release commitPipeline.mu, and then call back into LogWriter to actually stage the batch into fragments. If I squint, this also seems possible, though quite complicated. We'd need to have the flush loop wait until the prefix of the WAL block it is trying to flush has been fully staged. The synchronization to make this work could be a non-starter.
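A very rough sketch of that shape, with hypothetical names that don't correspond to Pebble's actual LogWriter: reserve an offset under the mutex, copy the batch in outside the mutex, and have the flush loop consume only the contiguously staged prefix. For simplicity this version publishes the staged watermark in reservation order; handling out-of-order completion, chunk headers, and block boundaries is where the real complexity (and the possible non-starter) lives.

```go
package wal

import (
	"runtime"
	"sync"
	"sync/atomic"
)

// walBuffer is a toy stand-in for the WAL block buffers: a single flat byte
// slice instead of a ring of blocks, and no chunk headers or fragmentation.
type walBuffer struct {
	mu struct {
		sync.Mutex
		next int64 // next unreserved offset
	}
	staged atomic.Int64 // every byte below this offset has been fully copied
	buf    []byte
}

// reserve claims space for a batch while holding the mutex. In this sketch,
// this is the only work that needs to happen under commitPipeline.mu; the
// fragmentation of the batch is deterministic given the offset and size.
func (b *walBuffer) reserve(size int64) (offset int64) {
	b.mu.Lock()
	defer b.mu.Unlock()
	offset = b.mu.next
	b.mu.next += size // a real version would also account for chunk headers
	return offset
}

// stage copies the batch into the buffer outside the mutex, then advances the
// staged watermark once every preceding reservation has been staged.
func (b *walBuffer) stage(offset int64, payload []byte) {
	copy(b.buf[offset:], payload)
	for !b.staged.CompareAndSwap(offset, offset+int64(len(payload))) {
		runtime.Gosched() // wait for the preceding reservation to publish
	}
}

// The flush loop would write only the bytes below b.staged.Load(), waiting
// until the prefix of the block it wants to flush has been fully staged.
```

The spin-and-yield wait is only to keep the sketch short; a real version would need the staging writers and the flush loop to coordinate without burning CPU, which is exactly the synchronization cost flagged above.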

@petermattis (Collaborator, Author) commented

I attempted the quick experiment of disabling the CRC computation and couldn't measure a perf difference on the kv0/enc=false/nodes=1/cpu=32 roachtest across 10 runs with and without the change. I did verify that the disabling was done correctly by looking at a CPU profile. I chose this roachtest on the guess that it would be the most sensitive to a change in this area. I also tried completely disabling the memcpy of the batch into the WAL (so effectively we're writing zeroes to the WAL). Again, no measurable perf difference.

It's certainly possible I did something wrong in this testing, but for now the TL;DR is "nothing to see here".

@petermattis (Collaborator, Author) commented

Scalability of write-ahead logging on multicore/multisocket hardware. Section 5 (Scalable log buffer design for multicore) is interesting, as the “baseline” design is more or less what Pebble/RocksDB are doing: a single mutex protects the entire addition of a log record to the WAL. The solution is something I was wondering about above: reserving space in the WAL buffer, allowing writers to memcpy into the buffer in parallel, and then releasing the buffer reservations in the same order they were acquired.
