
Commit bd27b75

ZIL: Relax parallel write ZIOs processing
ZIL introduced dependencies between its write ZIOs to permit flush deferral, where vdev caches are flushed only once all of the write ZIOs have completed. But it was recently spotted that this serializes not only the handling of ZIO completions, but also their ready stage. It means the ZIO pipeline cannot calculate checksums for subsequent ZIOs until all the previous ones are checksummed, even though that is not required. On systems where the memory throughput of a single CPU core is limited, this creates a single-core CPU bottleneck, which is difficult to see due to the ZIO pipeline design with many taskqueue threads.

While it would be great to bypass the ready-stage waits entirely, that would require changes to the ZIO code, and I haven't found a clean way to do it. But I noticed that we don't need any dependency between the write ZIOs if the previous one has some waiters: in that case it won't defer any flushes and works as a barrier for the earlier ones.

Bypassing the dependency won't help large single-threaded writes, since all the write ZIOs except the last will then have no waiters and so remain dependent. But in that case ZIO processing is unlikely to be the bottleneck, since only one thread populates the write buffers and that thread will likely be the limiting factor. On multi-threaded write workloads, however, bypassing the ZIO dependency really allows them to scale beyond the checksumming throughput of one CPU core.

My tests writing 12 files on the same dataset from 12 threads with 1MB blocks, on a pool with 4 striped NVMes as SLOGs and a Xeon Silver 4114 CPU, show total throughput increasing from 4.3GB/s to 8.5GB/s, with SLOG busyness rising from ~30% to ~70%.

Reviewed-by: Brian Behlendorf <[email protected]>
Reviewed-by: Rob Norris <[email protected]>
Signed-off-by: Alexander Motin <[email protected]>
Sponsored by: iXsystems, Inc.
Closes #17458
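As a rough illustration of the bottleneck described above, here is a standalone toy model (not ZFS code; the thread count, buffer size, and multiply-accumulate "checksum" are arbitrary stand-ins). In "chained" mode each worker waits for the previous one before checksumming its buffer, mimicking how the old write-ZIO dependency serialized the ready stage, so the whole run proceeds at the pace of one core; without the chaining all workers checksum in parallel. Build with something like "cc -O2 model.c -o model -lpthread" and run with or without the "chained" argument:

/*
 * Toy model of serialized vs. parallel checksum (ready) stages.
 * Not ZFS code; constants and the checksum are arbitrary.
 */
#include <pthread.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

#define	NWORKERS	12
#define	BUFSZ		(1024 * 1024)

static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t cv = PTHREAD_COND_INITIALIZER;
static int next_ready;		/* index of the next worker allowed to run */
static int chained;		/* 1 = emulate the old write-ZIO dependency */

struct work {
	int id;
	uint8_t buf[BUFSZ];
	uint64_t sum;
};

static void *
worker(void *arg)
{
	struct work *w = arg;

	if (chained) {
		/* Wait until the previous "write ZIO" has reached ready. */
		pthread_mutex_lock(&lock);
		while (next_ready != w->id)
			pthread_cond_wait(&cv, &lock);
		pthread_mutex_unlock(&lock);
	}

	/* Stand-in for the checksum done in the ZIO ready stage. */
	for (size_t i = 0; i < BUFSZ; i++)
		w->sum += w->buf[i] * 31;

	if (chained) {
		/* Let the next worker proceed. */
		pthread_mutex_lock(&lock);
		next_ready++;
		pthread_cond_broadcast(&cv);
		pthread_mutex_unlock(&lock);
	}
	return (NULL);
}

int
main(int argc, char **argv)
{
	pthread_t tids[NWORKERS];
	static struct work work[NWORKERS];

	chained = (argc > 1 && strcmp(argv[1], "chained") == 0);

	for (int i = 0; i < NWORKERS; i++) {
		work[i].id = i;
		memset(work[i].buf, i, BUFSZ);
		pthread_create(&tids[i], NULL, worker, &work[i]);
	}
	for (int i = 0; i < NWORKERS; i++)
		pthread_join(tids[i], NULL);

	for (int i = 0; i < NWORKERS; i++)
		printf("worker %d sum %llu\n", i,
		    (unsigned long long)work[i].sum);
	return (0);
}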
1 parent b4ebba0 commit bd27b75

File tree: 1 file changed (+5 -2 lines)

module/zfs/zil.c

Lines changed: 5 additions & 2 deletions
@@ -1691,7 +1691,7 @@ zil_lwb_set_zio_dependency(zilog_t *zilog, lwb_t *lwb)
 	 * If the previous lwb's write hasn't already completed, we also want
 	 * to order the completion of the lwb write zios (above, we only order
 	 * the completion of the lwb root zios). This is required because of
-	 * how we can defer the flush commands for each lwb.
+	 * how we can defer the flush commands for any lwb without waiters.
 	 *
 	 * When the flush commands are deferred, the previous lwb will rely on
 	 * this lwb to flush the vdevs written to by that previous lwb. Thus,
@@ -1708,7 +1708,10 @@ zil_lwb_set_zio_dependency(zilog_t *zilog, lwb_t *lwb)
 	 */
 	if (prev_lwb->lwb_state == LWB_STATE_ISSUED) {
 		ASSERT3P(prev_lwb->lwb_write_zio, !=, NULL);
-		zio_add_child(lwb->lwb_write_zio, prev_lwb->lwb_write_zio);
+		if (list_is_empty(&prev_lwb->lwb_waiters)) {
+			zio_add_child(lwb->lwb_write_zio,
+			    prev_lwb->lwb_write_zio);
+		}
 	} else {
 		ASSERT3S(prev_lwb->lwb_state, ==, LWB_STATE_WRITE_DONE);
 	}
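
For readability, here is the changed branch reassembled from the hunk above (surrounding function context elided), with extra comments summarizing the rationale from the commit message:

	if (prev_lwb->lwb_state == LWB_STATE_ISSUED) {
		ASSERT3P(prev_lwb->lwb_write_zio, !=, NULL);
		/*
		 * Chain this lwb's write ZIO behind the previous one only
		 * when the previous lwb has no waiters.  A waiter-less lwb
		 * may defer its flush commands to this lwb, so the two
		 * write ZIOs must stay ordered.  An lwb with waiters will
		 * issue its own flushes and already acts as a barrier, and
		 * skipping the dependency here lets the ready (checksum)
		 * stages of consecutive write ZIOs run in parallel.
		 */
		if (list_is_empty(&prev_lwb->lwb_waiters)) {
			zio_add_child(lwb->lwb_write_zio,
			    prev_lwb->lwb_write_zio);
		}
	} else {
		ASSERT3S(prev_lwb->lwb_state, ==, LWB_STATE_WRITE_DONE);
	}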
