L2ARC: Add depth cap and write budget fairness for persistent markers#18289

Closed
ixhamza wants to merge 6 commits into openzfs:master from truenas:NAS-139817
ixhamza commented Mar 5, 2026

Motivation and Context

Follow-up to #18093. With persistent markers, scan positions drift indefinitely toward the head of the ARC eviction lists. Buffers near the head will stay in ARC for a while, so writing them to L2ARC adds little value because ARC already serves them. The tail is where buffers are closest to eviction and benefit most from L2ARC. Additionally, when eviction outpaces L2ARC write throughput, metadata passes run first and can fill the entire write budget every cycle, starving data passes of buffers that could have produced L2ARC hits.

Description

  • Even sublist headroom distribution: Divide headroom equally across sublists with round-robin visitation to prevent any single sublist from dominating the write budget.
  • Lazy sublist reset flags: Signal marker resets via per-sublist boolean flags instead of direct manipulation, consumed at scan start and end. Decouples reset signaling from active scans.
  • Scan-based depth cap: Track cumulative bytes scanned per pass. Reset markers to tail when scanning exceeds l2arc_ext_headroom_pct (default 25%) of ARC state size. Keeps markers in the tail zone where L2ARC adds the most value. Set to 0 to disable.
  • Write budget fairness: After l2arc_meta_cycles (default 2) consecutive cycles where metadata fills the budget, skip metadata for one cycle to let data run. Only triggers when data states have buffers to write. Set to 0 to disable.
  • Man page updates: Remove stale "inclusive caching" terminology and document new tunables.

How Has This Been Tested?

  • CI Testing
  • Manual unit tests.

Types of changes

  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Performance enhancement (non-breaking change which improves efficiency)
  • Code cleanup (non-breaking change which makes code smaller or more readable)
  • Quality assurance (non-breaking change which makes the code more robust against bugs)
  • Breaking change (fix or feature that would cause existing functionality to change)
  • Library ABI change (libzfs, libzfs_core, libnvpair, libuutil and libzfsbootenv)
  • Documentation (a change to man pages or other documentation)

behlendorf added the "Status: Code Review Needed" (Ready for review and testing) label Mar 6, 2026
ixhamza added 6 commits March 9, 2026 19:27
The dynamic headroom redistribution formula gave later sublists
progressively larger scanning budgets, and random sublist selection
caused uneven coverage across sublists. For depth cap to work
effectively, each sublist should be equally and fairly treated.
Use equal per-sublist headroom (headroom / num_sublists) for even
distribution and deterministic round-robin selection for fair
coverage across cycles.

Signed-off-by: Ameer Hamza <[email protected]>
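
For illustration only, the equal split and round-robin selection described in the commit message above can be sketched as follows. This is a minimal sketch, not the code from the patch: `rr_cursor_t`, `sublist_headroom()` and `rr_next_sublist()` are hypothetical names, and the real feed thread works on ARC multilists rather than plain indices.

```c
#include <stdint.h>

/* Hypothetical per-pass cursor; not an OpenZFS structure. */
typedef struct rr_cursor {
	unsigned next;	/* sublist index to visit first on the next cycle */
} rr_cursor_t;

/* Every sublist gets the same share: headroom / num_sublists. */
static uint64_t
sublist_headroom(uint64_t headroom, unsigned num_sublists)
{
	return (num_sublists == 0 ? 0 : headroom / num_sublists);
}

/*
 * Deterministic round-robin: return the current index and advance the
 * cursor, wrapping at num_sublists, so each sublist is visited equally
 * often across cycles.
 */
static unsigned
rr_next_sublist(rr_cursor_t *c, unsigned num_sublists)
{
	if (num_sublists == 0)
		return (0);
	unsigned idx = c->next % num_sublists;
	c->next = (idx + 1) % num_sublists;
	return (idx);
}
```

With a headroom of 8192 bytes and 4 sublists, each sublist would get 2048 bytes in this sketch, regardless of the order in which the sublists are visited.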
Replace direct marker-to-tail manipulation with per-sublist boolean
flags consumed lazily by feed threads.  Each scanning thread resets its
own marker when it sees the flag, rather than having another thread
manipulate the marker directly.

Signed-off-by: Ameer Hamza <[email protected]>
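
The lazy flag pattern from the commit message above might look roughly like the sketch below. It is an assumption-laden illustration: `pass_state_t`, `request_reset_all()` and `consume_reset()` are hypothetical names, a marker is modeled as a plain depth counter, and C11 atomics stand in for whatever synchronization the patch actually uses.

```c
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

#define	NUM_SUBLISTS	4	/* illustrative value only */

/* Hypothetical per-pass state; not the OpenZFS layout. */
typedef struct pass_state {
	atomic_bool reset_pending[NUM_SUBLISTS];
	size_t marker_depth[NUM_SUBLISTS];	/* 0 means "at the tail" */
} pass_state_t;

static void
pass_state_init(pass_state_t *ps)
{
	for (unsigned i = 0; i < NUM_SUBLISTS; i++) {
		atomic_init(&ps->reset_pending[i], false);
		ps->marker_depth[i] = 0;
	}
}

/* Signaling side: only raises flags, never touches a marker. */
static void
request_reset_all(pass_state_t *ps)
{
	for (unsigned i = 0; i < NUM_SUBLISTS; i++)
		atomic_store(&ps->reset_pending[i], true);
}

/*
 * Scanning side: the thread that owns sublist i clears the flag and
 * moves its own marker back to the tail. Returns true if a reset was
 * consumed.
 */
static bool
consume_reset(pass_state_t *ps, unsigned i)
{
	if (atomic_exchange(&ps->reset_pending[i], false)) {
		ps->marker_depth[i] = 0;
		return (true);
	}
	return (false);
}
```

The point of the design is that a reset request leaves the marker where it is until the owning scan looks at the flag, so no thread moves a marker that another thread may be using.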
With persistent markers and inclusive scanning, the marker traverses the
entire ARC state across many feed cycles, writing buffers far from the
tail that may no longer be relevant.

Track cumulative bytes scanned per pass in l2arc_ext_scanned. When scans
reach l2arc_ext_headroom_pct (default 25%) of the ARC state size, reset
the pass markers to the tail via lazy reset flags. This keeps markers
focused on the tail zone where buffers soon to be evicted have the most
value for L2ARC.

Signed-off-by: Ameer Hamza <[email protected]>
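
As a rough sketch of the depth cap arithmetic described above: the accumulator below adds the bytes scanned in each cycle and reports when the total reaches the configured percentage of the ARC state size. `depth_cap_t` and `depth_cap_account()` are hypothetical names, and clearing the counter once the cap is hit is an assumption of this sketch rather than something the commit message states.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical per-pass accumulator, standing in for l2arc_ext_scanned. */
typedef struct depth_cap {
	uint64_t scanned;	/* cumulative bytes scanned in this pass */
} depth_cap_t;

/*
 * Account for the bytes scanned in one feed cycle. Returns true when
 * the cumulative total reaches headroom_pct percent of state_size,
 * meaning the pass markers should be reset to the tail. A headroom_pct
 * of 0 disables the cap. Assumes state_size * headroom_pct fits in 64
 * bits.
 */
static bool
depth_cap_account(depth_cap_t *dc, uint64_t bytes, uint64_t state_size,
    uint64_t headroom_pct)
{
	dc->scanned += bytes;
	if (headroom_pct == 0)
		return (false);
	if (dc->scanned >= state_size * headroom_pct / 100) {
		dc->scanned = 0;	/* assumption: start a fresh pass */
		return (true);
	}
	return (false);
}
```

For a state size of 1000 bytes and the default of 25%, the cap in this sketch trips once 250 bytes have been scanned in total.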
Under heavy metadata load, metadata passes can monopolize the write
budget every cycle while data passes get nothing written. Track
consecutive monopolized cycles per device in l2ad_meta_cycles. After
l2arc_meta_cycles (default 2) consecutive cycles where metadata fills
the write budget, skip metadata for one cycle to let data run.  Reset
the counter when nothing is written.

Signed-off-by: Ameer Hamza <[email protected]>
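
One way to picture the fairness counter described above is the sketch below. The names `dev_fairness_t`, `should_skip_metadata()` and `end_of_cycle()` are hypothetical. Resetting the counter on a cycle where metadata did not fill the budget is an assumption drawn from the word "consecutive"; the commit message only states the reset when nothing is written.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical per-device state, standing in for l2ad_meta_cycles. */
typedef struct dev_fairness {
	uint64_t meta_cycles;	/* consecutive metadata-monopolized cycles */
} dev_fairness_t;

/*
 * Decide at the start of a cycle whether to skip the metadata passes.
 * limit plays the role of l2arc_meta_cycles; 0 disables the mechanism.
 * Skipping only makes sense when data states have buffers to write.
 */
static bool
should_skip_metadata(const dev_fairness_t *df, uint64_t limit,
    bool data_available)
{
	return (limit != 0 && data_available && df->meta_cycles >= limit);
}

/* Update the counter once the cycle has finished writing. */
static void
end_of_cycle(dev_fairness_t *df, bool meta_skipped, uint64_t meta_written,
    uint64_t total_written, uint64_t budget)
{
	if (meta_skipped || total_written == 0)
		df->meta_cycles = 0;	/* data had its turn, or idle cycle */
	else if (meta_written >= budget)
		df->meta_cycles++;	/* metadata filled the whole budget */
	else
		df->meta_cycles = 0;	/* assumption: streak is broken */
}
```

With the default limit of 2, this sketch lets metadata fill the budget for two cycles in a row, skips it on the third so data can run, and then starts counting again.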
prev_hdr is dereferenced after the sublist lock is dropped for write I/O
but nothing prevents it from being freed during that window. Eliminate
prev_hdr entirely and simplify persistent marker repositioning logic.

Signed-off-by: Ameer Hamza <[email protected]>
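
The commit message above does not spell out the new repositioning logic, so the following is only a generic sketch of why a marker node is safer than a cached neighbor pointer. The scan position lives in a marker that is itself linked into the list, so it stays valid if a neighboring element is removed while the lock is dropped. All names here (`node_t`, `list_t`, `marker_advance()`) are hypothetical, and locking is omitted.

```c
#include <stdbool.h>
#include <stddef.h>

/* Minimal circular doubly linked list with a sentinel head. */
typedef struct node {
	struct node *prev, *next;
	bool is_marker;	/* true for scan markers, false for real entries */
	int id;
} node_t;

typedef struct list {
	node_t head;	/* sentinel */
} list_t;

static void
list_init(list_t *l)
{
	l->head.prev = l->head.next = &l->head;
	l->head.is_marker = false;
	l->head.id = -1;
}

static void
list_insert_after(node_t *pos, node_t *n)
{
	n->prev = pos;
	n->next = pos->next;
	pos->next->prev = n;
	pos->next = n;
}

static void
list_remove(node_t *n)
{
	n->prev->next = n->next;
	n->next->prev = n->prev;
	n->prev = n->next = NULL;
}

/*
 * Return the next real entry after the marker and move the marker just
 * past it. Returns NULL at the end of the list. No pointer to a
 * previously visited entry is kept, so nothing dangles if that entry
 * is freed between calls.
 */
static node_t *
marker_advance(list_t *l, node_t *marker)
{
	node_t *n = marker->next;

	while (n != &l->head && n->is_marker)
		n = n->next;
	if (n == &l->head)
		return (NULL);
	list_remove(marker);
	list_insert_after(n, marker);
	return (n);
}
```

In this sketch, removing an entry between two calls to `marker_advance()` simply causes the scan to continue with whatever entry now follows the marker.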
behlendorf added the "Status: Accepted" (Ready to integrate: reviewed, tested) label and removed the "Status: Code Review Needed" (Ready for review and testing) label Mar 9, 2026
behlendorf pushed a commit that referenced this pull request Mar 10, 2026
Replace direct marker-to-tail manipulation with per-sublist boolean
flags consumed lazily by feed threads.  Each scanning thread resets its
own marker when it sees the flag, rather than having another thread
manipulate the marker directly.

Reviewed-by: Brian Behlendorf <[email protected]>
Reviewed-by: Alexander Motin <[email protected]>
Signed-off-by: Ameer Hamza <[email protected]>
Closes #18289
behlendorf pushed a commit that referenced this pull request Mar 10, 2026
With persistent markers and inclusive scanning, the marker traverses the
entire ARC state across many feed cycles, writing buffers far from the
tail that may no longer be relevant.

Track cumulative bytes scanned per pass in l2arc_ext_scanned. When scans
reach l2arc_ext_headroom_pct (default 25%) of the ARC state size, reset
the pass markers to the tail via lazy reset flags. This keeps markers
focused on the tail zone where buffers soon to be evicted have the most
value for L2ARC.

Reviewed-by: Brian Behlendorf <[email protected]>
Reviewed-by: Alexander Motin <[email protected]>
Signed-off-by: Ameer Hamza <[email protected]>
Closes #18289
behlendorf pushed a commit that referenced this pull request Mar 10, 2026
Under heavy metadata load, metadata passes can monopolize the write
budget every cycle while data passes get nothing written. Track
consecutive monopolized cycles per device in l2ad_meta_cycles. After
l2arc_meta_cycles (default 2) consecutive cycles where metadata fills
the write budget, skip metadata for one cycle to let data run.  Reset
the counter when nothing is written.

Reviewed-by: Brian Behlendorf <[email protected]>
Reviewed-by: Alexander Motin <[email protected]>
Signed-off-by: Ameer Hamza <[email protected]>
Closes #18289
behlendorf pushed a commit that referenced this pull request Mar 10, 2026
Reviewed-by: Brian Behlendorf <[email protected]>
Reviewed-by: Alexander Motin <[email protected]>
Signed-off-by: Ameer Hamza <[email protected]>
Closes #18289
behlendorf pushed a commit that referenced this pull request Mar 10, 2026
prev_hdr is dereferenced after the sublist lock is dropped for write I/O
but nothing prevents it from being freed during that window. Eliminate
prev_hdr entirely and simplify persistent marker repositioning logic.

Reviewed-by: Brian Behlendorf <[email protected]>
Reviewed-by: Alexander Motin <[email protected]>
Signed-off-by: Ameer Hamza <[email protected]>
Closes #18289
bugclerk pushed a commit to truenas/zfs that referenced this pull request Mar 10, 2026
The dynamic headroom redistribution formula gave later sublists
progressively larger scanning budgets, and random sublist selection
caused uneven coverage across sublists. For depth cap to work
effectively, each sublist should be equally and fairly treated.
Use equal per-sublist headroom (headroom / num_sublists) for even
distribution and deterministic round-robin selection for fair
coverage across cycles.

Reviewed-by: Brian Behlendorf <[email protected]>
Reviewed-by: Alexander Motin <[email protected]>
Signed-off-by: Ameer Hamza <[email protected]>
Closes openzfs#18289
(cherry picked from commit 61423b5)
bugclerk pushed a commit to truenas/zfs that referenced this pull request Mar 10, 2026
Replace direct marker-to-tail manipulation with per-sublist boolean
flags consumed lazily by feed threads.  Each scanning thread resets its
own marker when it sees the flag, rather than having another thread
manipulate the marker directly.

Reviewed-by: Brian Behlendorf <[email protected]>
Reviewed-by: Alexander Motin <[email protected]>
Signed-off-by: Ameer Hamza <[email protected]>
Closes openzfs#18289
(cherry picked from commit 4afadbc)
bugclerk pushed a commit to truenas/zfs that referenced this pull request Mar 10, 2026
With persistent markers and inclusive scanning, the marker traverses the
entire ARC state across many feed cycles, writing buffers far from the
tail that may no longer be relevant.

Track cumulative bytes scanned per pass in l2arc_ext_scanned. When scans
reach l2arc_ext_headroom_pct (default 25%) of the ARC state size, reset
the pass markers to the tail via lazy reset flags. This keeps markers
focused on the tail zone where buffers soon to be evicted have the most
value for L2ARC.

Reviewed-by: Brian Behlendorf <[email protected]>
Reviewed-by: Alexander Motin <[email protected]>
Signed-off-by: Ameer Hamza <[email protected]>
Closes openzfs#18289
(cherry picked from commit 2d85830)
bugclerk pushed a commit to truenas/zfs that referenced this pull request Mar 10, 2026
Under heavy metadata load, metadata passes can monopolize the write
budget every cycle while data passes get nothing written. Track
consecutive monopolized cycles per device in l2ad_meta_cycles. After
l2arc_meta_cycles (default 2) consecutive cycles where metadata fills
the write budget, skip metadata for one cycle to let data run.  Reset
the counter when nothing is written.

Reviewed-by: Brian Behlendorf <[email protected]>
Reviewed-by: Alexander Motin <[email protected]>
Signed-off-by: Ameer Hamza <[email protected]>
Closes openzfs#18289
(cherry picked from commit b5172ee)
bugclerk pushed a commit to truenas/zfs that referenced this pull request Mar 10, 2026
Reviewed-by: Brian Behlendorf <[email protected]>
Reviewed-by: Alexander Motin <[email protected]>
Signed-off-by: Ameer Hamza <[email protected]>
Closes openzfs#18289
(cherry picked from commit 8965fff)
bugclerk pushed a commit to truenas/zfs that referenced this pull request Mar 10, 2026
prev_hdr is dereferenced after the sublist lock is dropped for write I/O
but nothing prevents it from being freed during that window. Eliminate
prev_hdr entirely and simplify persistent marker repositioning logic.

Reviewed-by: Brian Behlendorf <[email protected]>
Reviewed-by: Alexander Motin <[email protected]>
Signed-off-by: Ameer Hamza <[email protected]>
Closes openzfs#18289
(cherry picked from commit 66a8726)