Skip to content

Implement dynamic gang header sizes #17004

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
wants to merge 3 commits into from

Conversation

pcd1193182
Copy link
Contributor

Sponsored-by: Klara, Inc.
Sponsored-by: Wasabi Technology, Inc.

Motivation and Context

ZFS gang block headers are currently fixed at 512 bytes. This is increasingly wasteful in the era of larger disk sector sizes. This PR allows any size allocation to work as a gang header. It also contains supporting changes to ZDB to make gang headers easier to work with.

Description

The ZDB changes are first; basically, there are just some small tweaks to make it easier to work with gang blocks. First, the compact blkptr printer now notes which DVAs have the gang bit set. There is also a fix of a bug that's been around since 2009; ZDB gang block header printing has been broken since then. The problem is that if you do a zio_read of a BP with the gang bit set, you don't get back the header, you get back the underlying data. The fix is to just not set the gang bit.

The way dynamic sized gang headers work is that the amount of space we allow ourselves to use for the gang header is equal to the size of the smallest possible allocation on the vdevs we got back when we allocated it. This is necessary to work around the fact that the ASIZE for a gang BP isn't the allocated size of the gang header, but of the entire tree under it. Because of that fact, when reading, claiming, or freeing, the allocated size of a gang header must be able to be determined from no other information than the vdev it was allocated on. We reuse the existing vdev_gang_header_asize for this. We take this minimum space, and pack the block full of blkptrs, up to a tail with an embedded checksum. This allows us to store many more gang children per header, leading to much shallower gang trees if we're forced into intense ganging.

One important wrinkle is that if the pool we're using has old gang headers on it, they may be smaller than the smallest allocation of the relevant vdevs. We have a workaround for this in zio_checksum_error, which will now try the checksum again in the case of a gang block of size above 512 bytes, with the size reduced to 512. This should not pose a significant performance problem, since 1) calculating a checksum for a block that small doesn't take very long, and 2) hopefully the number of gang blocks on systems is not too high. This also only applies to systems that have both old gang headers and the feature enabled.

Much of the remaining changes are just tweaking the gang tree orchestration logic to work with a variable number of gang block pointers. The final interesting change is to the logic that determines the size of the gang children. When only 3 gang children were possible, it didn't really matter what the sector size was; you were going to be allocating at most 3 of them, so the space waste was limited. With large sector sizes, however, it could get expensive: A 128k allocation that gangs on a system with 64k sectors will, without changes, consume 256 64k sectors, with 512 bytes of data each, wasting 128x as much space as the original allocation. The fix is to use the spa_min_ashift rather than the MINBLOCKSIZE when calculating the lsize.

How Has This Been Tested?

In addition to the ZTS and zloop, extensive manual tests were performed, verifying that 1) the new ZDB functionality works, and 2) that larger gang headers are correctly used and allocated when applicable. Mixed ashift testing occured, as well as the full compatibility matrix of old pools, new pools without the feature enabled running against old and new code.

Types of changes

  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Performance enhancement (non-breaking change which improves efficiency)
  • Code cleanup (non-breaking change which makes code smaller or more readable)
  • Breaking change (fix or feature that would cause existing functionality to change)
  • Library ABI change (libzfs, libzfs_core, libnvpair, libuutil and libzfsbootenv)
  • Documentation (a change to man pages or other documentation)

Checklist:

@pcd1193182 pcd1193182 force-pushed the dyn_gang branch 2 times, most recently from 2ff3989 to 9843def Compare February 4, 2025 21:13
@pcd1193182
Copy link
Contributor Author

Here's a prototype of the changes to do variable leaf size allocations and use those for gang allocations.

It has three primary parts. The first is changes to the allocation code to support allocation ranges, instead of specific sizes. The second is the concept of a "preallocated" write zio, which uses the normal write code path but skips parts of it since the allocation has already been done. The third part is changes to the gang creation code to take advantage of both of the above.

I also added some tests for gang related code, since the test suite doesn't have that much in the way of those.

Copy link
Member

@amotin amotin left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I like the approach to variable size allocation, though code may use some polish. But in pre-allocated writes you've lost me. I haven't looked very close yet, but if we allocate space upfront, I wonder why it can't be implemented with existing rewrite zios?

More comments below, but I also wondering if we could implement variable size allocations first within existing gang blocks without introducing the new pool feature. I think it could help significantly by itself without breaking compatibility.

@pcd1193182
Copy link
Contributor Author

Rather than implementing disabling features, which is a little tricky, I am playing with a lighter-weight solution here. This PR adds the idea of a feature that isn't enabled by upgrade by default. You can still enable it manually, and if it's in your compat file it will be enabled. There are some quirks here; I'm not sure exactly how to handle the code in upgrade that prints all un-enabled features; should it display this one, even though it won't be added by upgrade? Should it skip this one, even though it isn't enabled? For now, I left it as the former; there are some upgrade tests that had to be tweaked to work with this approach. I'm curious if anyone has feelings about this idea, or preferences for what the behavior should be for listing unenabled features.

@github-actions github-actions bot added Status: Work in Progress Not yet ready for general review and removed Status: Code Review Needed Ready for review and testing labels Mar 8, 2025
@pcd1193182 pcd1193182 force-pushed the dyn_gang branch 2 times, most recently from f85a90b to 7a9513d Compare April 1, 2025 20:31
@pcd1193182
Copy link
Contributor Author

This is a rebased version of the changes; the first commit is the entirety of #17111 squashed down. I haven't run the tests or anything, just trying to keep this from getting too far out of sync with master.

@allanjude
Copy link
Contributor

re:

One important wrinkle is that if the pool we're using has old gang headers on it, they may be smaller than the smallest allocation of the relevant vdevs. We have a workaround for this in zio_checksum_error, which will now try the checksum again in the case of a gang block of size above 512 bytes, with the size reduced to 512. This should not pose a significant performance problem, since 1) calculating a checksum for a block that small doesn't take very long, and 2) hopefully the number of gang blocks on systems is not too high. This also only applies to systems that have both old gang headers and the feature enabled.

Does the feature activation refcount stuff track the TXG of when a feature was activated? Would we be able to use the birth_txg of the gang block to tell if it should be 512b vs minimum asize? or does that involve looking inside the block we are trying to read?

@pcd1193182
Copy link
Contributor Author

Does the feature activation refcount stuff track the TXG of when a feature was activated? Would we be able to use the birth_txg of the gang block to tell if it should be 512b vs minimum asize? or does that involve looking inside the block we are trying to read?

I believe we only track the TXG at which the feature was enabled. Because we use the small gang header unless we can't fit the necessary leaves, we can't just rely on that. A gang block created after the feature was enabled might use the small size, in order to avoid activating the feature unnecessarily.

If we had activation_txg as well, we would also need to tweak the code to always use the larger header once the feature is active. Not a big change, but worth noting.

@pcd1193182 pcd1193182 force-pushed the dyn_gang branch 2 times, most recently from 7dafb0d to 4fe393c Compare April 23, 2025 19:29
@pcd1193182 pcd1193182 force-pushed the dyn_gang branch 2 times, most recently from 63af44f to 26aa440 Compare May 8, 2025 18:13
@pcd1193182 pcd1193182 marked this pull request as ready for review May 8, 2025 18:13
@github-actions github-actions bot added Status: Code Review Needed Ready for review and testing and removed Status: Work in Progress Not yet ready for general review labels May 8, 2025
@tonyhutter
Copy link
Contributor

For now, I left it as the former; there are some upgrade tests that had to be tweaked to work with this approach. I'm curious if anyone has feelings about this idea, or preferences for what the behavior should be for listing unenabled features.

One other idea would be to have "creation-only feature flags" that are only enabled at pool creation time. You could never upgrade to these features on existing pools, nor would they be listed on the "zpool upgrade" list. The benefit is we could safely enable it by default (on new pools), and it removes the need for "optional feature upgrades", and might simplify some of the codepaths.

Sure, you wouldn't be able to upgrade existing pools, but that's an ephemeral cost. Over time, the fact that it's on by default for 100% of new pools will make up for it.


One other question - does the code work in such a way that we will always know if SPA_FEATURE_DYNAMIC_GANG_HEADER is active before we read/write our first data block? Or is it possible that we may read some gang blocks in early import before we know if SPA_FEATURE_DYNAMIC_GANG_HEADER is active?

@pcd1193182
Copy link
Contributor Author

One other idea would be to have "creation-only feature flags" that are only enabled at pool creation time. You could never upgrade to these features on existing pools, nor would they be listed on the "zpool upgrade" list. The benefit is we could safely enable it by default (on new pools), and it removes the need for "optional feature upgrades", and might simplify some of the codepaths.

My worry for this case would be that someone creates a boot pool with the new feature while unknowingly using an old bootloader that doesn't support it yet. It usually wouldn't be a huge issue (they'll find out when they boot the system onto the root pool for the first time), but it's an irritation. It is also valuable to be able to upgrade existing pools, since there are probably plenty of people out there with 4k pools who would like to use larger gang block headers without having to recreate their setups.

That said, creation-only flags are probably simpler than this design. If people feel strongly about it I'm not opposed to switching.

One other question - does the code work in such a way that we will always know if SPA_FEATURE_DYNAMIC_GANG_HEADER is active before we read/write our first data block? Or is it possible that we may read some gang blocks in early import before we know if SPA_FEATURE_DYNAMIC_GANG_HEADER is active?

I believe the way this works is that the label config contains a features_for_read list that stores all the label features the system needs to support in order to import the pool. During spa_add we load them into spa_label_features, and then in spa_ld_select_uberblock we check those features for support. That happens before we actually issue any reads, so we should never run into problems where we try to read something before we know if the feature is active.

I think this is a general problem that had to be solved for any read-incompatible mos-required feature (e.g. hole birth, embedded BPs, device removal, draid, etc), so there shouldn't be anything special here.

@pcd1193182
Copy link
Contributor Author

I can't figure out what's causing the ABI issues. When I run checkabi on my test VMs, it doesn't have any errors.

@pcd1193182 pcd1193182 force-pushed the dyn_gang branch 2 times, most recently from 5e204dd to 80edc68 Compare July 1, 2025 18:15
@amotin
Copy link
Member

amotin commented Jul 2, 2025

It looks good to me now, but can we have more eyes here?

Copy link
Member

@robn robn left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The logic looks right & straightforward. As a reviewer, I am clearly benefiting from those that have gone before, thank you.

My comments are almost entirely just around general reading and understanding. Take them or leave them as you like. The only must-change is the comment referencing zio_gb_tail_t. Easy approval after that, cheers!

Copy link
Contributor

@behlendorf behlendorf left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The updated ABI files in this PR don't match up at all with what I'd expect based on the code changes. Can you update the PR and drop the ABI files, then pull down the generated ABI files from the CI style checker. It seems like the changes should be minimal.

@pcd1193182 pcd1193182 force-pushed the dyn_gang branch 2 times, most recently from 2bf2f5f to dfdb987 Compare July 3, 2025 23:31
@behlendorf
Copy link
Contributor

Here's what I get when pulling the updated ABI files from https://github.com/openzfs/zfs/actions/runs/16062189973?pr=17004 and applying them to your branch. You'll want to make sure not to generate new ABI files locally since they'll differ significantly based on the version of libabigail used.

diff --git a/lib/libzfs/libzfs.abi b/lib/libzfs/libzfs.abi
index 35ecdca767..ecfd40efc4 100644
--- a/lib/libzfs/libzfs.abi
+++ b/lib/libzfs/libzfs.abi
@@ -631,7 +631,7 @@
     <elf-symbol name='fletcher_4_superscalar_ops' size='128' type='object-type' binding='global-binding' visibility='default-visibility' is-defined='yes'/>
     <elf-symbol name='libzfs_config_ops' size='16' type='object-type' binding='global-binding' visibility='default-visibility' is-defined='yes'/>
     <elf-symbol name='sa_protocol_names' size='16' type='object-type' binding='global-binding' visibility='default-visibility' is-defined='yes'/>
-    <elf-symbol name='spa_feature_table' size='2464' type='object-type' binding='global-binding' visibility='default-visibility' is-defined='yes'/>
+    <elf-symbol name='spa_feature_table' size='2520' type='object-type' binding='global-binding' visibility='default-visibility' is-defined='yes'/>
     <elf-symbol name='zfeature_checks_disable' size='4' type='object-type' binding='global-binding' visibility='default-visibility' is-defined='yes'/>
     <elf-symbol name='zfs_deleg_perm_tab' size='528' type='object-type' binding='global-binding' visibility='default-visibility' is-defined='yes'/>
     <elf-symbol name='zfs_history_event_names' size='328' type='object-type' binding='global-binding' visibility='default-visibility' is-defined='yes'/>
@@ -6210,7 +6210,8 @@
       <enumerator name='SPA_FEATURE_FAST_DEDUP' value='41'/>
       <enumerator name='SPA_FEATURE_LONGNAME' value='42'/>
       <enumerator name='SPA_FEATURE_LARGE_MICROZAP' value='43'/>
-      <enumerator name='SPA_FEATURES' value='44'/>
+      <enumerator name='SPA_FEATURE_DYNAMIC_GANG_HEADER' value='44'/>
+      <enumerator name='SPA_FEATURES' value='45'/>
     </enum-decl>
     <typedef-decl name='spa_feature_t' type-id='33ecb627' id='d6618c78'/>
     <qualified-type-def type-id='80f4b756' const='yes' id='b99c00c9'/>
@@ -9394,8 +9395,8 @@
     </function-decl>
   </abi-instr>
   <abi-instr address-size='64' path='module/zcommon/zfeature_common.c' language='LANG_C99'>
-    <array-type-def dimensions='1' type-id='83f29ca2' size-in-bits='19712' id='fd4573e5'>
-      <subrange length='44' type-id='7359adad' id='cf8ba455'/>
+    <array-type-def dimensions='1' type-id='83f29ca2' size-in-bits='20160' id='b948da70'>
+      <subrange length='45' type-id='7359adad' id='cb8ddca0'/>
     </array-type-def>
     <enum-decl name='zfeature_flags' id='6db816a4'>
       <underlying-type type-id='9cac1fee'/>
@@ -9403,6 +9404,7 @@
       <enumerator name='ZFEATURE_FLAG_MOS' value='2'/>
       <enumerator name='ZFEATURE_FLAG_ACTIVATE_ON_ENABLE' value='4'/>
       <enumerator name='ZFEATURE_FLAG_PER_DATASET' value='8'/>
+      <enumerator name='ZFEATURE_FLAG_NO_UPGRADE' value='16'/>
     </enum-decl>
     <typedef-decl name='zfeature_flags_t' type-id='6db816a4' id='fc329033'/>
     <enum-decl name='zfeature_type' id='c4fa2355'>
@@ -9472,7 +9474,7 @@
     <pointer-type-def type-id='611586a1' size-in-bits='64' id='2e243169'/>
     <qualified-type-def type-id='eaa32e2f' const='yes' id='83be723c'/>
     <pointer-type-def type-id='83be723c' size-in-bits='64' id='7acd98a2'/>
-    <var-decl name='spa_feature_table' type-id='fd4573e5' mangled-name='spa_feature_table' visibility='default' elf-symbol-id='spa_feature_table'/>
+    <var-decl name='spa_feature_table' type-id='b948da70' mangled-name='spa_feature_table' visibility='default' elf-symbol-id='spa_feature_table'/>
     <var-decl name='zfeature_checks_disable' type-id='c19b74c3' mangled-name='zfeature_checks_disable' visibility='default' elf-symbol-id='zfeature_checks_disable'/>
     <function-decl name='opendir' visibility='default' binding='global' size-in-bits='64'>
       <parameter type-id='80f4b756'/>

@pcd1193182
Copy link
Contributor Author

Here's what I get when pulling the updated ABI files from https://github.com/openzfs/zfs/actions/runs/16062189973?pr=17004 and applying them to your branch. You'll want to make sure not to generate new ABI files locally since they'll differ significantly based on the version of libabigail used.
...

That's what I did for the latest version, and looking at the current changes that looks right. Are you seeing something different?

@behlendorf
Copy link
Contributor

Things look good to me now with your last push. I must have somehow been looking at a stale version of the PR. Sorry about the confusion.

@behlendorf behlendorf added Status: Accepted Ready to integrate (reviewed, tested) and removed Status: Code Review Needed Ready for review and testing labels Jul 8, 2025
Paul Dagnelie added 3 commits July 8, 2025 10:28
Adds a featureflag that is not enabled during upgrades unless listed
explicitly. This is useful for features that could cause issues unless
applied carefully; for example, a feature that could make a root pool
unbootable if bootloaders don't yet have support for it.

Sponsored-by: Klara, Inc.
Sponsored-by: Wasabi Technology, Inc.
Signed-off-by: Paul Dagnelie <[email protected]>
ZFS gang block headers are currently fixed at 512 bytes. This is
increasingly wasteful in the era of larger disk sector sizes. This PR
allows any size allocation to work as a gang header. It also contains
supporting changes to ZDB to make gang headers easier to work with.

Sponsored-by: Klara, Inc.
Sponsored-by: Wasabi Technology, Inc.
Signed-off-by: Paul Dagnelie <[email protected]>
@github-actions github-actions bot removed the Status: Accepted Ready to integrate (reviewed, tested) label Jul 8, 2025
@amotin amotin added the Status: Accepted Ready to integrate (reviewed, tested) label Jul 9, 2025
@behlendorf behlendorf closed this in e845be2 Jul 9, 2025
behlendorf pushed a commit that referenced this pull request Jul 9, 2025
ZFS gang block headers are currently fixed at 512 bytes. This is
increasingly wasteful in the era of larger disk sector sizes. This PR
allows any size allocation to work as a gang header. It also contains
supporting changes to ZDB to make gang headers easier to work with.

Sponsored-by: Klara, Inc.
Sponsored-by: Wasabi Technology, Inc.
Reviewed-by: Alexander Motin <[email protected]>
Reviewed-by: Tony Hutter <[email protected]>
Reviewed-by: Brian Behlendorf <[email protected]>
Reviewed-by: Rob Norris <[email protected]>
Reviewed-by: Allan Jude <[email protected]>
Signed-off-by: Paul Dagnelie <[email protected]>
Closes #17004
ixhamza pushed a commit to truenas/zfs that referenced this pull request Jul 14, 2025
As discussed in the comments of PR openzfs#17004, you can theoretically run
into a case where a gang child has more copies than the gang header,
which can lead to some odd accounting behavior (and even trip a
VERIFY). While the accounting code could be changed to handle this, it
fundamentally doesn't seem to make a lot of sense to allow this to
happen. If the data is supposed to have a certain level of reliability,
that isn't actually achieved unless the gang_copies property is set to
match it.

Sponsored-by: Klara, Inc.
Sponsored-by: Wasabi Technology, Inc.
Reviewed-by: Alexander Motin <[email protected]>
Signed-off-by: Paul Dagnelie <[email protected]>
Closes openzfs#17484
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Status: Accepted Ready to integrate (reviewed, tested)
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants