-
Notifications
You must be signed in to change notification settings - Fork 1.9k
Implement dynamic gang header sizes #17004
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
2ff3989
to
9843def
Compare
Here's a prototype of the changes to do variable leaf size allocations and use those for gang allocations. It has three primary parts. The first is changes to the allocation code to support allocation ranges, instead of specific sizes. The second is the concept of a "preallocated" write zio, which uses the normal write code path but skips parts of it since the allocation has already been done. The third part is changes to the gang creation code to take advantage of both of the above. I also added some tests for gang related code, since the test suite doesn't have that much in the way of those. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I like the approach to variable size allocation, though code may use some polish. But in pre-allocated writes you've lost me. I haven't looked very close yet, but if we allocate space upfront, I wonder why it can't be implemented with existing rewrite zios?
More comments below, but I also wondering if we could implement variable size allocations first within existing gang blocks without introducing the new pool feature. I think it could help significantly by itself without breaking compatibility.
Rather than implementing disabling features, which is a little tricky, I am playing with a lighter-weight solution here. This PR adds the idea of a feature that isn't enabled by upgrade by default. You can still enable it manually, and if it's in your compat file it will be enabled. There are some quirks here; I'm not sure exactly how to handle the code in upgrade that prints all un-enabled features; should it display this one, even though it won't be added by upgrade? Should it skip this one, even though it isn't enabled? For now, I left it as the former; there are some upgrade tests that had to be tweaked to work with this approach. I'm curious if anyone has feelings about this idea, or preferences for what the behavior should be for listing unenabled features. |
f85a90b
to
7a9513d
Compare
This is a rebased version of the changes; the first commit is the entirety of #17111 squashed down. I haven't run the tests or anything, just trying to keep this from getting too far out of sync with master. |
re:
Does the feature activation refcount stuff track the TXG of when a feature was activated? Would we be able to use the birth_txg of the gang block to tell if it should be 512b vs minimum asize? or does that involve looking inside the block we are trying to read? |
I believe we only track the TXG at which the feature was enabled. Because we use the small gang header unless we can't fit the necessary leaves, we can't just rely on that. A gang block created after the feature was enabled might use the small size, in order to avoid activating the feature unnecessarily. If we had activation_txg as well, we would also need to tweak the code to always use the larger header once the feature is active. Not a big change, but worth noting. |
7dafb0d
to
4fe393c
Compare
63af44f
to
26aa440
Compare
tests/zfs-tests/tests/functional/gang_blocks/gang_blocks_001_pos.ksh
Outdated
Show resolved
Hide resolved
One other idea would be to have "creation-only feature flags" that are only enabled at pool creation time. You could never upgrade to these features on existing pools, nor would they be listed on the "zpool upgrade" list. The benefit is we could safely enable it by default (on new pools), and it removes the need for "optional feature upgrades", and might simplify some of the codepaths. Sure, you wouldn't be able to upgrade existing pools, but that's an ephemeral cost. Over time, the fact that it's on by default for 100% of new pools will make up for it. One other question - does the code work in such a way that we will always know if |
My worry for this case would be that someone creates a boot pool with the new feature while unknowingly using an old bootloader that doesn't support it yet. It usually wouldn't be a huge issue (they'll find out when they boot the system onto the root pool for the first time), but it's an irritation. It is also valuable to be able to upgrade existing pools, since there are probably plenty of people out there with 4k pools who would like to use larger gang block headers without having to recreate their setups. That said, creation-only flags are probably simpler than this design. If people feel strongly about it I'm not opposed to switching.
I believe the way this works is that the label config contains a features_for_read list that stores all the label features the system needs to support in order to import the pool. During I think this is a general problem that had to be solved for any read-incompatible mos-required feature (e.g. hole birth, embedded BPs, device removal, draid, etc), so there shouldn't be anything special here. |
I can't figure out what's causing the ABI issues. When I run |
5e204dd
to
80edc68
Compare
It looks good to me now, but can we have more eyes here? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The logic looks right & straightforward. As a reviewer, I am clearly benefiting from those that have gone before, thank you.
My comments are almost entirely just around general reading and understanding. Take them or leave them as you like. The only must-change is the comment referencing zio_gb_tail_t
. Easy approval after that, cheers!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The updated ABI files in this PR don't match up at all with what I'd expect based on the code changes. Can you update the PR and drop the ABI files, then pull down the generated ABI files from the CI style checker. It seems like the changes should be minimal.
2bf2f5f
to
dfdb987
Compare
Here's what I get when pulling the updated ABI files from https://github.com/openzfs/zfs/actions/runs/16062189973?pr=17004 and applying them to your branch. You'll want to make sure not to generate new ABI files locally since they'll differ significantly based on the version of libabigail used. diff --git a/lib/libzfs/libzfs.abi b/lib/libzfs/libzfs.abi
index 35ecdca767..ecfd40efc4 100644
--- a/lib/libzfs/libzfs.abi
+++ b/lib/libzfs/libzfs.abi
@@ -631,7 +631,7 @@
<elf-symbol name='fletcher_4_superscalar_ops' size='128' type='object-type' binding='global-binding' visibility='default-visibility' is-defined='yes'/>
<elf-symbol name='libzfs_config_ops' size='16' type='object-type' binding='global-binding' visibility='default-visibility' is-defined='yes'/>
<elf-symbol name='sa_protocol_names' size='16' type='object-type' binding='global-binding' visibility='default-visibility' is-defined='yes'/>
- <elf-symbol name='spa_feature_table' size='2464' type='object-type' binding='global-binding' visibility='default-visibility' is-defined='yes'/>
+ <elf-symbol name='spa_feature_table' size='2520' type='object-type' binding='global-binding' visibility='default-visibility' is-defined='yes'/>
<elf-symbol name='zfeature_checks_disable' size='4' type='object-type' binding='global-binding' visibility='default-visibility' is-defined='yes'/>
<elf-symbol name='zfs_deleg_perm_tab' size='528' type='object-type' binding='global-binding' visibility='default-visibility' is-defined='yes'/>
<elf-symbol name='zfs_history_event_names' size='328' type='object-type' binding='global-binding' visibility='default-visibility' is-defined='yes'/>
@@ -6210,7 +6210,8 @@
<enumerator name='SPA_FEATURE_FAST_DEDUP' value='41'/>
<enumerator name='SPA_FEATURE_LONGNAME' value='42'/>
<enumerator name='SPA_FEATURE_LARGE_MICROZAP' value='43'/>
- <enumerator name='SPA_FEATURES' value='44'/>
+ <enumerator name='SPA_FEATURE_DYNAMIC_GANG_HEADER' value='44'/>
+ <enumerator name='SPA_FEATURES' value='45'/>
</enum-decl>
<typedef-decl name='spa_feature_t' type-id='33ecb627' id='d6618c78'/>
<qualified-type-def type-id='80f4b756' const='yes' id='b99c00c9'/>
@@ -9394,8 +9395,8 @@
</function-decl>
</abi-instr>
<abi-instr address-size='64' path='module/zcommon/zfeature_common.c' language='LANG_C99'>
- <array-type-def dimensions='1' type-id='83f29ca2' size-in-bits='19712' id='fd4573e5'>
- <subrange length='44' type-id='7359adad' id='cf8ba455'/>
+ <array-type-def dimensions='1' type-id='83f29ca2' size-in-bits='20160' id='b948da70'>
+ <subrange length='45' type-id='7359adad' id='cb8ddca0'/>
</array-type-def>
<enum-decl name='zfeature_flags' id='6db816a4'>
<underlying-type type-id='9cac1fee'/>
@@ -9403,6 +9404,7 @@
<enumerator name='ZFEATURE_FLAG_MOS' value='2'/>
<enumerator name='ZFEATURE_FLAG_ACTIVATE_ON_ENABLE' value='4'/>
<enumerator name='ZFEATURE_FLAG_PER_DATASET' value='8'/>
+ <enumerator name='ZFEATURE_FLAG_NO_UPGRADE' value='16'/>
</enum-decl>
<typedef-decl name='zfeature_flags_t' type-id='6db816a4' id='fc329033'/>
<enum-decl name='zfeature_type' id='c4fa2355'>
@@ -9472,7 +9474,7 @@
<pointer-type-def type-id='611586a1' size-in-bits='64' id='2e243169'/>
<qualified-type-def type-id='eaa32e2f' const='yes' id='83be723c'/>
<pointer-type-def type-id='83be723c' size-in-bits='64' id='7acd98a2'/>
- <var-decl name='spa_feature_table' type-id='fd4573e5' mangled-name='spa_feature_table' visibility='default' elf-symbol-id='spa_feature_table'/>
+ <var-decl name='spa_feature_table' type-id='b948da70' mangled-name='spa_feature_table' visibility='default' elf-symbol-id='spa_feature_table'/>
<var-decl name='zfeature_checks_disable' type-id='c19b74c3' mangled-name='zfeature_checks_disable' visibility='default' elf-symbol-id='zfeature_checks_disable'/>
<function-decl name='opendir' visibility='default' binding='global' size-in-bits='64'>
<parameter type-id='80f4b756'/> |
That's what I did for the latest version, and looking at the current changes that looks right. Are you seeing something different? |
Things look good to me now with your last push. I must have somehow been looking at a stale version of the PR. Sorry about the confusion. |
Adds a featureflag that is not enabled during upgrades unless listed explicitly. This is useful for features that could cause issues unless applied carefully; for example, a feature that could make a root pool unbootable if bootloaders don't yet have support for it. Sponsored-by: Klara, Inc. Sponsored-by: Wasabi Technology, Inc. Signed-off-by: Paul Dagnelie <[email protected]>
ZFS gang block headers are currently fixed at 512 bytes. This is increasingly wasteful in the era of larger disk sector sizes. This PR allows any size allocation to work as a gang header. It also contains supporting changes to ZDB to make gang headers easier to work with. Sponsored-by: Klara, Inc. Sponsored-by: Wasabi Technology, Inc. Signed-off-by: Paul Dagnelie <[email protected]>
Signed-off-by: Paul Dagnelie <[email protected]>
ZFS gang block headers are currently fixed at 512 bytes. This is increasingly wasteful in the era of larger disk sector sizes. This PR allows any size allocation to work as a gang header. It also contains supporting changes to ZDB to make gang headers easier to work with. Sponsored-by: Klara, Inc. Sponsored-by: Wasabi Technology, Inc. Reviewed-by: Alexander Motin <[email protected]> Reviewed-by: Tony Hutter <[email protected]> Reviewed-by: Brian Behlendorf <[email protected]> Reviewed-by: Rob Norris <[email protected]> Reviewed-by: Allan Jude <[email protected]> Signed-off-by: Paul Dagnelie <[email protected]> Closes #17004
As discussed in the comments of PR openzfs#17004, you can theoretically run into a case where a gang child has more copies than the gang header, which can lead to some odd accounting behavior (and even trip a VERIFY). While the accounting code could be changed to handle this, it fundamentally doesn't seem to make a lot of sense to allow this to happen. If the data is supposed to have a certain level of reliability, that isn't actually achieved unless the gang_copies property is set to match it. Sponsored-by: Klara, Inc. Sponsored-by: Wasabi Technology, Inc. Reviewed-by: Alexander Motin <[email protected]> Signed-off-by: Paul Dagnelie <[email protected]> Closes openzfs#17484
Sponsored-by: Klara, Inc.
Sponsored-by: Wasabi Technology, Inc.
Motivation and Context
ZFS gang block headers are currently fixed at 512 bytes. This is increasingly wasteful in the era of larger disk sector sizes. This PR allows any size allocation to work as a gang header. It also contains supporting changes to ZDB to make gang headers easier to work with.
Description
The ZDB changes are first; basically, there are just some small tweaks to make it easier to work with gang blocks. First, the compact blkptr printer now notes which DVAs have the gang bit set. There is also a fix of a bug that's been around since 2009; ZDB gang block header printing has been broken since then. The problem is that if you do a
zio_read
of a BP with the gang bit set, you don't get back the header, you get back the underlying data. The fix is to just not set the gang bit.The way dynamic sized gang headers work is that the amount of space we allow ourselves to use for the gang header is equal to the size of the smallest possible allocation on the vdevs we got back when we allocated it. This is necessary to work around the fact that the ASIZE for a gang BP isn't the allocated size of the gang header, but of the entire tree under it. Because of that fact, when reading, claiming, or freeing, the allocated size of a gang header must be able to be determined from no other information than the vdev it was allocated on. We reuse the existing
vdev_gang_header_asize
for this. We take this minimum space, and pack the block full of blkptrs, up to a tail with an embedded checksum. This allows us to store many more gang children per header, leading to much shallower gang trees if we're forced into intense ganging.One important wrinkle is that if the pool we're using has old gang headers on it, they may be smaller than the smallest allocation of the relevant vdevs. We have a workaround for this in
zio_checksum_error
, which will now try the checksum again in the case of a gang block of size above 512 bytes, with the size reduced to 512. This should not pose a significant performance problem, since 1) calculating a checksum for a block that small doesn't take very long, and 2) hopefully the number of gang blocks on systems is not too high. This also only applies to systems that have both old gang headers and the feature enabled.Much of the remaining changes are just tweaking the gang tree orchestration logic to work with a variable number of gang block pointers. The final interesting change is to the logic that determines the size of the gang children. When only 3 gang children were possible, it didn't really matter what the sector size was; you were going to be allocating at most 3 of them, so the space waste was limited. With large sector sizes, however, it could get expensive: A 128k allocation that gangs on a system with 64k sectors will, without changes, consume 256 64k sectors, with 512 bytes of data each, wasting 128x as much space as the original allocation. The fix is to use the spa_min_ashift rather than the MINBLOCKSIZE when calculating the lsize.
How Has This Been Tested?
In addition to the ZTS and zloop, extensive manual tests were performed, verifying that 1) the new ZDB functionality works, and 2) that larger gang headers are correctly used and allocated when applicable. Mixed ashift testing occured, as well as the full compatibility matrix of old pools, new pools without the feature enabled running against old and new code.
Types of changes
Checklist:
Signed-off-by
.