Fix 2 bugs in non-raw send with encryption #17340

gamanakis · 2025-05-16T13:46:07Z

Motivation and Context

Closes #12014

Description

Bisecting identified the redacted send/receive as the source of the bug
for issue #12014. Specifically the call to
dsl_dataset_hold_obj(&fromds) had been replaced by
dsl_dataset_hold_obj_flags() which would pass a DECRYPT flag and create
a key mapping. The problem here arises when trying to access the
MASTER_NODE_OBJ that key mapping is used and the code panics. Fix this
by restoring the call to dsl_dataset_hold_obj().

Later on in dmu_send_obj() replace dsl_dataset_rele(&to_ds) with
dsl_dataset_rele_flags(), otherwise the created key mapping (on to_ds)
earlier on is leaked, which result in a panic when exporting the
sending pool or unloading the zfs module after a non-raw send from
an encrypted filesystem.

How Has This Been Tested?

Manually running the scripts https://github.com/HankB/provoke_ZFS_corruption

Types of changes

Bug fix (non-breaking change which fixes an issue)
New feature (non-breaking change which adds functionality)
Performance enhancement (non-breaking change which improves efficiency)
Code cleanup (non-breaking change which makes code smaller or more readable)
Breaking change (fix or feature that would cause existing functionality to change)
Library ABI change (libzfs, libzfs_core, libnvpair, libuutil and libzfsbootenv)
Documentation (a change to man pages or other documentation)

Checklist:

My code follows the OpenZFS code style requirements.
I have updated the documentation accordingly.
I have read the contributing document.
I have added tests to cover my changes.
I have run the ZFS Test Suite with this change applied.
All commit messages are properly formatted and contain Signed-off-by.

gamanakis · 2025-05-16T16:38:52Z

How can I acknowledge @HankB and @pcd1193182 for the significant effort they put into this?

amotin

Makes sense to me. Though after searching around this area for couple hours I still have no idea why it helps with the corruptions. I think it is only a trigger, but for what?

tonyhutter · 2025-05-16T18:01:47Z

How can I acknowledge @HankB and @pcd1193182 for the significant effort they put into this?

Contributions-by: would be my recommendation. I see some of those in the git logs.

behlendorf

The fixes here make sense. But I agree, it's not at all clear to me how this would lead to a corruption. Was the final reproducer for this something which could be reasonably adapted and included in the test suite?

Bisecting identified the redacted send/receive as the source of the bug for issue openzfs#12014. Specifically the call to dsl_dataset_hold_obj(&fromds) had been replaced by dsl_dataset_hold_obj_flags() which would pass a DECRYPT flag and create a key mapping. The problem here arises when trying to access the MASTER_NODE_OBJ that key mapping is used and the code panics. Fix this by restoring the call to dsl_dataset_hold_obj(). Later on in dmu_send_obj() replace dsl_dataset_rele(&to_ds) with dsl_dataset_rele_flags(), otherwise the created key mapping (on to_ds) earlier on is leaked, which result in a panic when exporting the sending pool or unloading the zfs module after a non-raw send from an encrypted filesystem. Contributions-by: Hank Barta <[email protected]> Contributions-by: Paul Dagnelie <[email protected]> Signed-off-by: George Amanakis <[email protected]>

ryao · 2025-05-17T01:45:43Z

The fixes here make sense. But I agree, it's not at all clear to me how this would lead to a corruption. Was the final reproducer for this something which could be reasonably adapted and included in the test suite?

I have a WIP CodeQL check for instances of this problem in my git repository:

https://github.com/ryao/zfs/tree/issue-12014

I am waiting to see if it works. If it does, we should think about adding checks for other functions that could be accidentally mismatched.

ryao · 2025-05-17T04:31:52Z

ryao/zfs@7c921e3 works and detected the two issues fixed in this patch. The current iteration is limited to cases where the mismatched hold and release functions are called from the same function. I might try to handle more cases before I open a PR with the check.

ryao · 2025-05-17T05:00:22Z

The current iteration is limited to cases where the mismatched hold and release functions are called from the same function. I might try to handle more cases before I open a PR with the check.

I just did a manual search to see if this is even necessary. It turns out that it is. dmu_objset_hold_flags() calls dsl_dataset_hold_flags(), which calls dsl_dataset_hold_obj_flags(). The returned dsl_dataset_t pointer is then passed to dsl_dataset_rele(), which is wrong, on error from dmu_objset_from_ds() in dmu_objset_hold_flags(). If I expand the power of the check to catch mismatched hold and release functions across function calls, the check should be able to identify this and possibly other bugs.

gamanakis · 2025-05-17T08:06:15Z

@ryao thank you for taking this to the next level.
@behlendorf I think expanding the ZTS would be great, but the scripts take about 2-3h to reproduce this.

github-actions bot added the Status: Work in Progress Not yet ready for general review label May 16, 2025

gamanakis force-pushed the pr12014 branch from faa1b9e to af37285 Compare May 16, 2025 16:37

amotin approved these changes May 16, 2025

View reviewed changes

behlendorf approved these changes May 16, 2025

View reviewed changes

gamanakis force-pushed the pr12014 branch from af37285 to 23e42f3 Compare May 16, 2025 20:31

ryao approved these changes May 17, 2025

View reviewed changes

robn approved these changes May 17, 2025

View reviewed changes

lzsaver mentioned this pull request May 17, 2025

[2.2.8] various Linux kernel updates #17325

Open

13 tasks

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Fix 2 bugs in non-raw send with encryption #17340

Fix 2 bugs in non-raw send with encryption #17340

gamanakis commented May 16, 2025 •

edited

Loading

gamanakis commented May 16, 2025 •

edited

Loading

amotin left a comment

tonyhutter commented May 16, 2025

behlendorf left a comment

ryao commented May 17, 2025

ryao commented May 17, 2025

ryao commented May 17, 2025 •

edited

Loading

gamanakis commented May 17, 2025

Fix 2 bugs in non-raw send with encryption #17340

Are you sure you want to change the base?

Fix 2 bugs in non-raw send with encryption #17340

Conversation

gamanakis commented May 16, 2025 • edited Loading

Motivation and Context

Description

How Has This Been Tested?

Types of changes

Checklist:

gamanakis commented May 16, 2025 • edited Loading

amotin left a comment

Choose a reason for hiding this comment

tonyhutter commented May 16, 2025

behlendorf left a comment

Choose a reason for hiding this comment

ryao commented May 17, 2025

ryao commented May 17, 2025

ryao commented May 17, 2025 • edited Loading

gamanakis commented May 17, 2025

gamanakis commented May 16, 2025 •

edited

Loading

gamanakis commented May 16, 2025 •

edited

Loading

ryao commented May 17, 2025 •

edited

Loading