Skip to content

Fix 2 bugs in non-raw send with encryption #17340

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Draft
wants to merge 1 commit into
base: master
Choose a base branch
from

Conversation

gamanakis
Copy link
Contributor

@gamanakis gamanakis commented May 16, 2025

Motivation and Context

Closes #12014

Description

Bisecting identified the redacted send/receive as the source of the bug
for issue #12014. Specifically the call to
dsl_dataset_hold_obj(&fromds) had been replaced by
dsl_dataset_hold_obj_flags() which would pass a DECRYPT flag and create
a key mapping. The problem here arises when trying to access the
MASTER_NODE_OBJ that key mapping is used and the code panics. Fix this
by restoring the call to dsl_dataset_hold_obj().

Later on in dmu_send_obj() replace dsl_dataset_rele(&to_ds) with
dsl_dataset_rele_flags(), otherwise the created key mapping (on to_ds)
earlier on is leaked, which result in a panic when exporting the
sending pool or unloading the zfs module after a non-raw send from
an encrypted filesystem.

How Has This Been Tested?

Manually running the scripts https://github.com/HankB/provoke_ZFS_corruption

Types of changes

  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Performance enhancement (non-breaking change which improves efficiency)
  • Code cleanup (non-breaking change which makes code smaller or more readable)
  • Breaking change (fix or feature that would cause existing functionality to change)
  • Library ABI change (libzfs, libzfs_core, libnvpair, libuutil and libzfsbootenv)
  • Documentation (a change to man pages or other documentation)

Checklist:

@github-actions github-actions bot added the Status: Work in Progress Not yet ready for general review label May 16, 2025
@gamanakis
Copy link
Contributor Author

gamanakis commented May 16, 2025

How can I acknowledge @HankB and @pcd1193182 for the significant effort they put into this?

Copy link
Member

@amotin amotin left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Makes sense to me. Though after searching around this area for couple hours I still have no idea why it helps with the corruptions. I think it is only a trigger, but for what?

@tonyhutter
Copy link
Contributor

How can I acknowledge @HankB and @pcd1193182 for the significant effort they put into this?

Contributions-by: would be my recommendation. I see some of those in the git logs.

Copy link
Contributor

@behlendorf behlendorf left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The fixes here make sense. But I agree, it's not at all clear to me how this would lead to a corruption. Was the final reproducer for this something which could be reasonably adapted and included in the test suite?

Bisecting identified the redacted send/receive as the source of the bug
for issue openzfs#12014. Specifically the call to
dsl_dataset_hold_obj(&fromds) had been replaced by
dsl_dataset_hold_obj_flags() which would pass a DECRYPT flag and create
a key mapping. The problem here arises when trying to access the
MASTER_NODE_OBJ that key mapping is used and the code panics. Fix this
by restoring the call to dsl_dataset_hold_obj().

Later on in dmu_send_obj() replace dsl_dataset_rele(&to_ds) with
dsl_dataset_rele_flags(), otherwise the created key mapping (on to_ds)
earlier on is leaked, which result in a panic when exporting the
sending pool or unloading the zfs module after a non-raw send from
an encrypted filesystem.

Contributions-by: Hank Barta <[email protected]>
Contributions-by: Paul Dagnelie <[email protected]>
Signed-off-by: George Amanakis <[email protected]>
@ryao
Copy link
Contributor

ryao commented May 17, 2025

The fixes here make sense. But I agree, it's not at all clear to me how this would lead to a corruption. Was the final reproducer for this something which could be reasonably adapted and included in the test suite?

I have a WIP CodeQL check for instances of this problem in my git repository:

https://github.com/ryao/zfs/tree/issue-12014

I am waiting to see if it works. If it does, we should think about adding checks for other functions that could be accidentally mismatched.

@ryao
Copy link
Contributor

ryao commented May 17, 2025

ryao/zfs@7c921e3 works and detected the two issues fixed in this patch. The current iteration is limited to cases where the mismatched hold and release functions are called from the same function. I might try to handle more cases before I open a PR with the check.

@ryao
Copy link
Contributor

ryao commented May 17, 2025

The current iteration is limited to cases where the mismatched hold and release functions are called from the same function. I might try to handle more cases before I open a PR with the check.

I just did a manual search to see if this is even necessary. It turns out that it is. dmu_objset_hold_flags() calls dsl_dataset_hold_flags(), which calls dsl_dataset_hold_obj_flags(). The returned dsl_dataset_t pointer is then passed to dsl_dataset_rele(), which is wrong, on error from dmu_objset_from_ds() in dmu_objset_hold_flags(). If I expand the power of the check to catch mismatched hold and release functions across function calls, the check should be able to identify this and possibly other bugs.

@gamanakis
Copy link
Contributor Author

@ryao thank you for taking this to the next level.
@behlendorf I think expanding the ZTS would be great, but the scripts take about 2-3h to reproduce this.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Status: Work in Progress Not yet ready for general review
Projects
None yet
Development

Successfully merging this pull request may close these issues.

ZFS corruption related to snapshots post-2.0.x upgrade
6 participants