vdev_raidz_asize_to_psize: return psize, not asize #17488

Merged (1 commit, Jun 26, 2025)

Conversation

@robn robn commented Jun 26, 2025

[Sponsors: Klara, Inc., Wasabi Technology, Inc.]

Motivation and Context

Since 246e588, gang blocks written to raidz vdevs will write past the end of their allocation, corrupting themselves, other data, or both.

The reason is simple - when allocating the gang children, we call vdev_psize_to_asize() to find out how much data we should load into the allocation we just did. vdev_raidz_asize_to_psize() had a bug; it computed the psize, but returned the original asize. The raidz layer dutifully writes that much out, into space beyond the end of the allocation.

If there's existing data there, it gets overwritten, causing checksum errors when that data is read. Even if there's no data there (unlikely, given that gang blocks are in play at all), that area is not considered allocated, so it can be allocated and overwritten later.

Hello, casual reader: this bug is only present on the master/development branch, not in any release version of OpenZFS, so if you’ve never touched the master branch, you’ve nothing to worry about. If you have been running the master branch from any time in the last couple of months with a raidz pool that you care about, you should probably scrub right quick. Even if it comes up clean, try zdb -b and see if there are any gang blocks (more likely for full pools). And if so, you probably need to consider rebuilding your pool from backup.

Description

tl;dr: a one-character fix 😭

In lieu of anything more substantial, here’s some analysis.
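
For orientation, here's the shape of the bug as a minimal sketch. The function name and the parity arithmetic below are simplified stand-ins (they follow the rough formula used in the analysis further down), not the actual vdev_raidz.c code:

/*
 * Illustrative sketch only: hypothetical name, approximate arithmetic,
 * not the actual vdev_raidz_asize_to_psize(). Given an allocated size,
 * work out how much data fits in it by stripping the parity back out,
 * then return the wrong variable.
 */
#include <stdint.h>

static uint64_t
raidz_asize_to_psize_sketch(uint64_t asize, uint64_t ashift,
    uint64_t ndata, uint64_t nparity)
{
	uint64_t asize_sectors = asize >> ashift;

	/* roughly one parity sector for every ndata sectors allocated */
	uint64_t psize_sectors =
	    asize_sectors - nparity * (asize_sectors / ndata);

	uint64_t psize = psize_sectors << ashift;

	return (asize);		/* BUG: all that work, then return the input */
	/* return (psize); */	/* the fix */
}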

A reproduction is fairly straightforward:

# create a gang block
echo 32768 | tee /sys/module/zfs/parameters/metaslab_force_ganging
echo 100 | tee /sys/module/zfs/parameters/metaslab_force_ganging_pct
zpool create -O compression=off tank raidz1 loop0 loop1 loop2 loop3
dd if=/dev/urandom of=/tank/file bs=64K count=1
zpool sync

zdb -dddddbbbbbb tank/ $(stat -c %i /tank/file) | grep '0 L0'
               0 L0 DVA[0]=<0:23800:10800> DVA[1]=<0:10023800:400> [L0 ZFS plain file] fletcher4 uncompressed unencrypted LE gang unique double size=10000L/10000P birth=9L/9P fill=1 cksum=00001fb9c8da1386:03f8dd31b2bf59de:b2d4ce759e591c96:4044e75e35564569

# show the child bps
quiz# zdb -dddddbbbbbb tank/ $(stat -c %i /tank/file) | perl -lnE  '/L0 DVA\[0]=<([^>]+)/ && print $1' | xargs -i zdb -R tank {}:g
Found vdev type: raidz
DVA[0]=<0:23c00:e800> [L0 unallocated] fletcher4 uncompressed unencrypted LE contiguous unique single size=e800L/e800P birth=9L/9P fill=0 cksum=00001cc369eac90e:0343692f122117eb:067c8a6e5de5d877:ca9519d57ce034d4
DVA[0]=<0:9800:1400> [L0 unallocated] fletcher4 uncompressed unencrypted LE contiguous unique single size=1400L/1400P birth=9L/9P fill=0 cksum=000002760d82cb43:000627551ea28e28:0a5351e4884f6e70:000a32eb2efa03c7
DVA[0]=<0:32400:800> [L0 unallocated] fletcher4 uncompressed unencrypted LE contiguous unique single size=400L/400P birth=9L/9P fill=0 cksum=00000080516c7f35:000042247e7a1ccb:0016840f44a7e02f:05b1fb9e1fdea0ce

Note that the asize in the DVA (the last field of the <vdev:offset:asize> triple) is the same as the lsize/psize, which is not actually right for raidz: the allocation also has to cover parity (and padding), so the asize should be larger than the psize. With the fix, it's more like this:

DVA[0]=<0:23400:e800> [L0 unallocated] fletcher4 uncompressed unencrypted LE contiguous unique single size=ae00L/ae00P birth=9L/9P fill=0 cksum=000015f55b07bc3d:01dc0309bdcee547:f98036f5c22954cc:0b81ca7f3ad997c8
DVA[0]=<0:2000:3c00> [L0 unallocated] fletcher4 uncompressed unencrypted LE contiguous unique single size=2c00L/2c00P birth=9L/9P fill=0 cksum=000005851aec2d91:001e0c282e4370ce:6da97af36f3de31f:e286599c97a4aa5d
DVA[0]=<0:31c00:3400> [L0 unallocated] fletcher4 uncompressed unencrypted LE contiguous unique single size=2600L/2600P birth=9L/9P fill=0 cksum=000004ab0a427614:001612d9b84a7400:45549d02dfecbb73:73db21730709b094
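
As a sanity check on those fixed numbers (assuming 512-byte sectors and this raidz1 layout of 3 data + 1 parity columns): the first child's psize of 0xae00 is 87 sectors; 87 data sectors across 3 data columns is 29 rows, each carrying one parity sector, so the allocation needs 87 + 29 = 116 sectors = 0xe800, exactly the asize in the DVA.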

Corruption is easier to show with a script like this:

echo 32768 | tee /sys/module/zfs/parameters/metaslab_force_ganging
echo 100 | tee /sys/module/zfs/parameters/metaslab_force_ganging_pct
zpool create -O compression=off tank raidz1 loop0 loop1 loop2 loop3

ss=$(seq 1 128)

for s in $ss ; do
  bs=$(($s*1024))
  echo "$s $bs"
  dd if=/dev/zero bs=$bs count=1 status=none | tr -c '' "\\$(printf %o $s)" > /tmp/file
  dd if=/tmp/file of=/tank/file$s bs=$bs count=1 status=none
  zpool sync
done

zpool export -a
zpool import -a

for s in $ss ; do
  if ! cat /tank/file$s > /dev/null 2>&1 ; then
    echo "bad: $s"
  fi
done

Without the patch, some number of files will usually be unreadable after the import:

bad: 4
bad: 5
bad: 7
bad: 8
bad: 10
bad: 11
bad: 13
bad: 14

Note that many of them are not even ganged: each numbered file is that many kilobytes in size, so the first 32 are <= the force ganging threshold! Looking inside:

# zdb -bbbbbbddddd tank/ $(stat -c %i /tank/file4) | grep 'L0 DVA'
               0 L0 DVA[0]=<0:3d000:1800> [L0 ZFS plain file] fletcher4 uncompressed unencrypted LE contiguous unique single size=1000L/1000P birth=18L/18P fill=1 cksum=0000001010101000:0000202828280800:002af5a5a57ab000:2b15dde1b6cc0400

Viewing the first part of the on-disk data at that DVA:

# zdb -R tank 0:3d000:1800 | head
Found vdev type: raidz

0:3d000:1800
          0 1 2 3 4 5 6 7   8 9 a b c d e f  0123456789abcdef
000000:  3a3a3a3a3a3a3a3a  3a3a3a3a3a3a3a3a  ::::::::::::::::
000010:  3a3a3a3a3a3a3a3a  3a3a3a3a3a3a3a3a  ::::::::::::::::
000020:  3a3a3a3a3a3a3a3a  3a3a3a3a3a3a3a3a  ::::::::::::::::
000030:  3a3a3a3a3a3a3a3a  3a3a3a3a3a3a3a3a  ::::::::::::::::
000040:  3a3a3a3a3a3a3a3a  3a3a3a3a3a3a3a3a  ::::::::::::::::
000050:  3a3a3a3a3a3a3a3a  3a3a3a3a3a3a3a3a  ::::::::::::::::
000060:  3a3a3a3a3a3a3a3a  3a3a3a3a3a3a3a3a  ::::::::::::::::

The test writes the file number as the file's contents, so these 0x3a (58) bytes are actually data from file 58. If we look at that file, we see it did gang:

# zdb -bbbbbbddddd tank/ $(stat -c %i /tank/file58) | grep 'L0 DVA'
               0 L0 DVA[0]=<0:394000:f000> DVA[1]=<0:10210000:400> [L0 ZFS plain file] fletcher4 uncompressed unencrypted LE gang unique double size=e800L/e800P birth=180L/180P fill=1 cksum=00000d3131312400:017e992b29ac9200:e5e5a897cecf0c00:2e78d7a895494900

# zdb -R tank 0:394000:f000:g
Found vdev type: raidz
DVA[0]=<0:89000:d000> [L0 unallocated] fletcher4 uncompressed unencrypted LE contiguous unique single size=d000L/d000P birth=180L/180P fill=0 cksum=00000bd3d3d3c800:0133896d6c39e400:d3706ef233969800:54369fd44c68f200
DVA[0]=<0:3bc00:1400> [L0 unallocated] fletcher4 uncompressed unencrypted LE contiguous unique single size=1400L/1400P birth=180L/180P fill=0 cksum=0000012323232200:0002d86969669100:04bfeaa09be0b600:f37538bb4dcc0880
DVA[0]=<0:394c00:800> [L0 unallocated] fletcher4 uncompressed unencrypted LE contiguous unique single size=400L/400P birth=180L/180P fill=0 cksum=0000003a3a3a3a00:00001d3a3a3a1d00:0009d18f8f85be00:027bc10f8d13ce80

file4 above starts at 0:3d000:1800, while file58's second gang child starts at 0:3bc00:1400, immediately before it. 0x3bc00 + 0x1400 = 0x3d000, so the allocation is correct, but if we dump that block and just a little more, we see that it overwrote its allocation:

# zdb -R tank 0:3bc00:1800
Found vdev type: raidz

0:3bc00:1800
          0 1 2 3 4 5 6 7   8 9 a b c d e f  0123456789abcdef
000000:  3a3a3a3a3a3a3a3a  3a3a3a3a3a3a3a3a  ::::::::::::::::
000010:  3a3a3a3a3a3a3a3a  3a3a3a3a3a3a3a3a  ::::::::::::::::
000020:  3a3a3a3a3a3a3a3a  3a3a3a3a3a3a3a3a  ::::::::::::::::
000030:  3a3a3a3a3a3a3a3a  3a3a3a3a3a3a3a3a  ::::::::::::::::
000040:  3a3a3a3a3a3a3a3a  3a3a3a3a3a3a3a3a  ::::::::::::::::
000050:  3a3a3a3a3a3a3a3a  3a3a3a3a3a3a3a3a  ::::::::::::::::
000060:  3a3a3a3a3a3a3a3a  3a3a3a3a3a3a3a3a  ::::::::::::::::
...
0015c0:  3a3a3a3a3a3a3a3a  3a3a3a3a3a3a3a3a  ::::::::::::::::
0015d0:  3a3a3a3a3a3a3a3a  3a3a3a3a3a3a3a3a  ::::::::::::::::
0015e0:  3a3a3a3a3a3a3a3a  3a3a3a3a3a3a3a3a  ::::::::::::::::
0015f0:  3a3a3a3a3a3a3a3a  3a3a3a3a3a3a3a3a  ::::::::::::::::
001600:  0404040404040404  0404040404040404  ................
001610:  0404040404040404  0404040404040404  ................
001620:  0404040404040404  0404040404040404  ................
001630:  0404040404040404  0404040404040404  ................
...

That's 512 bytes past the end of its allocation (file58's data runs out to offset 0x1600 of a 0x1400-byte allocation), which we can see with another look at file4:

# zdb -R tank 0:3d000:1800 | head -40
Found vdev type: raidz

0:3d000:1800
          0 1 2 3 4 5 6 7   8 9 a b c d e f  0123456789abcdef
000000:  3a3a3a3a3a3a3a3a  3a3a3a3a3a3a3a3a  ::::::::::::::::
000010:  3a3a3a3a3a3a3a3a  3a3a3a3a3a3a3a3a  ::::::::::::::::
000020:  3a3a3a3a3a3a3a3a  3a3a3a3a3a3a3a3a  ::::::::::::::::
000030:  3a3a3a3a3a3a3a3a  3a3a3a3a3a3a3a3a  ::::::::::::::::
...
0001e0:  3a3a3a3a3a3a3a3a  3a3a3a3a3a3a3a3a  ::::::::::::::::
0001f0:  3a3a3a3a3a3a3a3a  3a3a3a3a3a3a3a3a  ::::::::::::::::
000200:  0404040404040404  0404040404040404  ................
000210:  0404040404040404  0404040404040404  ................
...

Had the fix been in place, asize_to_psize would have been ((0x1800 >> 9) - (1 * (0x1800 >> 9) / 3)) << 9 = 0x1000, well within limits.
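
Expanded: 0x1800 >> 9 is 12 sectors; 12 / 3 = 4 parity sectors to strip; 12 - 4 = 8 data sectors; 8 << 9 = 0x1000, matching file4's actual psize (size=1000L/1000P above).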

How Has This Been Tested?

ZTS run in progress, but I do not expect any issues - this is only used by ganging on raidz, and we don’t really exercise that at all.

At the least, I should probably turn the above reproduction into a test case: write a bunch of files under and over a low ganging threshold, reimport, and make sure we can read them all (and probably that their content checksums match too).

I would have liked to add more protection in the code. I tried adding asserts in asize_to_psize and psize_to_asize to compute the inverse and ensure psize remains under asize (commit: b741cfd), and it “works”, but it tripped in the vdev replacement tests. I was expecting something like that, knowing the theory of vdev_indirect but never having read the code. It confirmed what I already thought, which is that there's no actual need for psize and asize to be correlated at all, so long as the vdev understands what is happening.

I do wonder if there’s a more general thing we can do to check that we’re not writing “too much” if there’s a block pointer available at time of write, or something like it. Again though, asize isn’t psize, so what would that even mean?

I’m also a little miffed that the compiler couldn’t tell me “hey, you’ve done all this computation on the stack and then just thrown it away”. I didn’t look yet to see if there’s a compiler flag that would help; I will soon, and if it’s there, turn it on to see what happens.

Types of changes

  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Performance enhancement (non-breaking change which improves efficiency)
  • Code cleanup (non-breaking change which makes code smaller or more readable)
  • Quality assurance (non-breaking change which makes the code more robust against bugs)
  • Breaking change (fix or feature that would cause existing functionality to change)
  • Library ABI change (libzfs, libzfs_core, libnvpair, libuutil and libzfsbootenv)
  • Documentation (a change to man pages or other documentation)

Checklist:

Since 246e588, gang blocks written to raidz vdevs will write past the
end of their allocation, corrupting themselves, other data, or both.

The reason is simple - when allocating the gang children, we call
vdev_psize_to_asize() to find out how much data we should load into the
allocation we just did. vdev_raidz_asize_to_psize() had a bug; it
computed the psize, but returned the original asize. The raidz layer
dutifully writes that much out, into space beyond the end of the
allocation.

If there's existing data there, it gets overwritten, causing checksum
errors when that data is read. Even if there's no data there (unlikely,
given that gang blocks are in play at all), that area is not considered
allocated, so it can be allocated and overwritten later.

The fix is simple: return the psize we just computed.

Sponsored-by: Klara, Inc.
Sponsored-by: Wasabi Technology, Inc.
Signed-off-by: Rob Norris <[email protected]>
@robn robn requested a review from pcd1193182 June 26, 2025 11:04
@amotin amotin added the Status: Accepted Ready to integrate (reviewed, tested) label Jun 26, 2025
@amotin amotin merged commit ea076d6 into openzfs:master Jun 26, 2025
23 checks passed

2-5 commented Jul 13, 2025

@robn I tried asking Gemini to find the bug, and it did - https://g.co/gemini/share/28279f71bc53

Maybe all new PRs could be passed through a "there is a nasty bug in this PR, find it" filter?

@robn robn mentioned this pull request Jul 18, 2025