vdev_raidz_asize_to_psize
: return psize
, not asize
#17488
Merged
Add this suggestion to a batch that can be applied as a single commit.
This suggestion is invalid because no changes were made to the code.
Suggestions cannot be applied while the pull request is closed.
Suggestions cannot be applied while viewing a subset of changes.
Only one suggestion per line can be applied in a batch.
Add this suggestion to a batch that can be applied as a single commit.
Applying suggestions on deleted lines is not supported.
You must change the existing code in this line in order to create a valid suggestion.
Outdated suggestions cannot be applied.
This suggestion has been applied or marked resolved.
Suggestions cannot be applied from pending reviews.
Suggestions cannot be applied on multi-line comments.
Suggestions cannot be applied while the pull request is queued to merge.
Suggestion cannot be applied right now. Please check back later.
[Sponsors: Klara, Inc., Wasabi Technology, Inc.]
Motivation and Context
Since 246e588, gang blocks written to raidz vdevs will write past the end of their allocation, corrupting themselves, other data, or both.
The reason is simple - when allocating the gang children, we call
vdev_psize_to_asize()
to find out how much data we should load into the allocation we just did.vdev_raidz_asize_to_psize()
had a bug; it computed thepsize
, but returned the originalasize
. The raidz layer dutifully writes that much out, into space beyond the end of the allocation.If there's existing data there, it gets overwritten, causing checksum errors when that data is read. Even there's not data there (unlikely, given that gang blocks are in play at all), that area is not considered allocated, so can be allocated and overwritten later.
Hello, casual reader: this bug is only present on the master/development branch, not in any release version of OpenZFS, so if you’ve never touched the master branch, you’ve nothing to worry about. If you have been running the master branch from any time in the last couple of months with a raidz pool that you care about, you should probably scrub right quick. Even if it comes up clean, try
zdb -b
and see if there any gang blocks (more likely for full pools). And if so, probably you need to consider rebuilding your pool from backup.Description
tl;dr: a one-character fix 😭
In lieu of anything more substantial, here’s some analysis.
A reproduction is fairly straightforward:
Note that the asize in the DVA is the same as the lsize/psize, which is not actually right for raidz. With the fix, it's more like this:
Corruption is easier to show with a program like this:
Without the patch, some number of files will usually be unreadable after the import:
Note that many of them are not even ganged: each numbered file is of that number kilobytes in size, so the first 32 are <= the force ganging threshold! Looking inside:
Viewing the first part of the on-disk data at that DVA:
The test writes the file number as contents, so this is actually data from file 58. If we look at that, we see it did gang:
file4 above starts at
0:3d000:1800
, whild file58's second gang child starts at0:3bc00:1400
, immediately before it.0x3bc00 + 0x1400 = 0x3d000
, so the allocation is correct, but if we dump that block and just a little more, we see that it overwrote it's allocation:That's 512 bytes over, which we can see with another look at file4:
Had the fix been in place,
asize_to_psize
would have been((0x1800 >> 9) - (1 * (0x1800 >> 9) / 3)) << 9 = 0x1000
, well within limits.How Has This Been Tested?
ZTS run in progress, but I do not expect any issues - this is only used by ganging on raidz, and we don’t really exercise that at all.
At least, I should probably turn the above reproduction into a test case: write a bunch of files under and over a low ganging threshold, reimport, and make sure we can read them all (and probably their content checksums match too).
I would have liked to add more protection in the code. I tried adding asserts in
asize_to_psize
andpsize_to_asize
to compute the inverse and ensurepsize
remains underasize
(commit: b741cfd) and it “works”, but it tripped in the vdev replacement tests. I was expecting something like that, knowing the theory ofvdev_indirect
but never having read the code. It confirmed what I already thought, which is there’s no actual need forpsize
andasize
to be at all correlated, so long as the vdev understands what is happening.I do wonder if there’s a more general thing we can do to check that we’re not writing “too much” if there’s a block pointer available at time of write, or something like it. Again though,
asize
isn’tpsize
, so what would that even mean?I’m also a little miffed that the compiler couldn’t tell me hey, you’ve done all this computation on the stack and then just thrown it away. I didn’t look yet to see if there’s a compiler flag that would help; I will soon, and if its there, turn it on to see what happens.
Types of changes
Checklist:
Signed-off-by
.