Fix integer overflow in compression_store.rs data retrieval logic #1781
base: main
Conversation
Thanks a lot for figuring this out!
Reviewable status: 0 of 2 LGTMs obtained, and 0 of 2 files reviewed, and pending CI: Cargo Dev / macos-15, Installation / macos-14, Installation / macos-15, NativeLink.com Cloud / Remote Cache / macos-15, Publish image, Publish nativelink-worker-init, Remote / lre-cc / xlarge-ubuntu-24.04, Remote / lre-rs / xlarge-ubuntu-24.04, windows-2022 / stable, and 9 discussions need to be resolved (waiting on @jaroeichler and @jhpratt)
nativelink-store/src/compression_store.rs
line 526 at r1 (raw file):
```rust
// Use saturating_add to prevent overflow when remaining_bytes_to_send is very large
let end_pos = cmp::min(
    start_pos.saturating_add(remaining_bytes_to_send as usize),
```
Interesting. So we were overflowing `usize`? In that case, would a `try_from` have prevented this? We haven't fixed `clippy::cast_possible_truncation` yet:
Lines 138 to 140 in d53363d
```
build --@rules_rust//:clippy_flag=-Aclippy::cast_possible_truncation
build --@rules_rust//:clippy_flag=-Aclippy::cast_possible_wrap
build --@rules_rust//:clippy_flag=-Aclippy::cast_precision_loss
```
In that case I'm slightly worried that while this fixes it, it masks the actual issue of the overflow, whereas a `try_from` with a branch to properly handle the overflow case could be more explicit.
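For illustration, a sketch of what the suggested `try_from` variant could look like. The helper function and its parameters are hypothetical stand-ins for the locals in `compression_store.rs`, not the actual implementation:

```rust
use std::cmp;

// Hypothetical sketch: convert with try_from and branch explicitly,
// instead of casting with `as` and relying on saturation alone.
fn compute_end_pos(start_pos: usize, remaining_bytes_to_send: u64, buf_len: usize) -> usize {
    match usize::try_from(remaining_bytes_to_send) {
        // The conversion fits; a saturating add still guards the sum itself.
        Ok(remaining) => cmp::min(start_pos.saturating_add(remaining), buf_len),
        // remaining_bytes_to_send exceeds usize::MAX (e.g. u64::MAX on a
        // 32-bit target): clamp explicitly to the end of the buffer.
        Err(_) => buf_len,
    }
}

fn main() {
    // "Send everything" case: clamps to the buffer end instead of overflowing.
    assert_eq!(compute_end_pos(65536, u64::MAX, 65536 * 3), 65536 * 3);
    // Ordinary case: plain bounded addition.
    assert_eq!(compute_end_pos(0, 100, 65536), 100);
}
```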
nativelink-store/tests/compression_store_test.rs
line 518 at r1 (raw file):
```rust
async fn regression_test_range_start_not_greater_than_end() -> Result<(), Error> {
    // Create a store with a small block size to trigger multiple blocks
    const BLOCK_SIZE: u32 = 64 * 1024; // 64KB, same as DEFAULT_BLOCK_SIZE
```
If it's the same as `DEFAULT_BLOCK_SIZE`, couldn't we use that instead?
nativelink-store/tests/compression_store_test.rs
line 534 at r1 (raw file):
```rust
)
.err_tip(|| "Failed to create compression store")?;
let store = Pin::new(&store_owned);
```
Let's inline the stores as they're only needed once.
nativelink-store/tests/compression_store_test.rs
line 537 at r1 (raw file):
```rust
// Create a large buffer that spans multiple blocks
let data_size = BLOCK_SIZE as usize * 3; // 3 blocks
```
Since we're trying to move to `clippy::cast_possible_truncation`, it would be nice to avoid too many uses of `as` in new code so that fixing it will be easier later on.
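As one possible cast-free form of that size computation (a sketch; it assumes `usize` is at least 32 bits, which holds on all supported targets):

```rust
const BLOCK_SIZE: u32 = 64 * 1024;

fn main() {
    // Instead of `BLOCK_SIZE as usize * 3`, make the conversion explicit
    // so clippy::cast_possible_truncation has nothing to flag.
    let data_size = usize::try_from(BLOCK_SIZE).expect("u32 fits in usize") * 3;
    assert_eq!(data_size, 196_608);
}
```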
nativelink-store/tests/compression_store_test.rs
line 550 at r1 (raw file):
```rust
// These specific offsets test the case in the bug report where
// start_pos was 65536 and end_pos was 65535
for (offset, length) in &[
```
If this test were run against the old implementation, I believe the first failure would prevent the other cases from running, i.e. if one case fails it hides the output of the cases after it. Since we're reusing the stores, it also seems like the different cases could influence each other.
nativelink-store/tests/compression_store_test.rs
line 556 at r1 (raw file):
```rust
// Specifically test the case where offset >= block size
(u64::from(BLOCK_SIZE), Some(20u64)),
// Specifically test the case that caused the bug (65536 and 65535)
```
Let's add the specific GitHub issue here for easier lookup later.
nativelink-store/tests/compression_store_test.rs
line 564 at r1 (raw file):
```rust
(u64::from(BLOCK_SIZE) * 2, Some(u64::from(BLOCK_SIZE) - 1)), // Same issue at next block
] {
    // First test with get_part_unchunked
```
nit: Remove redundant comment
nativelink-store/tests/compression_store_test.rs
line 576 at r1 (raw file):
```rust
let store_data = result.unwrap();
// Verify the data matches what we expect
```
nit: Remove redundant comment
nativelink-store/tests/compression_store_test.rs
line 606 at r1 (raw file):
```rust
// The error was happening in this method call
let get_part_result = store.get_part(digest, &mut tx, *offset, *length).await;
```
If the call above panicked, isn't this part (and the ones below it) essentially dead code that would be better as a separate test? I.e. a test for the higher-level `get_part_unchunked` and another one for the lower-level `get_part` might be good here.
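The concern about one panicking case hiding the rest can be shown in miniature. This is a standalone sketch, not the project's test harness; the closure and its bound check are hypothetical stand-ins for the `get_part_unchunked` call:

```rust
use std::panic;

fn main() {
    let cases: &[(u64, Option<u64>)] = &[(0, Some(10)), (65536, Some(20)), (131072, None)];
    let mut failed = Vec::new();
    for (i, case) in cases.iter().enumerate() {
        // Isolate each case so a panic doesn't abort the whole loop; in
        // real test code, separate #[test] functions achieve the same thing.
        let result = panic::catch_unwind(|| {
            assert!(case.0 <= 131072, "offset out of range");
        });
        if result.is_err() {
            failed.push(i); // record the failure and keep going
        }
    }
    // Every case ran, regardless of individual failures.
    assert!(failed.is_empty());
}
```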
Reviewed 2 of 2 files at r1, all commit messages.
nativelink-store/src/compression_store.rs
line 526 at r1 (raw file):
Previously, aaronmondal (Aaron Siddhartha Mondal) wrote…

> Interesting. So we were overflowing `usize`? In that case, would a `try_from` have prevented this? We haven't fixed `clippy::cast_possible_truncation` yet (lines 138 to 140 in d53363d):
>
> ```
> build --@rules_rust//:clippy_flag=-Aclippy::cast_possible_truncation
> build --@rules_rust//:clippy_flag=-Aclippy::cast_possible_wrap
> build --@rules_rust//:clippy_flag=-Aclippy::cast_precision_loss
> ```
>
> In that case I'm slightly worried that while this fixes it, it masks the actual issue of the overflow, whereas a `try_from` with a branch to properly handle the overflow case could be more explicit.
`try_from` addresses a different issue though. We should instead try to avoid these conversions. Can we define `remaining_bytes_to_send` as `usize` in the first place? In general, we have these kinds of conversion issues in a lot of places, where functions related to memory sizes expect `usize` while we'd like to do things in `u64`.

For the addition, `saturating_add` is reasonable since this should mostly occur if `length` is `None` and then `remaining_bytes_to_send` is `u64::MAX`.
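Concretely, with `length == None` the addition saturates instead of wrapping. A minimal sketch with stand-in values:

```rust
fn main() {
    // When `length` is None, remaining_bytes_to_send is u64::MAX.
    let remaining_bytes_to_send: u64 = u64::MAX;
    let start_pos: usize = 512;
    // A plain `+` would panic in debug builds and wrap in release builds;
    // saturating_add caps the result at usize::MAX instead.
    let end = start_pos.saturating_add(remaining_bytes_to_send as usize);
    assert_eq!(end, usize::MAX);
}
```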
nativelink-store/tests/compression_store_test.rs
line 534 at r1 (raw file):
Previously, aaronmondal (Aaron Siddhartha Mondal) wrote…

> Let's inline the stores as they're only needed once.
nit: If you change anything in the above part, please also adjust it in the other tests in this file.
nativelink-store/src/compression_store.rs
line 526 at r1 (raw file):
Previously, jaroeichler wrote…
try_from
addresses a different issue though. We should instead try to avoid these conversions. Can we defineremaining_bytes_to_send
asusize
in the first place? In general, we have these kind of conversion issues in a lot of places, where functions related to memory sizes expectusize
while we'd like to do things inu64
.For the addition,
saturating_add
is reasonable since this should mostly occur iflength
isNone
and thenremaining_bytes_to_send
isu64::MAX
.
Ah yeah, `saturating_add` solves a different issue here.

Regarding the conversions, `decompress_into` returns `usize`, so we can't move everything to `u64`. What we could potentially do is indeed change some of these sizes to `usize`.

I played around with these conversions a bit, but it's somewhat tricky and I don't think that we can get rid of "all" conversions.

I'd propose:

- We use this fix for now as it solves the original issue.
- After that we bump `bincode`, which is horribly outdated; with the `2.x` version we potentially get new flexibility.
- We attempt to reduce some of the conversions by using `usize` where appropriate.
- We consider moving from `lz4` to `zstd`, as it has a better compression ratio (at the cost of compression speed) and other store layers can act as in-between caches. I expect disk I/O and network transfers to be the main bottleneck here, meaning that the higher compression ratio likely outweighs the increase in compression effort, but we'll have to benchmark this to be sure.
@JannisFengler This is incredible. Thanks a lot for the fix. Could you please let us know if you need help wrapping this one up?

I will update this PR soon :)
Issue
`nativelink-store/src/compression_store.rs` had an integer overflow issue during data retrieval. On line 526, the code was calculating an end position by adding `start_pos + remaining_bytes_to_send as usize`, which could overflow when `remaining_bytes_to_send` was very large. When this overflow occurred, it would cause `end_pos` to wrap around to a smaller value, potentially making it smaller than `start_pos` in the slice operation, triggering an "attempt to add with overflow" panic.

Fix
Changed the standard addition operation to use Rust's saturating addition (`saturating_add`). This prevents integer overflow by capping the result at the maximum possible value for the type instead of wrapping around.
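A minimal before/after sketch of the behavior. The values here are stand-ins; the real code additionally clamps the result with `cmp::min` against the end of the buffer, as quoted in the review thread above:

```rust
use std::cmp;

fn main() {
    let start_pos: usize = 65536;
    let remaining_bytes_to_send: u64 = u64::MAX; // e.g. when no length was given
    let buf_len: usize = 65536 * 3; // hypothetical uncompressed buffer size

    // Before: `start_pos + remaining_bytes_to_send as usize` panics in debug
    // builds ("attempt to add with overflow") and wraps in release builds,
    // yielding an end position smaller than start_pos:
    assert_eq!(start_pos.wrapping_add(remaining_bytes_to_send as usize), 65535);

    // After: saturating_add caps at usize::MAX, so the min() clamp holds.
    let end_pos = cmp::min(
        start_pos.saturating_add(remaining_bytes_to_send as usize),
        buf_len,
    );
    assert_eq!(end_pos, buf_len);
    assert!(end_pos >= start_pos);
}
```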
Testing
All tests now pass, including the previously failing regression test that demonstrated this issue.
Related Issues
Fixes #1566