
Fix integer overflow in compression_store.rs data retrieval logic #1781

Open

JannisFengler wants to merge 1 commit into main

Conversation

@JannisFengler JannisFengler commented May 21, 2025

Issue

nativelink-store/src/compression_store.rs had an integer overflow in its data retrieval path. On line 526, the code computed an end position as start_pos + remaining_bytes_to_send as usize, which overflows when remaining_bytes_to_send is very large (for example u64::MAX when no length limit is requested). With overflow checks enabled this addition panics with "attempt to add with overflow"; in release builds it would instead wrap end_pos around to a smaller value, potentially below start_pos in the subsequent slice operation.

Fix

Changed the standard addition operation to use Rust's saturating addition:

// Before
let end_pos = start_pos + remaining_bytes_to_send as usize;

// After
let end_pos = start_pos.saturating_add(remaining_bytes_to_send as usize);

This prevents the overflow by capping the result at usize::MAX instead of panicking in debug builds or wrapping around in release builds.
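As a minimal standalone sketch of the difference (not the store code itself; it assumes a 64-bit target where usize is 64 bits wide):

```rust
fn main() {
    let start_pos: usize = 65_536;
    // With no length limit, the remaining byte count is effectively unbounded.
    let remaining_bytes_to_send: u64 = u64::MAX;

    // A plain `+` here would panic in debug builds ("attempt to add with overflow")
    // and silently wrap in release builds.
    let end_pos = start_pos.saturating_add(remaining_bytes_to_send as usize);

    // Saturation caps the result at usize::MAX, so `end_pos >= start_pos` always
    // holds and a later `cmp::min(end_pos, chunk_len)` clamp stays well-formed.
    assert_eq!(end_pos, usize::MAX);
    assert!(end_pos >= start_pos);
}
```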

Testing

All tests now pass, including the previously failing regression test that demonstrated this issue.

Related Issues

Fixes #1566


This change is Reviewable

CLAassistant commented May 21, 2025

CLA assistant check
All committers have signed the CLA.

Contributor

@aaronmondal aaronmondal left a comment


Thanks a lot for figuring this out!

+@jaroeichler +@jhpratt

Reviewable status: 0 of 2 LGTMs obtained, and 0 of 2 files reviewed, and pending CI: Cargo Dev / macos-15, Installation / macos-14, Installation / macos-15, NativeLink.com Cloud / Remote Cache / macos-15, Publish image, Publish nativelink-worker-init, Remote / lre-cc / xlarge-ubuntu-24.04, Remote / lre-rs / xlarge-ubuntu-24.04, windows-2022 / stable, and 9 discussions need to be resolved (waiting on @jaroeichler and @jhpratt)


nativelink-store/src/compression_store.rs line 526 at r1 (raw file):

                        // Use saturating_add to prevent overflow when remaining_bytes_to_send is very large
                        let end_pos = cmp::min(
                            start_pos.saturating_add(remaining_bytes_to_send as usize),

Interesting. So we were overflowing usize? In that case, would a try_from have prevented this? We haven't fixed clippy::cast_possible_truncation yet:

nativelink/.bazelrc

Lines 138 to 140 in d53363d

build --@rules_rust//:clippy_flag=-Aclippy::cast_possible_truncation
build --@rules_rust//:clippy_flag=-Aclippy::cast_possible_wrap
build --@rules_rust//:clippy_flag=-Aclippy::cast_precision_loss

In that case I'm slightly worried that while this fixes it, it masks the actual issue of the overflow, whereas a try_from with a branch to properly handle the overflow case could be more explicit.
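A rough sketch of that alternative shape (hypothetical helper and variable names, not the actual store code): an explicit try_from plus a branch for the value-does-not-fit case, with checked_add keeping the addition itself visible:

```rust
use std::cmp;

// Hypothetical stand-in for the end-position computation; `chunk_len` is a
// placeholder for the length of the buffer being sliced.
fn compute_end_pos(start_pos: usize, remaining_bytes_to_send: u64, chunk_len: usize) -> usize {
    match usize::try_from(remaining_bytes_to_send) {
        // The count fits in usize: still guard the addition explicitly.
        Ok(remaining) => match start_pos.checked_add(remaining) {
            Some(end) => cmp::min(end, chunk_len),
            // Adding would overflow, so send everything left in this chunk.
            None => chunk_len,
        },
        // The count does not even fit in usize (only possible on targets
        // narrower than 64 bits), so again send everything left in the chunk.
        Err(_) => chunk_len,
    }
}
```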


nativelink-store/tests/compression_store_test.rs line 518 at r1 (raw file):

async fn regression_test_range_start_not_greater_than_end() -> Result<(), Error> {
    // Create a store with a small block size to trigger multiple blocks
    const BLOCK_SIZE: u32 = 64 * 1024; // 64KB, same as DEFAULT_BLOCK_SIZE

If it's the same as DEFAULT_BLOCK_SIZE, couldn't we use that instead?


nativelink-store/tests/compression_store_test.rs line 534 at r1 (raw file):

    )
    .err_tip(|| "Failed to create compression store")?;
    let store = Pin::new(&store_owned);

Let's inline the stores as they're only needed once.


nativelink-store/tests/compression_store_test.rs line 537 at r1 (raw file):

    // Create a large buffer that spans multiple blocks
    let data_size = BLOCK_SIZE as usize * 3; // 3 blocks

Since we're trying to move to clippy::cast_possible_truncation, it would be nice to avoid too many uses of as in new code so that fixing it will be easier later on.
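For instance, a small illustrative rewrite of the quoted line (reusing the test's BLOCK_SIZE constant; the expect message is just an assumption for the sketch):

```rust
const BLOCK_SIZE: u32 = 64 * 1024;

fn main() {
    // usize::try_from makes the width assumption explicit instead of a silent
    // `as` cast; converting u32 -> usize can only fail on targets narrower
    // than 32 bits.
    let data_size = usize::try_from(BLOCK_SIZE).expect("usize is at least 32 bits") * 3; // 3 blocks
    assert_eq!(data_size, 3 * 64 * 1024);
}
```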


nativelink-store/tests/compression_store_test.rs line 550 at r1 (raw file):

    // These specific offsets test the case in the bug report where
    // start_pos was 65536 and end_pos was 65535
    for (offset, length) in &[

If this test were run against the old implementation, I believe the first failure would prevent the other cases from running, i.e. a failing case hides the output of the cases after it.

Since we're reusing the stores it also seems like the different cases could influence each other.


nativelink-store/tests/compression_store_test.rs line 556 at r1 (raw file):

        // Specifically test the case where offset >= block size
        (u64::from(BLOCK_SIZE), Some(20u64)),
        // Specifically test the case that caused the bug (65536 and 65535)

Let's add the specific github issue here for easier lookup later.


nativelink-store/tests/compression_store_test.rs line 564 at r1 (raw file):

        (u64::from(BLOCK_SIZE) * 2, Some(u64::from(BLOCK_SIZE) - 1)), // Same issue at next block
    ] {
        // First test with get_part_unchunked

nit: Remove redundant comment


nativelink-store/tests/compression_store_test.rs line 576 at r1 (raw file):

        let store_data = result.unwrap();

        // Verify the data matches what we expect

nit: Remove redundant comment


nativelink-store/tests/compression_store_test.rs line 606 at r1 (raw file):

        // The error was happening in this method call
        let get_part_result = store.get_part(digest, &mut tx, *offset, *length).await;

If the call above panicked, isn't this part (and the ones below it) essentially dead code that would be better as a separate test? I.e. a test for the higher-level get_part_unchunked and another one for the lower-level get_part might be good here.

Contributor

@jaroeichler jaroeichler left a comment


Reviewed 2 of 2 files at r1, all commit messages.
Reviewable status: 0 of 2 LGTMs obtained, and all files reviewed, and 9 discussions need to be resolved (waiting on @jhpratt)


nativelink-store/src/compression_store.rs line 526 at r1 (raw file):

Previously, aaronmondal (Aaron Siddhartha Mondal) wrote…

Interesting. So we were overflowing usize? In that case, would a try_from have prevented this? We haven't fixed clippy::cast_possible_truncation yet:

nativelink/.bazelrc

Lines 138 to 140 in d53363d

build --@rules_rust//:clippy_flag=-Aclippy::cast_possible_truncation
build --@rules_rust//:clippy_flag=-Aclippy::cast_possible_wrap
build --@rules_rust//:clippy_flag=-Aclippy::cast_precision_loss

In that case I'm slightly worried that while this fixes it, it masks the actual issue of the overflow, whereas a try_from with a branch to properly handle the overflow case could be more explicit.

try_from addresses a different issue though. We should instead try to avoid these conversions. Can we define remaining_bytes_to_send as usize in the first place? In general, we have these kinds of conversion issues in a lot of places, where functions related to memory sizes expect usize while we'd like to do things in u64.

For the addition, saturating_add is reasonable since this should mostly occur if length is None and then remaining_bytes_to_send is u64::MAX.
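As a small illustration of that interaction (hypothetical helper, assuming a 64-bit target; the real store code differs):

```rust
// With no requested length, the remaining counter is the sentinel u64::MAX,
// which is exactly the value that pushes the later usize addition over the edge.
fn remaining_bytes(length: Option<u64>) -> u64 {
    length.unwrap_or(u64::MAX)
}

fn main() {
    let remaining = remaining_bytes(None);
    let start_pos: usize = 65_536;
    // checked_add makes the overflow visible: the plain `+` used before the fix
    // has no room for this result.
    assert_eq!(start_pos.checked_add(remaining as usize), None);
    // saturating_add instead clamps the result to usize::MAX.
    assert_eq!(start_pos.saturating_add(remaining as usize), usize::MAX);
}
```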


nativelink-store/tests/compression_store_test.rs line 534 at r1 (raw file):

Previously, aaronmondal (Aaron Siddhartha Mondal) wrote…

Let's inline the stores as they're only needed once.

nit: If you change anything in the above part, please also adjust it in the other tests in this file.

Contributor

@aaronmondal aaronmondal left a comment


Reviewable status: 0 of 2 LGTMs obtained, and all files reviewed, and 9 discussions need to be resolved (waiting on @jaroeichler and @jhpratt)


nativelink-store/src/compression_store.rs line 526 at r1 (raw file):

Previously, jaroeichler wrote…

try_from addresses a different issue though. We should instead try to avoid these conversions. Can we define remaining_bytes_to_send as usize in the first place? In general, we have these kind of conversion issues in a lot of places, where functions related to memory sizes expect usize while we'd like to do things in u64.

For the addition, saturating_add is reasonable since this should mostly occur if length is None and then remaining_bytes_to_send is u64::MAX.

Ah yeah saturating_add solves a different issue here.

Regarding the conversions, decompress_into returns usize, so we can't move everything to u64. What we could potentially do is indeed change some of these sizes to usize.

I played around with these conversions a bit, but it's somewhat tricky and I don't think that we can get rid of "all" conversions.

I'd propose:

  1. We use this fix for now as it solves the original issue
  2. After that we bump bincode which is horribly outdated and with the 2.x version we potentially get new flexibility
  3. We attempt to reduce some of the conversions by using usize where appropriate.
  4. We consider moving from lz4 to zstd, as it has a better compression ratio (at the cost of compression speed) and other store layers can act as in-between caches. I expect disk I/O and network transfers to be the main bottleneck here, meaning that the higher compression ratio likely outweighs the increase in compression effort, but we'll have to benchmark this to be sure.

@MarcusSorealheis
Collaborator

@JannisFengler This is incredible. Thanks a lot for the fix. Could you please let us know if you need help wrapping this one up?

@JannisFengler
Author

I will update this PR soon :)

@jhpratt jhpratt removed their assignment Jun 14, 2025
Development

Successfully merging this pull request may close these issues.

Thread 'tokio-runtime-worker' panicked