
Change Parquet API interaction to use u64 (support files larger than 4GB in WASM) #7371

Merged Apr 8, 2025 (18 commits)

Conversation

@kylebarron (Contributor) commented Apr 1, 2025

Which issue does this PR close?

Closes #7238: Parquet Use U64 Instead of Usize

Updated PR of #7252

Rationale for this change

What changes are included in this PR?

Are there any user-facing changes?

Yes, breaking change as the public API now uses u64 instead of usize.

@github-actions bot added the "parquet" (Changes to the parquet crate) label Apr 1, 2025
@etseidl (Contributor) left a comment

Looking good. Thanks for pushing this forward. Just one little nit.

@alamb requested a review from tustvold on April 3, 2025
@alamb (Contributor) commented Apr 3, 2025

@tustvold since you requested the work in #7238 perhaps you have time to review this PR quickly to see if it does what you expect

@alamb (Contributor) commented Apr 7, 2025

I am in the process of preparing the 55 release. Are we happy with this PR? Or should we wait until the next major breaking change?

@kylebarron (Contributor, Author) commented:

I'm happy to go either way on #7371 (comment).

I would love to get this in so that wasm32 can read >4gb files.
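For context (my illustration, not from the thread): on wasm32, usize is 32 bits, so a byte offset past 4 GiB cannot be represented as usize at all, which is why the public API needs u64 offsets. A quick sketch of the limit:

```rust
// Sketch of why u64 offsets are needed on 32-bit targets such as
// wasm32, where usize is 32 bits. The numbers are illustrative.
fn fits_in_u32(offset: u64) -> bool {
    u32::try_from(offset).is_ok()
}

fn main() {
    let five_gib: u64 = 5 * 1024 * 1024 * 1024; // an offset into a 5 GiB file
    assert!(!fits_in_u32(five_gib)); // cannot be a 32-bit usize
    assert!(fits_in_u32(u32::MAX as u64)); // 4 GiB - 1 still fits
}
```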

@alamb (Contributor) left a comment

@alamb (Contributor) commented Apr 7, 2025

> I'm happy to go either way on #7371 (comment).
>
> I would love to get this in so that wasm32 can read >4gb files.

I agree -- let's try and get it in!

@alamb added the "api-change" (Changes to the arrow API) label Apr 7, 2025
@alamb (Contributor) commented Apr 7, 2025

I'll give this a review in a few minutes

@alamb (Contributor) commented Apr 7, 2025

Also AsyncFileReader

@kylebarron (Contributor, Author) commented:

> Also AsyncFileReader

Are you mentioning this as a suggestion for this PR or as a note for what to update in DataFusion?

@alamb (Contributor) commented Apr 7, 2025

I played around with this a bit -- there are quite a few APIs that need to be updated... I'll make a PR shortly. Not sure if we want to do it at this point right before the release

@alamb (Contributor) commented Apr 7, 2025

> Also AsyncFileReader
>
> Are you mentioning this as a suggestion for this PR or as a note for what to update in DataFusion?

Sorry -- this was a note to myself of which APIs required the usize conversion during the datafusion upgrade

@kylebarron (Contributor, Author) commented:

> there are quite a few APIs that need to be updated

You mean remaining in the

> And the functions like try_parse etc (anything with a file size)

I'm not sure I follow what you're saying here. ParquetMetaDataReader::try_parse uses ChunkReader, and that already uses u64 (in parquet 54). In particular, ChunkReader::get_read and ChunkReader::get_bytes already use u64 for the start parameter. I don't think the length parameter needs to use u64, because we'll never (I assume) be making individual reads of >4 GB; we just need to know relative locations in a file that could be >4 GB.

This is a point in favor of rolling back the change above, as @etseidl suggested: we'll presumably never need a suffix fetch of 4 GB, and we only need u64 for offsets, which also reduces the amount of changes here.
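The offset/length split described here can be sketched with a toy trait (mine, not the real parquet ChunkReader, whose definition has additional bounds and methods): absolute positions are u64 because the file may exceed 4 GB, while each read's length stays usize because the resulting buffer must fit in memory anyway.

```rust
use std::io::{Cursor, Read, Result};

// Toy version of the offset/length split discussed above (not the
// real parquet ChunkReader): start offsets are u64, per-read
// lengths are usize.
trait OffsetRead {
    fn get_bytes(&self, start: u64, length: usize) -> Result<Vec<u8>>;
}

impl OffsetRead for Vec<u8> {
    fn get_bytes(&self, start: u64, length: usize) -> Result<Vec<u8>> {
        let mut cursor = Cursor::new(self.as_slice());
        cursor.set_position(start); // u64 position, no 32-bit limit
        let mut buf = vec![0u8; length]; // buffer length is usize by nature
        cursor.read_exact(&mut buf)?;
        Ok(buf)
    }
}

fn main() {
    let data: Vec<u8> = (0u8..16).collect();
    assert_eq!(data.get_bytes(4, 3).unwrap(), vec![4, 5, 6]);
}
```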

@kylebarron (Contributor, Author) commented:

> This is a point in favor of rolling back the change above, as @etseidl suggested: we'll presumably never need a suffix fetch of 4 GB, and we only need u64 for offsets, which also reduces the amount of changes here.

I switched the suffix type back to usize as I agree we'll never be making suffix requests for larger than 2^32 bytes, and we should minimize the diff here for easiest adoption.

@etseidl (Contributor) commented Apr 7, 2025

I'm happy with rolling back the change, but I think the point about the file_size parameter is correct. If we need to seek to the end of a file larger than 4 GB, we'll need u64 for that.

Sorry I can't elaborate more now but I'm at the dentist 🦷.

@kylebarron (Contributor, Author) commented:

> I'm happy with rolling back the change, but I think the point about the file_size parameter is correct. If we need to seek to the end of a file larger than 4 GB, we'll need u64 for that.

Oh yes, I just noticed that myself. I'll change that now.

@alamb (Contributor) commented Apr 7, 2025

BTW I started a PR with what I think would really be required to support > 4GB files when usize is 32 bits:

I think it is pretty invasive and maybe not a great idea for this upcoming release

@tustvold (Contributor) commented Apr 7, 2025

I'm afraid I don't really have time to deep-dive on this, but I would emphasise what others have pointed out: a blanket find & replace of usize with u64 is NOT what we should do. There is no point switching to u64 for quantities that are either already in memory (e.g. offsets into a Bytes) or are assumed to fit in memory (e.g. the metadata footer, a column chunk, etc.).

The actual impact should be limited to file offsets

@kylebarron (Contributor, Author) commented:

Sigh. At least some of the places in the code that @alamb identified in kylebarron#57 I think are valid, at least changing file_size to u64 and changing column_index_range to be a Range<u64>.

Assuming that you want the release candidate to go out today I'm not sure we have time to fix this today.

@alamb (Contributor) commented Apr 7, 2025

> The actual impact should be limited to file offsets

I was thinking the change also needs to impact anything that is used to compute a file offset (like lengths, for example)?

@tustvold (Contributor) commented Apr 7, 2025

> I was thinking the change also needs to impact anything that is used to compute a file offset (like lengths, for example)?

It depends on what the length represents, if it is the length of some quantity that is expected to fit into memory, usize is perfectly valid (and arguably more correct). If it is the length of the file in its entirety then yes that needs to be u64.

I would not expect to be changing metadata size or row group size quantities to be u64, for example. Ultimately if you have an API to fetch n bytes of data, making that a u64 only potentially changes an integer out of range error to an out of memory error.
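One way to make that failure mode explicit (my sketch, not code from the PR) is a checked u64-to-usize conversion at the point where bytes are about to be materialized, so an out-of-range length on a 32-bit target becomes a clear error rather than a silent truncation:

```rust
// Illustrative only: narrow a u64 read length to usize explicitly.
// On a 32-bit target a length above usize::MAX fails fast here;
// making the parameter u64 instead would only trade this error for
// an allocation failure later.
fn checked_read_len(len: u64) -> Result<usize, String> {
    usize::try_from(len).map_err(|_| format!("read of {len} bytes cannot fit in memory"))
}

fn main() {
    assert_eq!(checked_read_len(4096), Ok(4096));
    // On wasm32 (32-bit usize) this call would return Err; on a
    // 64-bit host it succeeds, so no assertion is made on it here.
    let _ = checked_read_len(5 * 1024 * 1024 * 1024);
}
```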

@alamb (Contributor) commented Apr 7, 2025

> Assuming that you want the release candidate to go out today I'm not sure we have time to fix this today.

I don't think we need to rush it out if we can get something ready to go

Alternately, we can also postpone this set of API changes until later in the summer (e.g. the next major arrow release) and take our time.

I don't think I'll have time to play around with it today, as I have some other things to attend to. I'll check back in tomorrow.

@etseidl (Contributor) left a comment

Hoping we can still get this in. I just flagged a few places where there could be truncation going from u64 to usize (there may be more).

@kylebarron (Contributor, Author) commented:

I don't think I'll have time to finish this this afternoon. @etseidl you're welcome to push this over the finish line if you have time. It looks like one of my changes from 6eb9484 (#7371) broke some tests too

```diff
@@ -658,13 +658,13 @@ impl ParquetMetaDataReader {

         // Did not fetch the entire file metadata in the initial read, need to make a second request
         if length > suffix_len - FOOTER_SIZE {
-            let metadata_start = file_size - (length - FOOTER_SIZE) as u64;
+            let metadata_start = file_size - (length + FOOTER_SIZE) as u64;
```
@kylebarron (Contributor, Author) commented:

Ah, yes I was going through this too quickly when adding the parentheses
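The corrected arithmetic can be sanity-checked with small numbers (my sketch; FOOTER_SIZE is assumed to be 8, the 4-byte metadata length plus the 4-byte "PAR1" magic): the metadata ends FOOTER_SIZE bytes before the end of the file, so it starts at file_size - (length + FOOTER_SIZE), not file_size - (length - FOOTER_SIZE).

```rust
const FOOTER_SIZE: u64 = 8; // 4-byte metadata length + 4-byte "PAR1" magic

// The metadata block sits immediately before the footer, so its
// start is (metadata length + FOOTER_SIZE) back from the end of file.
fn metadata_start(file_size: u64, metadata_len: u64) -> u64 {
    file_size - (metadata_len + FOOTER_SIZE)
}

fn main() {
    // 100-byte file with 20 bytes of metadata: metadata occupies
    // bytes 72..92 and the footer occupies bytes 92..100.
    assert_eq!(metadata_start(100, 20), 72);
    // The buggy form, file_size - (len - FOOTER_SIZE), gives 88 and
    // would start the read in the middle of the metadata.
    assert_eq!(100 - (20 - FOOTER_SIZE), 88);
}
```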

@etseidl (Contributor) commented Apr 8, 2025

@alamb please take a look now. I think I've hit all the important points from kylebarron#57. I believe the only thing left would be possibly changing the metadata length field to u32.

@alamb changed the title from "Change Parquet API interaction to use u64" to "Change Parquet API interaction to use u64 (support files larger than 4GB in WASM)" Apr 8, 2025
@alamb (Contributor) left a comment

Thank you @kylebarron and @etseidl -- I went through this and I think it looks like a good change.

The only thing I think we need to do is revert the change to the deprecated APIs -- I will do this shortly.

I'll also test this change downstream in DataFusion and see if there are any other APIs we should update, but I think this got all the big ones.

```diff
@@ -80,10 +80,10 @@ pub use store::*;
 /// [`tokio::fs::File`]: https://docs.rs/tokio/latest/tokio/fs/struct.File.html
 pub trait AsyncFileReader: Send {
     /// Retrieve the bytes in `range`
-    fn get_bytes(&mut self, range: Range<usize>) -> BoxFuture<'_, Result<Bytes>>;
+    fn get_bytes(&mut self, range: Range<u64>) -> BoxFuture<'_, Result<Bytes>>;
```
👍
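A toy illustration (mine, not the crate's code) of what the new signature implies for implementers: the range arrives in u64 file offsets, and the fallible narrowing to usize happens only at the point where in-memory bytes are sliced.

```rust
use std::ops::Range;

// Toy stand-in for an AsyncFileReader-style get_bytes over an
// in-memory buffer: the public range is u64, and the narrowing to
// usize is confined to the slicing boundary.
fn get_bytes(data: &[u8], range: Range<u64>) -> Option<Vec<u8>> {
    let start = usize::try_from(range.start).ok()?;
    let end = usize::try_from(range.end).ok()?;
    data.get(start..end).map(|slice| slice.to_vec())
}

fn main() {
    let data = b"PAR1 metadata PAR1";
    assert_eq!(get_bytes(data, 0..4), Some(b"PAR1".to_vec()));
    assert_eq!(get_bytes(data, 0..u64::MAX), None); // out of range, no panic
}
```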

```diff
@@ -325,7 +320,7 @@ mod tests {
     let initial_actions = num_actions.load(Ordering::Relaxed);

     let reader = ParquetObjectReader::new(store, meta.location)
-        .with_file_size(meta.size.try_into().unwrap())
+        .with_file_size(meta.size)
```
a good example of how aligning these APIs makes things easier to understand

@alamb (Contributor) commented Apr 8, 2025

> @alamb please take a look now. I think I've hit all the important points from kylebarron#57. I believe the only thing left would be possibly changing the metadata length field to u32.

In my opinion, supporting metadata larger than 4GB, while admirable from a completeness point of view, is likely not a super important usecase so I think keeping the API churn down is probably a good idea.

@kylebarron (Contributor, Author) commented:

> In my opinion, supporting metadata larger than 4GB, while admirable from a completeness point of view, is likely not a super important usecase so I think keeping the API churn down is probably a good idea.

I think @etseidl 's point here is that since the metadata length has to fit in the last 4 bytes of the file, the Parquet spec doesn't support metadata larger than u32 anyways.

@etseidl (Contributor) commented Apr 8, 2025

> In my opinion, supporting metadata larger than 4GB, while admirable from a completeness point of view, is likely not a super important usecase so I think keeping the API churn down is probably a good idea.

> I think @etseidl 's point here is that since the metadata length has to fit in the last 4 bytes of the file, the Parquet spec doesn't support metadata larger than u32 anyways.

Yes. But it does add some unnecessary thrash, since that field is only ever initialized via a 4-byte array. No harm in leaving it as usize.
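For reference, the constraint being discussed comes from the Parquet footer layout: the last 8 bytes of a file are a 4-byte little-endian metadata length followed by the magic "PAR1", so the metadata length is bounded by u32::MAX regardless of file size. A minimal sketch of parsing that footer:

```rust
// Parse the 8-byte Parquet footer: a little-endian u32 metadata
// length followed by the 4-byte magic "PAR1". Because the length is
// stored in 4 bytes, metadata can never exceed u32::MAX bytes.
fn parse_footer(footer: &[u8; 8]) -> Result<u32, &'static str> {
    if &footer[4..] != b"PAR1" {
        return Err("bad magic");
    }
    Ok(u32::from_le_bytes(footer[..4].try_into().unwrap()))
}

fn main() {
    let mut footer = [0u8; 8];
    footer[..4].copy_from_slice(&1024u32.to_le_bytes());
    footer[4..].copy_from_slice(b"PAR1");
    assert_eq!(parse_footer(&footer), Ok(1024));
}
```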

@etseidl (Contributor) left a comment

Thanks to all involved! This one took a village :)

@alamb (Contributor) commented Apr 8, 2025

I tested this PR out downstream in DataFusion and it cleaned up several of the rough edges:

In particular this commit shows the improvements (avoiding the need to translate Range<usize> --> Range<u64> in the AsyncReader traits)

@alamb (Contributor) commented Apr 8, 2025

Onwards!

@alamb (Contributor) commented Apr 8, 2025

Thanks again @kylebarron @etseidl and @tustvold

@alamb merged commit 474f192 into apache:main Apr 8, 2025
16 checks passed
Labels: api-change (Changes to the arrow API), parquet (Changes to the parquet crate)

Linked issue: Parquet Use U64 Instead of Usize (wasm support for files greater than 4GB)