-
Notifications
You must be signed in to change notification settings - Fork 906
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Change Parquet API interaction to use u64
(support files larger than 4GB in WASM)
#7371
Conversation
Signed-off-by: Arpit Bandejiya <[email protected]>
Signed-off-by: Arpit Bandejiya <[email protected]>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looking good. Thanks for pushing this forward. Just one little nit.
I am in the process of preparing the 55 release. Are we happy with this PR? Or should we wait until the next major breaking change? |
I'm happy to go either way on #7371 (comment) comment. I would love to get this in so that wasm32 can read >4gb files. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I hit a few APIs while testing upgrade to DataFusion
Specifically: https://docs.rs/parquet/latest/parquet/file/metadata/struct.ParquetMetaDataReader.html
And the functions like try_parse
etc (anything with a file size)
Also https://docs.rs/parquet/latest/parquet/arrow/async_reader/trait.MetadataFetch.html
I agree -- let's try and get it in! |
I'll give this a review in a few minutes |
Also |
Are you mentioning this as a suggestion for this PR or as a note for what to update in DataFusion? |
I played around with this a bit -- there are quite a few APIs that need to be updated... I'll make a PR shortly. Not sure if we want to do it at this point right before the release |
Sorry -- this was a note to myself of which APIs required the usize conversion during the datafusion upgrade |
You mean remaining in the
I'm not sure I follow what you're saying here. This is a point in favor of rolling back the change above as @etseidl suggested, since we'll never presumably need a suffix fetch of 4GB, and we only must use |
I switched the suffix type back to |
I'm happy with rolling back the change, but I think the point about the file_size parameter is correct. If we need to seek to the end of a file larger than 4gb we'll need u64 for that. Sorry I can't elaborate more now but I'm at the dentist 🦷. |
Oh yes, I just noticed that myself. I'll change that now. |
BTW I started a PR with what I think would really be required to support > 4GB files when I think it is pretty invasive and maybe not a great idea for this upcoming release |
I'm afraid I don't really have time to deep-dive on this, but I would emphasise what other have pointed out, that a blanket find & replace of usize with u64 is NOT what we should do. There is no point switching to u64 for quantities that are either already in memory, e.g. offsets into a The actual impact should be limited to file offsets |
Sigh. At least some of the places of code @alamb identified in kylebarron#57 I think are valid, at least changing Assuming that you want the release candidate to go out today I'm not sure we have time to fix this today. |
I was thinking the change also needs to impact anything that is used to compute a file offset (like lengths, for example)? |
It depends on what the length represents, if it is the length of some quantity that is expected to fit into memory, usize is perfectly valid (and arguably more correct). If it is the length of the file in its entirety then yes that needs to be I would not expect to be changing metadata size or row group size quantities to be |
I don't think we need to rush it out if we can get something ready to go Alternately, we can also postpone this set of API changes until later in the summer (e.g. arrow and take our time I don't think I'll have time to play around with it today, as I have some other things to attend to. I'll check back in tomorrow. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hoping we can still get this in. I just flagged a few places where there could be truncation going from u64
to usize
(there may be more).
I don't think I'll have time to finish this this afternoon. @etseidl you're welcome to push this over the finish line if you have time. It looks like one of my changes from |
@@ -658,13 +658,13 @@ impl ParquetMetaDataReader { | |||
|
|||
// Did not fetch the entire file metadata in the initial read, need to make a second request | |||
if length > suffix_len - FOOTER_SIZE { | |||
let metadata_start = file_size - (length - FOOTER_SIZE) as u64; | |||
let metadata_start = file_size - (length + FOOTER_SIZE) as u64; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah, yes I was going through this too quickly when adding the parentheses
@alamb please take a look now. I think I've hit all the important points from kylebarron#57. I believe the only thing left would be possibly changing the metadata length field to u32. |
u64
u64
(support files larger than 4GB in WASM)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you @kylebarron and @etseidl -- I went through this and I think it looks like a good change.
The only thing I think we need to do is revert the change to the deprecated APIs -- I will do this shortly.
I'll also test this change downstream in DataFusion and see if there are any other APIs we should update, but I think this got all the big ones.
@@ -80,10 +80,10 @@ pub use store::*; | |||
/// [`tokio::fs::File`]: https://docs.rs/tokio/latest/tokio/fs/struct.File.html | |||
pub trait AsyncFileReader: Send { | |||
/// Retrieve the bytes in `range` | |||
fn get_bytes(&mut self, range: Range<usize>) -> BoxFuture<'_, Result<Bytes>>; | |||
fn get_bytes(&mut self, range: Range<u64>) -> BoxFuture<'_, Result<Bytes>>; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
👍
@@ -325,7 +320,7 @@ mod tests { | |||
let initial_actions = num_actions.load(Ordering::Relaxed); | |||
|
|||
let reader = ParquetObjectReader::new(store, meta.location) | |||
.with_file_size(meta.size.try_into().unwrap()) | |||
.with_file_size(meta.size) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
a good example of how aligning these APIs makes things easier to understand
In my opinion, supporting metadata larger than 4GB, while admirable from a completeness point of view, is likely not a super important usecase so I think keeping the API churn down is probably a good idea. |
I think @etseidl 's point here is that since the metadata length has to fit in the last 4 bytes of the file, the Parquet spec doesn't support metadata larger than u32 anyways. |
Yes. But it does add some unnecessary thrash since that field is only ever initialized via a 4 byte array. No harm in leaving it as |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks to all involved! This one took a village :)
I tested this PR out downstream in DataFusion and it cleaned up several of the rough edges: In particular this commit shows the improvements (avoid having to translate Range --> Range in the AsyncReader traits) |
Onwards! |
Thanks again @kylebarron @etseidl and @tustvold |
Which issue does this PR close?
Closes Parquet Use U64 Instead of Usize #7238
Updated PR of #7252
Rationale for this change
What changes are included in this PR?
Are there any user-facing changes?
Yes, breaking change as the public API now uses u64 instead of usize.