
Change Parquet API interaction to use u64 (support files larger than 4GB in WASM) #7371

Merged Apr 8, 2025 (18 commits)

Conversation

@kylebarron (Contributor) commented Apr 1, 2025

Which issue does this PR close?

Closes #7238: Parquet Use U64 Instead of Usize

Updated PR of #7252

Rationale for this change

What changes are included in this PR?

Are there any user-facing changes?

Yes, breaking change as the public API now uses u64 instead of usize.

@github-actions bot added the "parquet" (Changes to the parquet crate) label Apr 1, 2025
@etseidl (Contributor) left a comment

Looking good. Thanks for pushing this forward. Just one little nit.

@alamb requested a review from tustvold on April 3, 2025
@alamb (Contributor) commented Apr 3, 2025

@tustvold since you requested the work in #7238 perhaps you have time to review this PR quickly to see if it does what you expect

@alamb (Contributor) commented Apr 7, 2025

I am in the process of preparing the 55 release. Are we happy with this PR? Or should we wait until the next major breaking change?

@kylebarron (Contributor, Author) commented:

I'm happy to go either way on #7371 (comment).

I would love to get this in so that wasm32 can read >4gb files.
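For context (my illustration, not from the thread): on wasm32, usize is 32 bits, so a byte offset past 4 GiB cannot be represented as usize at all, which is why the public API needs u64 offsets. A quick sketch of the limit:

```rust
// Sketch of why u64 offsets are needed on 32-bit targets such as
// wasm32, where usize is 32 bits. The numbers are illustrative.
fn fits_in_u32(offset: u64) -> bool {
    u32::try_from(offset).is_ok()
}

fn main() {
    let five_gib: u64 = 5 * 1024 * 1024 * 1024; // an offset into a 5 GiB file
    assert!(!fits_in_u32(five_gib)); // cannot be a 32-bit usize
    assert!(fits_in_u32(u32::MAX as u64)); // 4 GiB - 1 still fits
}
```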

@alamb (Contributor) left a comment

@alamb (Contributor) commented Apr 7, 2025

> I'm happy to go either way on #7371 (comment).
>
> I would love to get this in so that wasm32 can read >4gb files.

I agree -- let's try and get it in!

@alamb added the "api-change" (Changes to the arrow API) label Apr 7, 2025
@alamb (Contributor) commented Apr 7, 2025

I'll give this a review in a few minutes

@alamb (Contributor) commented Apr 7, 2025

Also AsyncFileReader

@kylebarron (Contributor, Author) commented:

> Also AsyncFileReader

Are you mentioning this as a suggestion for this PR or as a note for what to update in DataFusion?

@alamb (Contributor) commented Apr 7, 2025

I played around with this a bit -- there are quite a few APIs that need to be updated... I'll make a PR shortly. Not sure if we want to do it at this point right before the release

@alamb (Contributor) commented Apr 7, 2025

> Also AsyncFileReader
>
> Are you mentioning this as a suggestion for this PR or as a note for what to update in DataFusion?

Sorry -- this was a note to myself of which APIs required the usize conversion during the datafusion upgrade

@kylebarron (Contributor, Author) commented:

> there are quite a few APIs that need to be updated

You mean remaining in the

> And the functions like try_parse etc (anything with a file size)

I'm not sure I follow what you're saying here. ParquetMetaDataReader::try_parse uses ChunkReader, and that already uses u64 (in parquet 54). In particular, ChunkReader::get_read and ChunkReader::get_bytes already use u64 for the start parameter. I don't think the length parameter needs to use u64, because we'll never (I assume) be making individual reads of >4 GB; we just need to know relative locations in a file that could be >4 GB.

This is a point in favor of rolling back the change above, as @etseidl suggested: we'll presumably never need a suffix fetch of 4 GB, and we only need u64 for offsets, which also reduces the amount of changes here.
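The offset/length split described here can be sketched with a toy trait (mine, not the real parquet ChunkReader, whose definition has additional bounds and methods): absolute positions are u64 because the file may exceed 4 GB, while each read's length stays usize because the resulting buffer must fit in memory anyway.

```rust
use std::io::{Cursor, Read, Result};

// Toy version of the offset/length split discussed above (not the
// real parquet ChunkReader): start offsets are u64, per-read
// lengths are usize.
trait OffsetRead {
    fn get_bytes(&self, start: u64, length: usize) -> Result<Vec<u8>>;
}

impl OffsetRead for Vec<u8> {
    fn get_bytes(&self, start: u64, length: usize) -> Result<Vec<u8>> {
        let mut cursor = Cursor::new(self.as_slice());
        cursor.set_position(start); // u64 position, no 32-bit limit
        let mut buf = vec![0u8; length]; // buffer length is usize by nature
        cursor.read_exact(&mut buf)?;
        Ok(buf)
    }
}

fn main() {
    let data: Vec<u8> = (0u8..16).collect();
    assert_eq!(data.get_bytes(4, 3).unwrap(), vec![4, 5, 6]);
}
```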

@kylebarron (Contributor, Author) commented:

> This is a point in favor of rolling back the change above, as @etseidl suggested: we'll presumably never need a suffix fetch of 4 GB, and we only need u64 for offsets, which also reduces the amount of changes here.

I switched the suffix type back to usize as I agree we'll never be making suffix requests for larger than 2^32 bytes, and we should minimize the diff here for easiest adoption.

@etseidl (Contributor) commented Apr 7, 2025

I'm happy with rolling back the change, but I think the point about the file_size parameter is correct. If we need to seek to the end of a file larger than 4 GB, we'll need u64 for that.

Sorry I can't elaborate more now but I'm at the dentist 🦷.

@kylebarron (Contributor, Author) commented:

> I'm happy with rolling back the change, but I think the point about the file_size parameter is correct. If we need to seek to the end of a file larger than 4 GB, we'll need u64 for that.

Oh yes, I just noticed that myself. I'll change that now.

@alamb (Contributor) commented Apr 7, 2025

BTW I started a PR with what I think would really be required to support > 4GB files when usize is 32 bits:

I think it is pretty invasive and maybe not a great idea for this upcoming release

@tustvold (Contributor) commented Apr 7, 2025

I'm afraid I don't really have time to deep-dive on this, but I would emphasise what others have pointed out: a blanket find & replace of usize with u64 is NOT what we should do. There is no point switching to u64 for quantities that are either already in memory (e.g. offsets into a Bytes) or are assumed to fit in memory (e.g. the metadata footer, a column chunk, etc.).

The actual impact should be limited to file offsets

@kylebarron (Contributor, Author) commented:

Sigh. At least some of the places in the code that @alamb identified in kylebarron#57 I think are valid, at least changing file_size to u64 and changing column_index_range to be a Range<u64>.

Assuming that you want the release candidate to go out today I'm not sure we have time to fix this today.

@alamb (Contributor) commented Apr 7, 2025

> The actual impact should be limited to file offsets

I was thinking the change also needs to impact anything that is used to compute a file offset (like lengths, for example)?

@tustvold (Contributor) commented Apr 7, 2025

> I was thinking the change also needs to impact anything that is used to compute a file offset (like lengths, for example)?

It depends on what the length represents, if it is the length of some quantity that is expected to fit into memory, usize is perfectly valid (and arguably more correct). If it is the length of the file in its entirety then yes that needs to be u64.

I would not expect to be changing metadata size or row group size quantities to be u64, for example. Ultimately if you have an API to fetch n bytes of data, making that a u64 only potentially changes an integer out of range error to an out of memory error.
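One way to make that failure mode explicit (my sketch, not code from the PR) is a checked u64-to-usize conversion at the point where bytes are about to be materialized, so an out-of-range length on a 32-bit target becomes a clear error rather than a silent truncation:

```rust
// Illustrative only: narrow a u64 read length to usize explicitly.
// On a 32-bit target a length above usize::MAX fails fast here;
// making the parameter u64 instead would only trade this error for
// an allocation failure later.
fn checked_read_len(len: u64) -> Result<usize, String> {
    usize::try_from(len).map_err(|_| format!("read of {len} bytes cannot fit in memory"))
}

fn main() {
    assert_eq!(checked_read_len(4096), Ok(4096));
    // On wasm32 (32-bit usize) this call would return Err; on a
    // 64-bit host it succeeds, so no assertion is made on it here.
    let _ = checked_read_len(5 * 1024 * 1024 * 1024);
}
```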

@alamb (Contributor) commented Apr 7, 2025

> Assuming that you want the release candidate to go out today I'm not sure we have time to fix this today.

I don't think we need to rush it out if we can get something ready to go

Alternately, we can also postpone this set of API changes until later in the summer (e.g. the next major arrow release) and take our time.

I don't think I'll have time to play around with it today, as I have some other things to attend to. I'll check back in tomorrow.

@etseidl (Contributor) left a comment

Hoping we can still get this in. I just flagged a few places where there could be truncation going from u64 to usize (there may be more).

@kylebarron (Contributor, Author) commented:

I don't think I'll have time to finish this this afternoon. @etseidl you're welcome to push this over the finish line if you have time. It looks like one of my changes from 6eb9484 (#7371) broke some tests too

```diff
@@ -658,13 +658,13 @@ impl ParquetMetaDataReader {

         // Did not fetch the entire file metadata in the initial read, need to make a second request
         if length > suffix_len - FOOTER_SIZE {
-            let metadata_start = file_size - (length - FOOTER_SIZE) as u64;
+            let metadata_start = file_size - (length + FOOTER_SIZE) as u64;
```
@kylebarron (Contributor, Author) commented:

Ah, yes I was going through this too quickly when adding the parentheses
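The corrected arithmetic can be sanity-checked with small numbers (my sketch; FOOTER_SIZE is assumed to be 8, the 4-byte metadata length plus the 4-byte "PAR1" magic): the metadata ends FOOTER_SIZE bytes before the end of the file, so it starts at file_size - (length + FOOTER_SIZE), not file_size - (length - FOOTER_SIZE).

```rust
const FOOTER_SIZE: u64 = 8; // 4-byte metadata length + 4-byte "PAR1" magic

// The metadata block sits immediately before the footer, so its
// start is (metadata length + FOOTER_SIZE) back from the end of file.
fn metadata_start(file_size: u64, metadata_len: u64) -> u64 {
    file_size - (metadata_len + FOOTER_SIZE)
}

fn main() {
    // 100-byte file with 20 bytes of metadata: metadata occupies
    // bytes 72..92 and the footer occupies bytes 92..100.
    assert_eq!(metadata_start(100, 20), 72);
    // The buggy form, file_size - (len - FOOTER_SIZE), gives 88 and
    // would start the read in the middle of the metadata.
    assert_eq!(100 - (20 - FOOTER_SIZE), 88);
}
```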

@etseidl (Contributor) commented Apr 8, 2025

@alamb please take a look now. I think I've hit all the important points from kylebarron#57. I believe the only thing left would be possibly changing the metadata length field to u32.

@alamb changed the title from "Change Parquet API interaction to use u64" to "Change Parquet API interaction to use u64 (support files larger than 4GB in WASM)" Apr 8, 2025
@alamb (Contributor) left a comment

Thank you @kylebarron and @etseidl -- I went through this and I think it looks like a good change.

The only thing I think we need to do is revert the change to the deprecated APIs -- I will do this shortly.

I'll also test this change downstream in DataFusion and see if there are any other APIs we should update, but I think this got all the big ones.

```diff
@@ -80,10 +80,10 @@ pub use store::*;
 /// [`tokio::fs::File`]: https://docs.rs/tokio/latest/tokio/fs/struct.File.html
 pub trait AsyncFileReader: Send {
     /// Retrieve the bytes in `range`
-    fn get_bytes(&mut self, range: Range<usize>) -> BoxFuture<'_, Result<Bytes>>;
+    fn get_bytes(&mut self, range: Range<u64>) -> BoxFuture<'_, Result<Bytes>>;
```
👍
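A toy illustration (mine, not the crate's code) of what the new signature implies for implementers: the range arrives in u64 file offsets, and the fallible narrowing to usize happens only at the point where in-memory bytes are sliced.

```rust
use std::ops::Range;

// Toy stand-in for an AsyncFileReader-style get_bytes over an
// in-memory buffer: the public range is u64, and the narrowing to
// usize is confined to the slicing boundary.
fn get_bytes(data: &[u8], range: Range<u64>) -> Option<Vec<u8>> {
    let start = usize::try_from(range.start).ok()?;
    let end = usize::try_from(range.end).ok()?;
    data.get(start..end).map(|slice| slice.to_vec())
}

fn main() {
    let data = b"PAR1 metadata PAR1";
    assert_eq!(get_bytes(data, 0..4), Some(b"PAR1".to_vec()));
    assert_eq!(get_bytes(data, 0..u64::MAX), None); // out of range, no panic
}
```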

```diff
@@ -325,7 +320,7 @@ mod tests {
     let initial_actions = num_actions.load(Ordering::Relaxed);

     let reader = ParquetObjectReader::new(store, meta.location)
-        .with_file_size(meta.size.try_into().unwrap())
+        .with_file_size(meta.size)
```
a good example of how aligning these APIs makes things easier to understand

@alamb (Contributor) commented Apr 8, 2025

> @alamb please take a look now. I think I've hit all the important points from kylebarron#57. I believe the only thing left would be possibly changing the metadata length field to u32.

In my opinion, supporting metadata larger than 4GB, while admirable from a completeness point of view, is likely not a super important usecase so I think keeping the API churn down is probably a good idea.

@kylebarron (Contributor, Author) commented:

> In my opinion, supporting metadata larger than 4GB, while admirable from a completeness point of view, is likely not a super important usecase so I think keeping the API churn down is probably a good idea.

I think @etseidl 's point here is that since the metadata length has to fit in the last 4 bytes of the file, the Parquet spec doesn't support metadata larger than u32 anyways.

@etseidl (Contributor) commented Apr 8, 2025

> In my opinion, supporting metadata larger than 4GB, while admirable from a completeness point of view, is likely not a super important usecase so I think keeping the API churn down is probably a good idea.

> I think @etseidl 's point here is that since the metadata length has to fit in the last 4 bytes of the file, the Parquet spec doesn't support metadata larger than u32 anyways.

Yes. But it does add some unnecessary thrash, since that field is only ever initialized via a 4-byte array. No harm in leaving it as usize.
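For reference, the constraint being discussed comes from the Parquet footer layout: the last 8 bytes of a file are a 4-byte little-endian metadata length followed by the magic "PAR1", so the metadata length is bounded by u32::MAX regardless of file size. A minimal sketch of parsing that footer:

```rust
// Parse the 8-byte Parquet footer: a little-endian u32 metadata
// length followed by the 4-byte magic "PAR1". Because the length is
// stored in 4 bytes, metadata can never exceed u32::MAX bytes.
fn parse_footer(footer: &[u8; 8]) -> Result<u32, &'static str> {
    if &footer[4..] != b"PAR1" {
        return Err("bad magic");
    }
    Ok(u32::from_le_bytes(footer[..4].try_into().unwrap()))
}

fn main() {
    let mut footer = [0u8; 8];
    footer[..4].copy_from_slice(&1024u32.to_le_bytes());
    footer[4..].copy_from_slice(b"PAR1");
    assert_eq!(parse_footer(&footer), Ok(1024));
}
```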

@etseidl (Contributor) left a comment

Thanks to all involved! This one took a village :)

@alamb (Contributor) commented Apr 8, 2025

I tested this PR out downstream in DataFusion and it cleaned up several of the rough edges:

In particular this commit shows the improvements (avoiding the need to translate Range<usize> --> Range<u64> in the AsyncReader traits)

@alamb (Contributor) commented Apr 8, 2025

Onwards!

@alamb (Contributor) commented Apr 8, 2025

Thanks again @kylebarron @etseidl and @tustvold

@alamb merged commit 474f192 into apache:main Apr 8, 2025
16 checks passed
Labels: api-change (Changes to the arrow API), parquet (Changes to the parquet crate)

Linked issue: Parquet Use U64 Instead of Usize (wasm support for files greater than 4GB)