ArrowReaderMetadata API makes it too easy to (accidentally) make an additional object store request #6476

alamb · 2024-09-28T10:34:00Z

Is your feature request related to a problem or challenge? Please describe what you are trying to do.

ArrowReaderMetadata is used to read Parquet files, and one major use case is to supply pre-parsed metadata (to avoid a second object store request on read) by providing the ParquetMetaData to ArrowReaderMetadata::try_new.

However, the way the API is currently set up, it is easy to supply the ParquetMetaData and have the reader STILL make 2 object store requests.

This happens if the ArrowReaderOptions has with_page_index specified but the provided metadata doesn't (yet) have the page index: the reader will then make another request to load it.

This is a common source of confusion / bugs: when someone supplies the ParquetMetaData to the ArrowReaderMetadata they are very often trying to avoid a second object store request, but as it often turns out the second fetch happens anyway to read the page index (thus defeating the attempt at optimization).

This is (in a roundabout way) what is happening to @progval in apache/datafusion#12593 and it took me a while to debug what was happening while working on the advanced_parquet_index.rs in DataFusion
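
A minimal sketch of the behavior described above may help. It uses hypothetical stand-in types (`MockStore`, `Metadata`, `Options`, `open_with_metadata`) rather than the real parquet crate API, and only models the one decision that matters here: whether supplying metadata actually avoids the extra request.

```rust
// Toy model only: every type here is a hypothetical stand-in,
// not the parquet crate's API.

/// Counts simulated object store requests.
#[derive(Default)]
struct MockStore {
    requests: usize,
}

impl MockStore {
    /// Simulates one round trip that returns page index bytes.
    fn fetch_page_index(&mut self) -> Vec<u8> {
        self.requests += 1;
        vec![0u8; 16]
    }
}

/// Stand-in for ParquetMetaData: the page index is optional.
#[derive(Default)]
struct Metadata {
    page_index: Option<Vec<u8>>,
}

/// Stand-in for ArrowReaderOptions.
#[derive(Clone, Copy, Default)]
struct Options {
    page_index: bool,
}

/// Stand-in for opening a reader with caller-supplied metadata.
/// If the page index is requested but missing, it is fetched implicitly.
fn open_with_metadata(store: &mut MockStore, mut metadata: Metadata, options: Options) -> Metadata {
    if options.page_index && metadata.page_index.is_none() {
        metadata.page_index = Some(store.fetch_page_index());
    }
    metadata
}

fn main() {
    let mut store = MockStore::default();
    let with_index = Options { page_index: true };

    // Cached metadata without a page index: supplying it still costs a request.
    let cached = Metadata::default();
    let loaded = open_with_metadata(&mut store, cached, with_index);
    assert_eq!(store.requests, 1);

    // Metadata that already carries the page index: no further request.
    let _ = open_with_metadata(&mut store, loaded, with_index);
    assert_eq!(store.requests, 1);
    println!("extra requests: {}", store.requests);
}
```

In this model the only ways to avoid the extra request are to load the page index up front or to not ask for it when opening the reader, which is the part that is easy to miss.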

Describe the solution you'd like
I would like the API to be harder to misuse.

Describe alternatives you've considered
For example, maybe we could make ArrowReaderMetadata error if it was supplied with ParquetMetaData that did not have the page indexes.

Alternatively, we could add an ArrowReaderOptions::error_if_need_metadata or something similar that would change the automatic fetch/load behavior into an error if the reader needs the page index, and the file has a page index, but it isn't yet loaded into ParquetMetaData.
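
Roughly, the second alternative could behave like the sketch below. This is only an illustration with hypothetical stand-in types (`Metadata`, `Options`, `OpenError`, `extra_fetch_needed`); `error_if_need_metadata` does not exist in the parquet crate, and the sketch does not model the check for whether the file actually has a page index.

```rust
// Toy model only: hypothetical stand-ins, not the parquet crate's API.

#[derive(Debug, PartialEq)]
enum OpenError {
    /// The page index was requested but the supplied metadata lacks it.
    PageIndexNotLoaded,
}

/// Stand-in for ParquetMetaData.
#[derive(Default)]
struct Metadata {
    page_index: Option<Vec<u8>>,
}

/// Stand-in for ArrowReaderOptions.
#[derive(Clone, Copy, Default)]
struct Options {
    page_index: bool,
    /// Models the proposed `error_if_need_metadata` switch.
    error_if_need_metadata: bool,
}

/// Returns Ok(true) if an implicit extra fetch would happen, Ok(false) if
/// none is needed, and an error when strict mode forbids the implicit fetch.
fn extra_fetch_needed(metadata: &Metadata, options: Options) -> Result<bool, OpenError> {
    let missing = options.page_index && metadata.page_index.is_none();
    if missing && options.error_if_need_metadata {
        return Err(OpenError::PageIndexNotLoaded);
    }
    Ok(missing)
}

fn main() {
    let cached = Metadata::default();
    let lenient = Options { page_index: true, error_if_need_metadata: false };
    let strict = Options { page_index: true, error_if_need_metadata: true };

    // Current behavior: the fetch happens silently.
    assert_eq!(extra_fetch_needed(&cached, lenient), Ok(true));
    // Proposed behavior: the caller finds out immediately.
    assert_eq!(extra_fetch_needed(&cached, strict), Err(OpenError::PageIndexNotLoaded));

    // Metadata that already has the page index is fine in either mode.
    let complete = Metadata { page_index: Some(vec![1, 2, 3]) };
    assert_eq!(extra_fetch_needed(&complete, strict), Ok(false));
}
```

The point of the strict mode is that the silent extra request becomes a visible error, so a caller who thought the metadata was complete finds out at open time.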

Additional context


doki23 · 2024-10-07T13:44:55Z

I propose another approach: we could cache the page index upon its initial load ( make user provided metadata mutable), as the current API seems satisfactory to me.

alamb · 2024-10-08T14:56:43Z

> I propose another approach: we could cache the page index upon its initial load ( make user provided metadata mutable), as the current API seems satisfactory to me.

I am not quite sure what you mean by this -- depending on how you set up the initial load, the page index will be read. The issue I was trying to describe is that if you don't read the page index in the initial load, a subsequent opening of the file, even if you supply ParquetMetaData, will force another fetch of metadata (the page index) prior to reading the data.

doki23 · 2024-10-09T02:36:41Z

> > I propose another approach: we could cache the page index upon its initial load ( make user provided metadata mutable), as the current API seems satisfactory to me.
>
> I am not quite sure what you mean by this -- depending on how you setup the initial load, the page index will be read. The issue I was trying to describe is if you don't read the page index in the initial load, a subsequent opening of the file even if you supply ParquetMetaData, will force another fetch of metadata (page index) prior to reading the data

Thank you, got it.

etseidl · 2025-03-31T20:41:26Z

@adamreeve recently jogged my memory of this issue (#7342 (comment)). I think the recent changes to the API (#6637, #7334, #7342) make fixing this finally possible. I'll take a stab at it this week.

alamb · 2025-03-31T23:59:13Z

Thanks @etseidl -- that would be great

alamb added the enhancement label (Any new improvement worthy of an entry in the changelog) Sep 28, 2024

alamb mentioned this issue Sep 28, 2024

Deprecate MetadataLoader #6474

Merged

etseidl mentioned this issue Apr 1, 2025

Clean up ArrowReaderMetadata::load_async #7369

Merged

alamb closed this as completed in #7369 Apr 4, 2025
