Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Ability to pass arbitrary pages to SerializedPageReader #5940

Open
trungda opened this issue Jun 22, 2024 · 6 comments · May be fixed by #7087
Open

Ability to pass arbitrary pages to SerializedPageReader #5940

trungda opened this issue Jun 22, 2024 · 6 comments · May be fixed by #7087
Labels
enhancement Any new improvement worthy of a entry in the changelog

Comments

@trungda
Copy link
Contributor

trungda commented Jun 22, 2024

Is your feature request related to a problem or challenge? Please describe what you are trying to do.
We want ability to read an arbitrary page in a column chunk, SerializedPageReader has almost everything we need to. The only catch is that there is an implicit constraint that we have to pass the first data page to the page_locations argument of the new_with_properties constructor. This is fine but it makes working with this reader less ergonomic (you have to skip a page to get to the page you actually want to read).

Describe the solution you'd like

  1. Lift the implicit constraint that we have to pass the first data page to the page_locations argument. Internally, we can read page_index (if available) to infer the dictionary page size;
  2. Add an example or a test on how to use this API.

Describe alternatives you've considered

Additional context

@trungda trungda added the enhancement Any new improvement worthy of a entry in the changelog label Jun 22, 2024
@alamb
Copy link
Contributor

alamb commented Jun 24, 2024

Seems reasonable to me -- the key would be to add the API and document it sufficiently so it isn't hard

I believe this idea is similar to the APIs provided in https://github.com/jorgecarleitao/parquet2 (now unmaintained) which might be interesting to look at for inspiration

@trungda
Copy link
Contributor Author

trungda commented Jun 24, 2024

Thanks! I also learnt the hard way that data_page_offset doesn't necessarily point to the first data page (it could actually point to the dictionary page 🤷 ). The actual source of truth is in the page index.

@jhorstmann
Copy link
Contributor

I also learnt the hard way that data_page_offset doesn't necessarily point to the first data page

Is that a known bug in some specific parquet writer? Would be really unexpected since there is a separate dictionary_page_offset, which I think is guaranteed to point to before the first data page.

@trungda
Copy link
Contributor Author

trungda commented Jul 1, 2024

dictionary_page_offset

I think that there was a bug with the Java's parquet-mr impl a while back: https://stackoverflow.com/questions/55225108/why-is-dictionary-page-offset-0-for-plain-dictionary-encoding

At least for me, it wasn't easy to find this info and it was very confusing. Maybe worth putting this in a document somewhere in the the parquet writer at least for the Rust implementation? @alamb .

@jhorstmann
Copy link
Contributor

Thank you, that is good to know! I'm continuously surprised how many of these edge cases lurk in such a standardized format.

@etseidl
Copy link
Contributor

etseidl commented Feb 13, 2025

Thank you, that is good to know! I'm continuously surprised how many of these edge cases lurk in such a standardized format.

Late to this discussion, but I tripped over the same bug a few years back when trying to implement page skipping in cuDF. Not fun.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement Any new improvement worthy of a entry in the changelog
Projects
None yet
Development

Successfully merging a pull request may close this issue.

4 participants