Ability to pass arbitrary pages to SerializedPageReader #5940
Comments
Seems reasonable to me. The key would be to add the API and document it sufficiently so it isn't hard. I believe this idea is similar to the APIs provided in https://github.com/jorgecarleitao/parquet2 (now unmaintained), which might be interesting to look at for inspiration.
Thanks! I also learnt the hard way that |
Is that a known bug in some specific parquet writer? Would be really unexpected since there is a separate |
I think there was a bug in Java's parquet-mr implementation a while back: https://stackoverflow.com/questions/55225108/why-is-dictionary-page-offset-0-for-plain-dictionary-encoding At least for me, this info wasn't easy to find, and it was very confusing. Maybe it's worth documenting this somewhere in the parquet writer, at least for the Rust implementation? @alamb
Thank you, that is good to know! I'm continuously surprised how many of these edge cases lurk in such a standardized format.
Late to this discussion, but I tripped over the same bug a few years back when trying to implement page skipping in cuDF. Not fun.
Is your feature request related to a problem or challenge? Please describe what you are trying to do.
We want the ability to read an arbitrary page in a column chunk. SerializedPageReader has almost everything we need. The only catch is an implicit constraint: we have to pass the first data page to the page_locations argument of the new_with_properties constructor. This is fine, but it makes working with this reader less ergonomic (you have to skip a page to get to the page you actually want to read).
Describe the solution you'd like
page_locations argument. Internally, we can read page_index (if available) to infer the dictionary page size.
Describe alternatives you've considered
Additional context
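To make the request concrete, here is a minimal sketch of how a reader could infer the dictionary page's extent from the page index, so that the caller is free to ask for any data page. It deliberately does not use the parquet crate: PageLoc, ChunkMeta, chunk_start, dictionary_range, and ranges_to_fetch are hypothetical names invented for illustration, and the field layout is an assumption. Treating a dictionary page offset of 0 as "not recorded" is likewise an assumption, based on the quirk discussed in the comments.

```rust
use std::ops::Range;

/// Hypothetical stand-in for one page index entry (not the parquet crate's type).
#[derive(Debug, Clone, Copy, PartialEq)]
struct PageLoc {
    offset: u64,          // byte offset of the page within the file
    compressed_size: u64, // size of the page in bytes, header included
}

/// Hypothetical stand-in for the column chunk metadata fields used here.
struct ChunkMeta {
    data_page_offset: u64,
    dictionary_page_offset: Option<u64>,
}

/// Byte offset at which the column chunk starts.
/// Assumption: an offset of 0 (or one that is not before the data pages)
/// means the writer did not record a usable dictionary page offset.
fn chunk_start(meta: &ChunkMeta) -> u64 {
    match meta.dictionary_page_offset {
        Some(off) if off > 0 && off < meta.data_page_offset => off,
        _ => meta.data_page_offset,
    }
}

/// Infer the dictionary page's byte range from the full page index:
/// whatever lies between the chunk start and the first data page.
fn dictionary_range(meta: &ChunkMeta, page_index: &[PageLoc]) -> Option<Range<u64>> {
    let first_data = page_index.first()?.offset;
    let start = chunk_start(meta);
    (start < first_data).then(|| start..first_data)
}

/// Byte ranges needed to decode the data pages at the `wanted` positions:
/// the dictionary page (if one exists) followed by the requested pages.
/// Positions outside the page index are ignored.
fn ranges_to_fetch(
    meta: &ChunkMeta,
    page_index: &[PageLoc],
    wanted: &[usize],
) -> Vec<Range<u64>> {
    let mut out: Vec<Range<u64>> = dictionary_range(meta, page_index).into_iter().collect();
    for &i in wanted {
        if let Some(p) = page_index.get(i) {
            out.push(p.offset..p.offset + p.compressed_size);
        }
    }
    out
}
```

With something along these lines, a caller could request only the third data page (ranges_to_fetch(&meta, &page_index, &[2])) and still get the dictionary page's bytes first, without having to pass the first data page explicitly.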