Add options to control various aspects of Parquet metadata decoding #8763
Here's an excerpt from a run of the new benchmark that shows the schema is actually skipped. This should get even faster with the metadata index (#8714).
    .with_column_index_policy(self.column_index)
    .with_metadata_options(self.metadata_options.clone());
At some point I could see moving the page index policy into the MetadataOptions and then deprecating a bunch of setters.
parquet/src/file/metadata/parser.rs
Outdated
    // the credentials and keys needed to decrypt metadata
    file_decryption_properties: Option<Arc<FileDecryptionProperties>>,
    // metadata parsing options
    metadata_options: Option<MetadataOptions>,
Wondering if this should be Option<Arc<MetadataOptions>> everywhere.
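To make the trade-off concrete, here is a minimal sketch of what holding the options as Option<Arc<MetadataOptions>> could look like. Everything in it is a hypothetical stand-in written for illustration (the `Parser` and `Decoder` structs and the `schema_columns` field are not the parquet crate's types); it only shows that handing the options down a layer becomes a reference-count bump instead of a struct clone.

```rust
use std::sync::Arc;

/// Hypothetical stand-in for the options struct in this PR.
#[derive(Debug, Default, Clone)]
struct MetadataOptions {
    // Assumption: some payload that would be wasteful to deep-copy.
    schema_columns: Option<Vec<String>>,
}

/// Hypothetical outer layer that owns a shared handle to the options.
struct Parser {
    metadata_options: Option<Arc<MetadataOptions>>,
}

/// Hypothetical inner layer that receives the same handle.
struct Decoder {
    metadata_options: Option<Arc<MetadataOptions>>,
}

impl Parser {
    fn new(metadata_options: Option<Arc<MetadataOptions>>) -> Self {
        Self { metadata_options }
    }

    /// Cloning `Option<Arc<_>>` copies a pointer and bumps a counter;
    /// the `MetadataOptions` value itself is not duplicated.
    fn decoder(&self) -> Decoder {
        Decoder {
            metadata_options: self.metadata_options.clone(),
        }
    }
}

fn main() {
    let options = Arc::new(MetadataOptions {
        schema_columns: Some(vec!["id".to_string(), "name".to_string()]),
    });

    let parser = Parser::new(Some(Arc::clone(&options)));
    let decoder = parser.decoder();

    // One handle here, one in the parser, one in the decoder.
    assert_eq!(Arc::strong_count(&options), 3);

    // Both layers point at the very same allocation.
    let inner = decoder.metadata_options.as_ref().unwrap();
    assert!(Arc::ptr_eq(inner, &options));
    println!("columns: {:?}", inner.schema_columns);
}
```

Since SchemaDescPtr is already an Arc, cloning the struct as it stands today is cheap too; the Arc wrapper would mainly pay off if more (or larger) options are added later.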
This may help with #5999.
    /// [`ParquetMetaDataPushDecoder`]: crate::file::metadata::ParquetMetaDataPushDecoder
    #[derive(Default, Debug, Clone)]
    pub struct MetadataOptions {
        schema_descr: Option<SchemaDescPtr>,
Does this mean (1) a user-provided schema, or (2) that only the columns (min, max, etc.) in schema_descr will be decoded?
It's (1). If you have a large number of files that share the same schema, there's no need to decode the schema in every one of them. Just grab the schema from the first file and reuse it for all the others.
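The reuse pattern described in this reply can be sketched with a toy model. Note the assumptions: `Schema`, `Footer`, `Metadata`, `decode_footer`, and the `with_schema` setter are all hypothetical stand-ins written for this example, not the parquet crate's API, and the "footer" is just a comma-separated string. The point is only the control flow: decode the schema once, then pass it in so later decodes skip that step.

```rust
use std::sync::Arc;

/// Hypothetical stand-in for a decoded schema.
#[derive(Debug, PartialEq)]
struct Schema {
    columns: Vec<String>,
}

/// Hypothetical options object, loosely modeled on the one in this PR.
#[derive(Default, Clone)]
struct MetadataOptions {
    schema: Option<Arc<Schema>>,
}

impl MetadataOptions {
    /// Assumed builder-style setter; the name is illustrative only.
    fn with_schema(mut self, schema: Arc<Schema>) -> Self {
        self.schema = Some(schema);
        self
    }
}

/// Toy footer: an encoded column list plus a row count.
struct Footer {
    schema_bytes: &'static str,
    num_rows: u64,
}

struct Metadata {
    schema: Arc<Schema>,
    num_rows: u64,
    schema_was_decoded: bool,
}

/// Stands in for the expensive part of footer decoding.
fn decode_schema(bytes: &str) -> Schema {
    Schema {
        columns: bytes.split(',').map(|c| c.trim().to_string()).collect(),
    }
}

/// Skips schema decoding whenever the caller supplied a schema.
fn decode_footer(footer: &Footer, options: &MetadataOptions) -> Metadata {
    let (schema, schema_was_decoded) = match &options.schema {
        Some(shared) => (Arc::clone(shared), false),
        None => (Arc::new(decode_schema(footer.schema_bytes)), true),
    };
    Metadata { schema, num_rows: footer.num_rows, schema_was_decoded }
}

fn main() {
    let footers = [
        Footer { schema_bytes: "id, name, ts", num_rows: 10 },
        Footer { schema_bytes: "id, name, ts", num_rows: 20 },
        Footer { schema_bytes: "id, name, ts", num_rows: 30 },
    ];

    // Decode the first file in full, schema included.
    let first = decode_footer(&footers[0], &MetadataOptions::default());
    assert!(first.schema_was_decoded);

    // Reuse that schema for every remaining file.
    let options = MetadataOptions::default().with_schema(Arc::clone(&first.schema));
    for footer in &footers[1..] {
        let metadata = decode_footer(footer, &options);
        assert!(!metadata.schema_was_decoded);
        assert!(Arc::ptr_eq(&metadata.schema, &first.schema));
        println!("{} columns, {} rows", metadata.schema.columns.len(), metadata.num_rows);
    }
}
```

One consequence of this design, at least in the sketch, is that the caller is responsible for knowing the files really do share a schema: the supplied schema is trusted and the footer's own copy is never looked at.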
Here is a ticket that explains the use case a bit more;
this API looks good to me (and actually closes an existing ticket)
(I didn't approve it b/c it is still marked as a draft)
Thanks @alamb. I'm still messing around with the API. I think I like hiding the new options object in the file reader APIs, and just exposing it for the metadata readers. The last wrinkle is figuring out a good way to share across the |
Ok, I think this is ready now. Right now I'm mildly against pulling in the page index policies. They are used at a higher level and I don't think it's worth the thrash to move them. Instead I want to focus on options that impact the |
Which issue does this PR close?
Rationale for this change
This is a first attempt at an object to help control the parsing of the Parquet metadata.
What changes are included in this PR?
Adds a new MetadataOptions struct, and plumbs it down into the Thrift decoder code. The only option for now is to pass in a schema, which then causes the decoder to skip decoding the schema contained in the footer.

Also adds to the metadata bench to demonstrate the time savings from reusing the schema.
Are these changes tested?
Yes, adds a new test.
Are there any user-facing changes?
If there are user-facing changes then we may require documentation to be updated before approving the PR.
If there are any breaking changes to public APIs, please call them out.