@etseidl etseidl commented Oct 31, 2025

Which issue does this PR close?

Rationale for this change

This is a first attempt at an object to help control the parsing of the Parquet metadata.

What changes are included in this PR?

Adds a new MetadataOptions struct, and plumbs it down into the Thrift decoder code. The only option for now is to pass in a schema, which then causes the decoder to skip decoding the schema contained in the footer.

Also extends the metadata benchmark to demonstrate the time savings from reusing the schema.

Are these changes tested?

Yes, adds a new test.

Are there any user-facing changes?

If there are user-facing changes then we may require documentation to be updated before approving the PR.

If there are any breaking changes to public APIs, please call them out.

@github-actions bot added the parquet label (Changes to the parquet crate) Oct 31, 2025
etseidl commented Oct 31, 2025

Here's an excerpt from a run of the new benchmark that shows the schema is actually skipped.

decode parquet metadata time:   [14.401 µs 14.436 µs 14.475 µs]
decode metadata with schema
                        time:   [7.1264 µs 7.1461 µs 7.1677 µs]
decode parquet metadata (wide)
                        time:   [48.440 ms 48.828 ms 49.445 ms]
decode metadata (wide) with schema
                        time:   [43.212 ms 43.452 ms 43.793 ms]

This should get even faster with the metadata index (#8714)
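
As a rough reading of the numbers above, the relative gain can be worked out from the middle estimate of each run. This is only a sketch: the helper names `speedup` and `fraction_saved` are made up for illustration, and the timings are copied from the benchmark output.

```rust
/// Ratio of baseline time to the time measured with a supplied schema.
/// (Hypothetical helper, for illustration only.)
fn speedup(baseline: f64, with_schema: f64) -> f64 {
    baseline / with_schema
}

/// Fraction of the baseline time that was saved by supplying the schema.
/// (Hypothetical helper, for illustration only.)
fn fraction_saved(baseline: f64, with_schema: f64) -> f64 {
    (baseline - with_schema) / baseline
}

/// Middle estimates from the benchmark excerpt: (name, baseline, with schema).
/// The first pair is in microseconds, the second in milliseconds; the ratios
/// are unitless, so the units only need to match within each pair.
fn benchmark_pairs() -> [(&'static str, f64, f64); 2] {
    [
        ("default", 14.436, 7.1461),
        ("wide", 48.828, 43.452),
    ]
}
```

With these inputs the default case comes out at roughly a 2.0x speedup (about half the time saved), and the wide case at roughly 1.12x (about 11% saved).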

Comment on lines +370 to +371
.with_column_index_policy(self.column_index)
.with_metadata_options(self.metadata_options.clone());
Contributor Author

At some point I could see moving the page index policy into the MetadataOptions and then deprecating a bunch of setters.

@etseidl etseidl changed the title from "Metadata options" to "Add options to control various aspects of Parquet metadata decoding" Oct 31, 2025
// the credentials and keys needed to decrypt metadata
file_decryption_properties: Option<Arc<FileDecryptionProperties>>,
// metadata parsing options
metadata_options: Option<MetadataOptions>,
Contributor Author

Wondering if this should be Option<Arc<MetadataOptions>> everywhere.
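
To make the trade-off concrete, here is a minimal sketch of the two storage choices. The types are stand-ins, not the ones in this PR: `skip_list` is a hypothetical field added only so that a by-value clone has something to copy, and `ByValue` / `Shared` are made-up holder names.

```rust
use std::sync::Arc;

/// Stand-in for the options struct. `skip_list` is a hypothetical field,
/// present only to make the cost of a by-value clone visible.
#[derive(Default, Debug, Clone)]
struct MetadataOptions {
    skip_list: Vec<String>,
}

/// Holder that stores the options by value: cloning it copies the Vec.
#[derive(Clone)]
struct ByValue {
    metadata_options: Option<MetadataOptions>,
}

/// Holder that stores the options behind an Arc: cloning it only bumps
/// a reference count, and every clone sees the same allocation.
#[derive(Clone)]
struct Shared {
    metadata_options: Option<Arc<MetadataOptions>>,
}

/// Returns true when a clone of `reader` points at the same options object.
fn clones_share_options(reader: &Shared) -> bool {
    let copy = reader.clone();
    match (&reader.metadata_options, &copy.metadata_options) {
        (Some(a), Some(b)) => Arc::ptr_eq(a, b),
        _ => false,
    }
}
```

Since `SchemaDescPtr` is itself an `Arc<SchemaDescriptor>`, a plain clone is already cheap while the schema is the only field; the `Arc` wrapper would mostly pay off if heavier fields are added later.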

etseidl commented Nov 1, 2025

This may help with #5999

/// [`ParquetMetaDataPushDecoder`]: crate::file::metadata::ParquetMetaDataPushDecoder
#[derive(Default, Debug, Clone)]
pub struct MetadataOptions {
schema_descr: Option<SchemaDescPtr>,
Member

Does this mean (1) a user-provided schema, or (2) that only the (min, max, etc.) columns in schema_descr will be decoded?

Contributor Author

It's (1). Say you have a large number of files that share the same schema: there's no need to decode the schema in every one of them. Just grab the schema from the first file and use it for all the others.
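
The pattern described above can be sketched with toy types. Nothing here is the parquet crate's API: `Schema`, `Footer`, `Metadata`, `decode`, and `decode_all` are hypothetical stand-ins, and the "footer" is just a comma-separated column list rather than Thrift.

```rust
use std::sync::Arc;

/// Stand-in for a decoded schema (not the parquet crate's SchemaDescriptor).
#[derive(Debug, PartialEq)]
struct Schema {
    columns: Vec<String>,
}

/// Stand-in options object: an optional, shared, already-decoded schema.
#[derive(Default, Debug, Clone)]
struct MetadataOptions {
    schema: Option<Arc<Schema>>,
}

/// Toy footer: a comma-separated column list plus a row count.
struct Footer {
    schema_bytes: &'static str,
    num_rows: u64,
}

/// Toy decoded metadata; `schema_was_decoded` records whether work was done.
struct Metadata {
    schema: Arc<Schema>,
    num_rows: u64,
    schema_was_decoded: bool,
}

/// Decode one footer, skipping the schema when the caller supplied one.
fn decode(footer: &Footer, options: &MetadataOptions) -> Metadata {
    match &options.schema {
        // Caller supplied a schema: reuse it and never touch schema_bytes.
        Some(schema) => Metadata {
            schema: Arc::clone(schema),
            num_rows: footer.num_rows,
            schema_was_decoded: false,
        },
        // No schema supplied: pay the full decoding cost.
        None => {
            let columns: Vec<String> =
                footer.schema_bytes.split(',').map(String::from).collect();
            Metadata {
                schema: Arc::new(Schema { columns }),
                num_rows: footer.num_rows,
                schema_was_decoded: true,
            }
        }
    }
}

/// Decode many footers, decoding the schema only for the first one.
fn decode_all(footers: &[Footer]) -> Vec<Metadata> {
    let mut options = MetadataOptions::default();
    let mut out = Vec::with_capacity(footers.len());
    for footer in footers {
        let md = decode(footer, &options);
        if options.schema.is_none() {
            // Grab the schema from the first file and reuse it afterwards.
            options.schema = Some(Arc::clone(&md.schema));
        }
        out.push(md);
    }
    out
}
```

Note that this sketch trusts the caller: it does not check that later files really have the same schema as the first.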

Contributor

Here is a ticket that explains the use case a bit more;

Contributor

@alamb alamb left a comment

this API looks good to me (and actually closes an existing ticket)

alamb commented Nov 5, 2025

(I didn't approve it b/c it is still marked as a draft)

etseidl commented Nov 5, 2025

> (I didn't approve it b/c it is still marked as a draft)

Thanks @alamb. I'm still messing around with the API. I think I like hiding the new options object in the file reader APIs, and just exposing it for the metadata readers. The last wrinkle is figuring out a good way to share it across the ParquetMetaDataReader, ParquetMetaDataPushDecoder, and MetadataParser, with an eye towards sucking the page index policy stuff into the new options object.

@etseidl etseidl marked this pull request as ready for review November 5, 2025 21:35
etseidl commented Nov 5, 2025

Ok, I think this is ready now.

Right now I'm mildly against pulling in the page index policies. They are used at a higher level and I don't think it's worth the thrash to move them. Instead I want to focus on options that impact the FileMetaData parsing (skip stats, transform page encoding stats, etc), and then work in the metadata index to further accelerate the decoding.
