Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Parquet: POC for handling struct child via StatisticsConverter #7365

Open
wants to merge 3 commits into
base: main
Choose a base branch
from

Conversation

kylebarron
Copy link
Contributor

Which issue does this PR close?

Closes #7364. Also related to apache/datafusion#10609 and apache/datafusion#8334.

Rationale for this change

Support some way to handle struct-typed columns in StatisticsConverter.

What changes are included in this PR?

I think there are two ways to handle this:

  1. Support struct-type columns directly, where the user can pass the name of a struct column, and all children are automatically handled. So row_group_mins would return a struct-typed Arrow array.
  2. Adjust StatisticsConverter::try_new to handle a nested column name. So the user would pass a.b.c, which designates a primitive type, and we reuse existing converters. So row_group_mins would return a primitive-typed Arrow array.

This PR prototypes approach 2.

In principle both approaches would be good to have, but approach 1 looked more complex, and approach 2 at least provides a valid workaround.

Are there any user-facing changes?

Yes, a breaking change to the signature of StatisticsConverter to use ColumnPath instead of &str for the column_name. This isn't technically required; we could assume that a . in the column name is a field delimiter, but this seems unnecessary when ColumnPath already exists as a type.

(Similarly, it's weird to me that ProjectionMask::columns takes in &str and not &ColumnPath as input)

@github-actions github-actions bot added the parquet Changes to the parquet crate label Mar 31, 2025
.build()
.unwrap(),
);
let path = "release/2025-02-19.0/theme=addresses/type=address/part-00010-e084a2d7-fea9-41e5-a56f-e638a3307547-c000.zstd.parquet";
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is public data from the Overture project conforming to the GeoParquet specification. That defines a struct column, canonically called bbox, which contains the fields xmin, ymin, xmax, ymax. This allows for row group filtering for spatially-sorted data by inferring the spatial bounds of each row group from the statistics.

In order to use the Arrow statistics converter, which is way simpler than handling the statistics manually, we would need some way to access these struct columns.

Comment on lines +1301 to +1310
for part in column.parts() {
if let Some((_idx, inner_arrow_field)) = fields.find(part) {
match inner_arrow_field.data_type() {
DataType::Struct(inner_fields) => {
fields = inner_fields;
}
_ => {
arrow_field = Some(inner_arrow_field);
}
}
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The core change: recursively look into the arrow field's children if the field is a struct

Comment on lines +1324 to +1325
let mask =
ProjectionMask::columns(parquet_schema, std::iter::once(column.string().as_str()));
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

And use ProjectionMask::columns, which handles searching for nested Parquet fields

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
parquet Changes to the parquet crate
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Parquet StatisticsConverter does not work for struct columns
1 participant