Skip to content

Comments

Parquet: POC for handling struct child via StatisticsConverter#7365

Open
kylebarron wants to merge 3 commits intoapache:mainfrom
kylebarron:kyle/struct-statistics
Open

Parquet: POC for handling struct child via StatisticsConverter#7365
kylebarron wants to merge 3 commits intoapache:mainfrom
kylebarron:kyle/struct-statistics

Conversation

@kylebarron
Copy link
Member

Which issue does this PR close?

Closes #7364. Also related to apache/datafusion#10609 and apache/datafusion#8334.

Rationale for this change

Support some way to handle struct-typed columns in StatisticsConverter.

What changes are included in this PR?

I think there are two ways to handle this:

  1. Support struct-type columns directly, where the user can pass the name of a struct column, and all children are automatically handled. So row_group_mins would return a struct-typed Arrow array.
  2. Adjust StatisticsConverter::try_new to handle a nested column name. So the user would pass a.b.c, which designates a primitive type, and we reuse existing converters. So row_group_mins would return a primitive-typed Arrow array.

This PR prototypes approach 2.

In principle both approaches would be good to have, but approach 1 looked more complex, and approach 2 at least provides a valid workaround.

Are there any user-facing changes?

Yes, a breaking change to the signature of StatisticsConverter to use ColumnPath instead of &str for the column_name. This isn't technically required; we could assume that a . in the column name is a field delimiter, but this seems unnecessary when ColumnPath already exists as a type.

(Similarly, it's weird to me that ProjectionMask::columns takes in &str and not &ColumnPath as input)

@github-actions github-actions bot added the parquet Changes to the parquet crate label Mar 31, 2025
.build()
.unwrap(),
);
let path = "release/2025-02-19.0/theme=addresses/type=address/part-00010-e084a2d7-fea9-41e5-a56f-e638a3307547-c000.zstd.parquet";
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is public data from the Overture project conforming to the GeoParquet specification. That defines a struct column, canonically called bbox, which contains the fields xmin, ymin, xmax, ymax. This allows for row group filtering for spatially-sorted data by inferring the spatial bounds of each row group from the statistics.

In order to use the Arrow statistics converter, which is way simpler than handling the statistics manually, we would need some way to access these struct columns.

Comment on lines +1301 to +1310
for part in column.parts() {
if let Some((_idx, inner_arrow_field)) = fields.find(part) {
match inner_arrow_field.data_type() {
DataType::Struct(inner_fields) => {
fields = inner_fields;
}
_ => {
arrow_field = Some(inner_arrow_field);
}
}
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The core change: recursively look into the arrow field's children if the field is a struct

Comment on lines +1324 to +1325
let mask =
ProjectionMask::columns(parquet_schema, std::iter::once(column.string().as_str()));
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

And use ProjectionMask::columns, which handles searching for nested Parquet fields

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

parquet Changes to the parquet crate

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Parquet StatisticsConverter does not work for struct columns

1 participant