Parquet: POC for handling struct child via StatisticsConverter#7365
Parquet: POC for handling struct child via StatisticsConverter#7365kylebarron wants to merge 3 commits intoapache:mainfrom
Conversation
| .build() | ||
| .unwrap(), | ||
| ); | ||
| let path = "release/2025-02-19.0/theme=addresses/type=address/part-00010-e084a2d7-fea9-41e5-a56f-e638a3307547-c000.zstd.parquet"; |
There was a problem hiding this comment.
This is public data from the Overture project conforming to the GeoParquet specification. That defines a struct column, canonically called bbox, which contains the fields xmin, ymin, xmax, ymax. This allows for row group filtering for spatially-sorted data by inferring the spatial bounds of each row group from the statistics.
In order to use the Arrow statistics converter, which is way simpler than handling the statistics manually, we would need some way to access these struct columns.
| for part in column.parts() { | ||
| if let Some((_idx, inner_arrow_field)) = fields.find(part) { | ||
| match inner_arrow_field.data_type() { | ||
| DataType::Struct(inner_fields) => { | ||
| fields = inner_fields; | ||
| } | ||
| _ => { | ||
| arrow_field = Some(inner_arrow_field); | ||
| } | ||
| } |
There was a problem hiding this comment.
The core change: recursively look into the arrow field's children if the field is a struct
| let mask = | ||
| ProjectionMask::columns(parquet_schema, std::iter::once(column.string().as_str())); |
There was a problem hiding this comment.
And use ProjectionMask::columns, which handles searching for nested Parquet fields
Which issue does this PR close?
Closes #7364. Also related to apache/datafusion#10609 and apache/datafusion#8334.
Rationale for this change
Support some way to handle struct-typed columns in
StatisticsConverter.What changes are included in this PR?
I think there are two ways to handle this:
row_group_minswould return a struct-typed Arrow array.StatisticsConverter::try_newto handle a nested column name. So the user would passa.b.c, which designates a primitive type, and we reuse existing converters. Sorow_group_minswould return a primitive-typed Arrow array.This PR prototypes approach 2.
In principle both approaches would be good to have, but approach 1 looked more complex, and approach 2 at least provides a valid workaround.
Are there any user-facing changes?
Yes, a breaking change to the signature of
StatisticsConverterto useColumnPathinstead of&strfor thecolumn_name. This isn't technically required; we could assume that a.in the column name is a field delimiter, but this seems unnecessary whenColumnPathalready exists as a type.(Similarly, it's weird to me that
ProjectionMask::columnstakes in&strand not&ColumnPathas input)