-
Notifications
You must be signed in to change notification settings - Fork 906
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Parquet: POC for handling struct child via StatisticsConverter #7365
base: main
Are you sure you want to change the base?
Conversation
.build() | ||
.unwrap(), | ||
); | ||
let path = "release/2025-02-19.0/theme=addresses/type=address/part-00010-e084a2d7-fea9-41e5-a56f-e638a3307547-c000.zstd.parquet"; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is public data from the Overture project conforming to the GeoParquet specification. That defines a struct column, canonically called bbox
, which contains the fields xmin
, ymin
, xmax
, ymax
. This allows for row group filtering for spatially-sorted data by inferring the spatial bounds of each row group from the statistics.
In order to use the Arrow statistics converter, which is way simpler than handling the statistics manually, we would need some way to access these struct columns.
for part in column.parts() { | ||
if let Some((_idx, inner_arrow_field)) = fields.find(part) { | ||
match inner_arrow_field.data_type() { | ||
DataType::Struct(inner_fields) => { | ||
fields = inner_fields; | ||
} | ||
_ => { | ||
arrow_field = Some(inner_arrow_field); | ||
} | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The core change: recursively look into the arrow field's children if the field is a struct
let mask = | ||
ProjectionMask::columns(parquet_schema, std::iter::once(column.string().as_str())); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
And use ProjectionMask::columns
, which handles searching for nested Parquet fields
Which issue does this PR close?
Closes #7364. Also related to apache/datafusion#10609 and apache/datafusion#8334.
Rationale for this change
Support some way to handle struct-typed columns in
StatisticsConverter
.What changes are included in this PR?
I think there are two ways to handle this:
row_group_mins
would return a struct-typed Arrow array.StatisticsConverter::try_new
to handle a nested column name. So the user would passa.b.c
, which designates a primitive type, and we reuse existing converters. Sorow_group_mins
would return a primitive-typed Arrow array.This PR prototypes approach 2.
In principle both approaches would be good to have, but approach 1 looked more complex, and approach 2 at least provides a valid workaround.
Are there any user-facing changes?
Yes, a breaking change to the signature of
StatisticsConverter
to useColumnPath
instead of&str
for thecolumn_name
. This isn't technically required; we could assume that a.
in the column name is a field delimiter, but this seems unnecessary whenColumnPath
already exists as a type.(Similarly, it's weird to me that
ProjectionMask::columns
takes in&str
and not&ColumnPath
as input)