[WIP] write file statistics during appends #787

zachschuermann · 2025-03-31T22:16:52Z

What changes are proposed in this pull request?

Warning

this is wip. do not merge. opening to gather some feedback on a path forward on gathering stats during writes.

Note

TLDR: presents a rough stats prototype but the full implementation is currently blocked on arrow's support for nested column support in the StatisticsConverter (or rather parquet_column) API.

Currently we don't write stats at all during our appends. in order to allow for skipping in the read path it's important that we support stats soon. In order to support collecting statistics we will need (1) kernel changes and (2) default engine changes.

the kernel changes are relatively straightforward: we need to add a new stats column to the write_metadata_schema. but note that since the stats column has the same schema as the data files, we must now compute the schema on a per-table basis. that is, instead of a static Transaction::get_write_metadata_schema(), we will have a method on each transaction to fetch the write metadata schema for a given table: Transaction::write_metadata_schema(&self) (and will remove the 'get')
the default engine is where the meat of stats computation lives. This PR includes a prototype of the default engine stats implementation. The flow is as follows:
a. get the format::FileMetaData returned from the parquet write. note there is some confusing naming overlap between two modules: parquet::format::FileMetaData and parquet::file::FileMetaData. The latter (the parquet::file module) is supposed to be a higher-level module for consumption.
b. parse into a Vec<RowGroupMetaData> (see RowGroupMetaData)
c. for each leaf column, create a StatisticsConverter to fetch/parse the min/max/null-counts (this is done per-row-group, then we aggregate into a single value using arrow::compute::min/max)
d. after we have a list of scalar values representing statistics, we can leverage the new create_one API to actually create the stats data to be unioned with the other write metadata to pass back to the transaction.

The existing PR implement a rough prototype for parts a, b, and c above. Unfortunately the full implementation is currently blocked on arrow's support for nested column stats. Specifically, whenever one attempts to create a StatisticsConverter for nested columns, you will just get back null stats even if they are present. This seems to be due to a lack of nested column support in parquet_column.

In order to move forward we can either:

[preferred]: work together with arrow-rs to improve parquet_column to gain nested column support and build out full stats support in kernel
only support non-nested: would likely be introducing a non-trivial amount of code to force non-nested and/or fill nulls for nested fields, not sure if it's worth pursuing given more limited utility
do a read: probably the easiest to implement immediately but this requires additional IO (blocked by first write IO) just for stats - need to further investigate the ability to fetch just the footer.

This PR highlights a few other places to investigate:

no nested column support (discussed above)
the aggregation from per-row-group statistics into file-level min/max just yields native (rust) types, then we convert back into arrow - feels like a slight impedance mismatch but since this is always just going to be one row of data, doesn't seem worth pursuing any optimization (and actually having native data makes it easy to use the new create_one API)
lastly, one piece I did not yet build out in this PR is the ability to embed the stats arrays into the table schema (to write to stats column in add actions). this can likely be achieved easily with the new create_one API by just passing in the table schema and the leaf values with the stats we just computed.

This PR affects the following public APIs

breaking: removed Transaction::get_write_metadata() static function in favor of a method Transaction::write_metadata(&self)
New 'stats' field used in write_metadata.

How was this change tested?

todo

zachschuermann added 2 commits March 26, 2025 18:24

stats for writes

04e4320

statsconverter

2c4520c

github-actions bot assigned zachschuermann Mar 31, 2025

zachschuermann requested review from nicklan and scovich March 31, 2025 22:17

This was referenced Mar 31, 2025

[tracking issue] delta kernel write path #377

Open

write file statistics during appends #788

Open

github-actions bot added the breaking-change Change that will require a version bump label Mar 31, 2025

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[WIP] write file statistics during appends #787

[WIP] write file statistics during appends #787

zachschuermann commented Mar 31, 2025

[WIP] write file statistics during appends #787

Are you sure you want to change the base?

[WIP] write file statistics during appends #787

Conversation

zachschuermann commented Mar 31, 2025

What changes are proposed in this pull request?

This PR affects the following public APIs

How was this change tested?