Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[WIP] write file statistics during appends #787

Draft
wants to merge 2 commits into
base: main
Choose a base branch
from

Conversation

zachschuermann
Copy link
Collaborator

What changes are proposed in this pull request?

Warning

this is wip. do not merge. opening to gather some feedback on a path forward on gathering stats during writes.

Note

TLDR: presents a rough stats prototype but the full implementation is currently blocked on arrow's support for nested column support in the StatisticsConverter (or rather parquet_column) API.

Currently we don't write stats at all during our appends. in order to allow for skipping in the read path it's important that we support stats soon. In order to support collecting statistics we will need (1) kernel changes and (2) default engine changes.

  1. the kernel changes are relatively straightforward: we need to add a new stats column to the write_metadata_schema. but note that since the stats column has the same schema as the data files, we must now compute the schema on a per-table basis. that is, instead of a static Transaction::get_write_metadata_schema(), we will have a method on each transaction to fetch the write metadata schema for a given table: Transaction::write_metadata_schema(&self) (and will remove the 'get')
  2. the default engine is where the meat of stats computation lives. This PR includes a prototype of the default engine stats implementation. The flow is as follows:
    a. get the format::FileMetaData returned from the parquet write. note there is some confusing naming overlap between two modules: parquet::format::FileMetaData and parquet::file::FileMetaData. The latter (the parquet::file module) is supposed to be a higher-level module for consumption.
    b. parse into a Vec<RowGroupMetaData> (see RowGroupMetaData)
    c. for each leaf column, create a StatisticsConverter to fetch/parse the min/max/null-counts (this is done per-row-group, then we aggregate into a single value using arrow::compute::min/max)
    d. after we have a list of scalar values representing statistics, we can leverage the new create_one API to actually create the stats data to be unioned with the other write metadata to pass back to the transaction.

The existing PR implement a rough prototype for parts a, b, and c above. Unfortunately the full implementation is currently blocked on arrow's support for nested column stats. Specifically, whenever one attempts to create a StatisticsConverter for nested columns, you will just get back null stats even if they are present. This seems to be due to a lack of nested column support in parquet_column.

In order to move forward we can either:

  1. [preferred]: work together with arrow-rs to improve parquet_column to gain nested column support and build out full stats support in kernel
  2. only support non-nested: would likely be introducing a non-trivial amount of code to force non-nested and/or fill nulls for nested fields, not sure if it's worth pursuing given more limited utility
  3. do a read: probably the easiest to implement immediately but this requires additional IO (blocked by first write IO) just for stats - need to further investigate the ability to fetch just the footer.

This PR highlights a few other places to investigate:

  1. no nested column support (discussed above)
  2. the aggregation from per-row-group statistics into file-level min/max just yields native (rust) types, then we convert back into arrow - feels like a slight impedance mismatch but since this is always just going to be one row of data, doesn't seem worth pursuing any optimization (and actually having native data makes it easy to use the new create_one API)
  3. lastly, one piece I did not yet build out in this PR is the ability to embed the stats arrays into the table schema (to write to stats column in add actions). this can likely be achieved easily with the new create_one API by just passing in the table schema and the leaf values with the stats we just computed.

This PR affects the following public APIs

  1. breaking: removed Transaction::get_write_metadata() static function in favor of a method Transaction::write_metadata(&self)
  2. New 'stats' field used in write_metadata.

How was this change tested?

todo

@github-actions github-actions bot added the breaking-change Change that will require a version bump label Mar 31, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
breaking-change Change that will require a version bump
Projects
None yet
Development

Successfully merging this pull request may close these issues.

1 participant