[WIP] write file statistics during appends #787
Draft
## What changes are proposed in this pull request?

> [!WARNING]
> This is WIP, do not merge. Opening to gather some feedback on a path forward for gathering stats during writes.

> [!NOTE]
> TL;DR: this presents a rough stats prototype, but the full implementation is currently blocked on arrow's lack of nested column support in the `StatisticsConverter` (or rather `parquet_column`) API.

Currently we don't write stats at all during our appends. In order to allow for data skipping in the read path, it's important that we support stats soon. In order to support collecting statistics we will need (1) kernel changes and (2) default engine changes.
1. Kernel changes: add a `stats` column to the `write_metadata_schema`. Note that since the `stats` column has the same schema as the data files, we must now compute the schema on a per-table basis. That is, instead of a static `Transaction::get_write_metadata_schema()`, we will have a method on each transaction to fetch the write metadata schema for a given table: `Transaction::write_metadata_schema(&self)` (and will remove the 'get').
2. Default engine changes (see the sketch after this list):
   a. Get the `format::FileMetaData` returned from the parquet write. Note there is some confusing naming overlap between two modules: `parquet::format::FileMetaData` and `parquet::file::FileMetaData`. The latter (the `parquet::file` module) is supposed to be a higher-level module for consumption.
   b. Parse it into a `Vec<RowGroupMetaData>` (see `RowGroupMetaData`).
   c. For each leaf column, create a `StatisticsConverter` to fetch/parse the min/max/null-counts (this is done per row group; we then aggregate into a single value using `arrow::compute::min`/`max`).
   d. Once we have a list of scalar values representing statistics, we can leverage the new `create_one` API to actually create the stats data to be unioned with the other write metadata to pass back to the transaction.
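A minimal sketch of steps (a)-(c), assuming arrow-rs's `StatisticsConverter` API; the function names are hypothetical and the `Int64Array` downcast is hard-coded purely for illustration:

```rust
use std::sync::Arc;

use arrow::array::{Array, ArrayRef, Int64Array};
use arrow::datatypes::Schema;
use parquet::arrow::arrow_reader::statistics::StatisticsConverter;
use parquet::errors::Result;
use parquet::file::metadata::RowGroupMetaData;
use parquet::format::FileMetaData;
use parquet::schema::types::{from_thrift, SchemaDescriptor};

/// Steps (a) + (b): convert the thrift `format::FileMetaData` returned by
/// `ArrowWriter::close()` into parsed `RowGroupMetaData`.
fn parse_row_groups(
    file_meta: FileMetaData,
) -> Result<(Arc<SchemaDescriptor>, Vec<RowGroupMetaData>)> {
    let root = from_thrift(&file_meta.schema)?;
    let descr = Arc::new(SchemaDescriptor::new(root));
    let row_groups = file_meta
        .row_groups
        .into_iter()
        .map(|rg| RowGroupMetaData::from_thrift(descr.clone(), rg))
        .collect::<Result<Vec<_>>>()?;
    Ok((descr, row_groups))
}

/// Step (c) for a single leaf column: per-row-group mins, aggregated into one
/// per-file value. Real code must dispatch on the leaf type rather than
/// assuming `i64`, and do the same for max and null-count.
fn file_level_min(
    column: &str,
    arrow_schema: &Schema,
    descr: &SchemaDescriptor,
    row_groups: &[RowGroupMetaData],
) -> Result<Option<i64>> {
    let converter = StatisticsConverter::try_new(column, arrow_schema, descr)?;
    let mins: ArrayRef = converter.row_group_mins(row_groups.iter())?;
    let mins = mins
        .as_any()
        .downcast_ref::<Int64Array>()
        .expect("example assumes an i64 leaf");
    Ok(arrow::compute::min(mins))
}
```

Step (c) is exactly where this sketch runs into the nested-column limitation described next.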
The existing PR implements a rough prototype for parts a, b, and c above. Unfortunately, the full implementation is currently blocked on arrow's support for nested column stats. Specifically, whenever one attempts to create a `StatisticsConverter` for nested columns, you just get back null stats even when they are present. This seems to be due to a lack of nested column support in `parquet_column`.
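A sketch of the failing path, reusing the shapes from the sketch above; `point.x` is a hypothetical nested leaf, and the dotted-name addressing is itself an assumption:

```rust
use arrow::array::ArrayRef;
use arrow::datatypes::Schema;
use parquet::arrow::arrow_reader::statistics::StatisticsConverter;
use parquet::errors::Result;
use parquet::file::metadata::RowGroupMetaData;
use parquet::schema::types::SchemaDescriptor;

/// Attempt step (c) for a nested leaf. Per this PR's findings, this path
/// yields null stats (rather than the real values recorded in the footer),
/// since `parquet_column` lacks nested column support.
fn nested_column_mins(
    arrow_schema: &Schema,
    descr: &SchemaDescriptor,
    row_groups: &[RowGroupMetaData],
) -> Result<ArrayRef> {
    let converter = StatisticsConverter::try_new("point.x", arrow_schema, descr)?;
    converter.row_group_mins(row_groups.iter())
}
```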
In order to move forward, we can fix arrow's `parquet_column` to gain nested column support and build out full stats support in kernel.

This PR highlights a few other places to investigate: the new `create_one` API, and constructing the stats data (the `stats` column in add actions). The latter can likely be achieved easily with the new `create_one` API by just passing in the table schema and the leaf values with the stats we just computed.
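A minimal sketch of that idea, assuming `create_one` hangs off the engine's expression handler and takes the target schema plus one scalar per leaf field in schema order; both the handler accessor and the signature are assumptions here, not a confirmed API:

```rust
use delta_kernel::expressions::Scalar;
use delta_kernel::schema::SchemaRef;
use delta_kernel::{DeltaResult, Engine, EngineData};

/// Hypothetical glue for step (d): build the single-row stats data from the
/// scalars we computed out of the parquet footer. `stats_schema` would be,
/// e.g., numRecords plus minValues/maxValues/nullCount mirroring the table
/// schema; the exact `create_one` signature may differ.
fn build_stats_row(
    engine: &dyn Engine,
    stats_schema: SchemaRef,
    leaf_values: &[Scalar], // the per-file stats we just computed
) -> DeltaResult<Box<dyn EngineData>> {
    engine
        .get_expression_handler()
        .create_one(stats_schema, leaf_values)
}
```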
## This PR affects the following public APIs

- Removes the `Transaction::get_write_metadata()` static function in favor of a method `Transaction::write_metadata(&self)`.
- Adds a new `stats` column to `write_metadata`.
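For illustration, the call-site change implied by the schema API described earlier (names taken from the prose above; exact signatures may differ while this is WIP):

```rust
use delta_kernel::schema::SchemaRef;
use delta_kernel::transaction::Transaction;

fn fetch_write_metadata_schema(txn: &Transaction) -> SchemaRef {
    // Before this PR (static; the same schema for every table):
    //     let schema = Transaction::get_write_metadata_schema();
    // After: the `stats` column mirrors the table's data schema, so the
    // write metadata schema is computed per table via a method.
    txn.write_metadata_schema()
}
```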
## How was this change tested?
todo