Description
Is your feature request related to a problem or challenge? Please describe what you are trying to do.
@JanKaul reports via Discord:
In the parquet metadata for column chunks is a field for distinct_counts, it is currently not populated or maybe just for dictionary columns. Distinct count statistics play an important role for join order selection, so it would be very good to provide that to a query engine. However, calculating distinct counts is very expensive and probably the reason why it is not done for most parquet writers.
The distinct_count field is defined here:
https://github.com/apache/parquet-format/blob/9fd57b59e0ce1a82a69237dcf8977d3e72a2965d/src/main/thrift/parquet.thrift#L292-L293
The reason this field is not written at this time is that computing distinct counts for a column can be quite expensive, depending on the column's data type (e.g. the writer potentially has to hash all the values, keep track of them, etc.)
Describe the solution you'd like
Allow the Rust parquet writers to populate this field somehow
Describe alternatives you've considered
- Add a basic implementation to the writer that is optional (and off by default). Careful memory management (and reporting / limiting) is probably critical
- Add an API / callback for populating the statistics, i.e. require user code to manage the gathering / fallback
I would suggest personally:
- Built-in distinct statistics that can be enabled per column (as distinct counts are much more important for some columns)
- Add a memory limit for computing the distinct count; if that limit is exceeded, stop capturing the statistic and write the data without it
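To make the memory-limit suggestion concrete, here is a minimal sketch of an exact counter that gives up once a byte budget is exceeded. Everything in it is hypothetical: `BoundedDistinctCounter`, `update`, `distinct_count`, and the `max_bytes` parameter are illustrative names and not part of the parquet crate, and the per-value cost is only a rough estimate that ignores hash table overhead.

```rust
use std::collections::HashSet;

/// Hypothetical helper (not part of the `parquet` crate): tracks exact
/// distinct values for one column chunk until a memory budget is exceeded.
pub struct BoundedDistinctCounter {
    /// `None` once the budget has been exceeded and tracking was abandoned.
    seen: Option<HashSet<Vec<u8>>>,
    /// Rough estimate of the bytes held by `seen`.
    bytes_used: usize,
    /// Assumed per-column budget in bytes.
    max_bytes: usize,
}

impl BoundedDistinctCounter {
    pub fn new(max_bytes: usize) -> Self {
        Self {
            seen: Some(HashSet::new()),
            bytes_used: 0,
            max_bytes,
        }
    }

    /// Record one value, given as its raw byte encoding.
    pub fn update(&mut self, value: &[u8]) {
        let seen = match self.seen.as_mut() {
            Some(set) => set,
            None => return, // tracking was already abandoned
        };
        if seen.contains(value) {
            return; // duplicates cost nothing extra
        }
        // Rough cost: the payload plus the `Vec` header.
        let cost = value.len() + std::mem::size_of::<Vec<u8>>();
        if self.bytes_used + cost > self.max_bytes {
            // Over budget: drop everything and stop capturing the statistic.
            self.seen = None;
            self.bytes_used = 0;
            return;
        }
        seen.insert(value.to_vec());
        self.bytes_used += cost;
    }

    /// `Some(count)` while within budget, `None` if tracking was abandoned.
    pub fn distinct_count(&self) -> Option<u64> {
        self.seen.as_ref().map(|set| set.len() as u64)
    }
}
```

Under this design a writer would only populate distinct_count when the counter returns `Some`, so exceeding the budget degrades to today's behavior (no statistic) rather than to an incorrect one. An approximate sketch such as HyperLogLog would bound memory differently, but whether an approximate value is acceptable in this field is a separate question.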
Additional context