Implement (optional) distinct count population in Parquet statistics

**Is your feature request related to a problem or challenge? Please describe what you are trying to do.**
@JanKaul reports via Discord:

> In the parquet metadata for column chunks is a field for distinct_counts, it is currently not populated or maybe just for dictionary columns. Distinct count statistics play an important role for join order selection, so it would be very good to provide that to a query engine. However, calculating distinct counts is very expensive and probably the reason why it is not done for most parquet writers.

The `distinct_count` field is defined here:
https://github.com/apache/parquet-format/blob/9fd57b59e0ce1a82a69237dcf8977d3e72a2965d/src/main/thrift/parquet.thrift#L292-L293

This field is not written at this time because computing distinct counts for a column can be quite expensive, depending on the data type and the cardinality of the data (e.g. the writer potentially has to hash all the values, keep track of every distinct one seen so far, etc.).

**Describe the solution you'd like**
Allow the Rust parquet writers to populate this field somehow


**Describe alternatives you've considered**
1. Add a basic built-in implementation to the writer that is optional (and off by default). Careful memory management (and reporting / limiting) is probably critical.
2. Add an API / callback for populating the statistics, i.e. require user code to manage the gathering / fallback.
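To make alternative 2 more concrete, here is a rough sketch of what a callback-style hook could look like. Everything in it is hypothetical: `DistinctCountProvider` and `resolve_distinct_counts` are names made up for illustration and do not exist in the `parquet` crate, and a real design would need to hook into the column writer rather than take a list of column names.

```rust
/// Hypothetical trait (not part of the `parquet` crate): user code that
/// already knows, or can cheaply compute, distinct counts implements this.
pub trait DistinctCountProvider {
    /// Returns the distinct count for the given column, or `None` to leave
    /// the `distinct_count` field unset for that column.
    fn distinct_count(&mut self, column_path: &str) -> Option<u64>;
}

/// Convenience: any closure with the matching signature can act as a provider.
impl<F> DistinctCountProvider for F
where
    F: FnMut(&str) -> Option<u64>,
{
    fn distinct_count(&mut self, column_path: &str) -> Option<u64> {
        self(column_path)
    }
}

/// Hypothetical stand-in for the point where a writer would ask the
/// provider for a value per column before serializing the statistics.
pub fn resolve_distinct_counts(
    columns: &[&str],
    provider: &mut dyn DistinctCountProvider,
) -> Vec<Option<u64>> {
    columns
        .iter()
        .map(|column| provider.distinct_count(*column))
        .collect()
}
```

With this shape, the writer never allocates memory for distinct tracking itself; the cost and any fallback policy stay entirely in user code.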

I would suggest personally:
1. Built-in distinct count statistics that can be enabled per column (as distinct counts are much more important for some columns than others)
2. A memory limit for computing the distinct count; if it is exceeded, stop tracking distinct values for that column and write the data without the distinct count
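As a minimal sketch of the memory-limited approach suggested above (an illustration only, not a proposed final design): the counter below tracks exact distinct values in a `HashSet` until an approximate byte budget is exceeded, then gives up and reports `None` so the writer can omit the field. `BoundedDistinctCounter`, its methods, and the byte-accounting heuristic are all assumptions made for this example; none of them exist in the `parquet` crate.

```rust
use std::collections::HashSet;

/// Hypothetical helper (not part of the `parquet` crate): counts distinct
/// values exactly until an approximate memory budget is exceeded.
pub struct BoundedDistinctCounter {
    /// `Some` while still tracking; `None` once the budget was exceeded.
    seen: Option<HashSet<Vec<u8>>>,
    /// Rough estimate of bytes held by `seen` (ignores hash table overhead).
    approx_bytes: usize,
    max_bytes: usize,
}

impl BoundedDistinctCounter {
    /// Assumed fixed cost per stored value, on top of the value bytes.
    const PER_ENTRY_OVERHEAD: usize = std::mem::size_of::<Vec<u8>>();

    pub fn new(max_bytes: usize) -> Self {
        Self {
            seen: Some(HashSet::new()),
            approx_bytes: 0,
            max_bytes,
        }
    }

    /// Record one value, given as its byte representation.
    pub fn update(&mut self, value: &[u8]) {
        let Some(seen) = self.seen.as_mut() else {
            return; // Already over budget: nothing more to track.
        };
        if seen.contains(value) {
            return;
        }
        let cost = value.len() + Self::PER_ENTRY_OVERHEAD;
        if self.approx_bytes + cost <= self.max_bytes {
            seen.insert(value.to_vec());
            self.approx_bytes += cost;
        } else {
            // Budget exceeded: free the set and stop collecting.
            self.seen = None;
            self.approx_bytes = 0;
        }
    }

    /// The exact distinct count, or `None` if tracking was abandoned.
    pub fn finish(&self) -> Option<u64> {
        self.seen.as_ref().map(|seen| seen.len() as u64)
    }
}
```

A writer could keep one such counter per column that has the feature enabled and only populate `distinct_count` when `finish()` returns `Some`. An approximate sketch (e.g. HyperLogLog) would use far less memory, but whether an estimate is acceptable for this field is a separate question.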


**Additional context**
