Implement (optional) distinct count population in Parquet statistics #8608

@alamb

Is your feature request related to a problem or challenge? Please describe what you are trying to do.
@JanKaul reports via Discord:

The Parquet metadata for column chunks has a distinct_count field, but it is currently not populated, or perhaps only for dictionary columns. Distinct count statistics play an important role in join order selection, so it would be very good to provide them to a query engine. However, calculating distinct counts is very expensive, which is probably why most Parquet writers do not do it.

The distinct_count field is defined here:
https://github.com/apache/parquet-format/blob/9fd57b59e0ce1a82a69237dcf8977d3e72a2965d/src/main/thrift/parquet.thrift#L292-L293

The reason this is not written at this time is that computing distinct counts for columns can be quite expensive, depending on the data type (e.g. the writer potentially has to hash all the values, keep track of them, etc.).

Describe the solution you'd like
Allow the Rust parquet writers to populate this field somehow

Describe alternatives you've considered

  1. Add a basic implementation to the writer that is optional (and off by default). Careful memory management (and reporting / limiting) is probably critical
  2. Add an API / callback for populating the statistics -- i.e. require user code to manage the gathering / fallback
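
To make the second alternative concrete, here is a minimal sketch of what a callback-style API could look like. Everything in it is hypothetical: `DistinctCountCollector`, `ExactCollector`, and `write_chunk` are illustrative names, not part of the parquet crate, and `write_chunk` is only a stand-in for the point where a real writer would close a column chunk.

```rust
use std::collections::HashSet;

/// Hypothetical trait (not in the parquet crate): user code decides how
/// distinct values are gathered and when to give up.
pub trait DistinctCountCollector {
    /// Called once per non-null value written to the column chunk.
    fn observe(&mut self, value: &[u8]);
    /// Called when the chunk is closed. `None` means "do not write the stat".
    fn finish(&mut self) -> Option<u64>;
}

/// Example user-side implementation: exact counting with a hash set.
#[derive(Default)]
pub struct ExactCollector {
    seen: HashSet<Vec<u8>>,
}

impl DistinctCountCollector for ExactCollector {
    fn observe(&mut self, value: &[u8]) {
        if !self.seen.contains(value) {
            self.seen.insert(value.to_vec());
        }
    }

    fn finish(&mut self) -> Option<u64> {
        let count = self.seen.len() as u64;
        // Reset so the same collector can be reused for the next chunk.
        self.seen.clear();
        Some(count)
    }
}

/// Stand-in for a writer closing one column chunk. A real writer would
/// store the returned value in the chunk's distinct_count statistic.
pub fn write_chunk(
    values: &[&str],
    mut collector: Option<Box<dyn DistinctCountCollector>>,
) -> Option<u64> {
    for v in values {
        if let Some(c) = collector.as_mut() {
            c.observe(v.as_bytes());
        }
    }
    collector.as_mut().and_then(|c| c.finish())
}
```

With this shape, a caller that passes no collector gets no distinct count, which matches today's behavior, and the memory policy stays entirely in user code.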

I would suggest personally:

  1. Built-in distinct count statistics that can be enabled per column (as distinct counts are much more important for some columns)
  2. Add a memory limit for computing the distinct count, and if that limit is exceeded, stop capturing the statistic and write the data without it
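
The memory-limited behavior suggested above could be sketched roughly as follows. This is an illustration under stated assumptions, not a proposed implementation: `BoundedDistinctCounter` is a hypothetical name, values are treated as raw bytes, and the memory accounting is a crude estimate that ignores hash table overhead.

```rust
use std::collections::HashSet;

/// Hypothetical helper (not in the parquet crate): tracks distinct values
/// for one column chunk until an approximate memory budget is exceeded.
pub struct BoundedDistinctCounter {
    /// `None` once the budget has been exceeded and tracking was abandoned.
    seen: Option<HashSet<Vec<u8>>>,
    bytes_used: usize,
    memory_limit: usize,
}

impl BoundedDistinctCounter {
    pub fn new(memory_limit: usize) -> Self {
        Self {
            seen: Some(HashSet::new()),
            bytes_used: 0,
            memory_limit,
        }
    }

    /// Record one value. After the limit is hit this becomes a no-op.
    pub fn update(&mut self, value: &[u8]) {
        let seen = match self.seen.as_mut() {
            Some(seen) => seen,
            None => return, // already gave up on this chunk
        };
        if seen.contains(value) {
            return;
        }
        // Rough per-entry cost: payload bytes plus the Vec header.
        let cost = value.len() + std::mem::size_of::<Vec<u8>>();
        if self.bytes_used + cost > self.memory_limit {
            // Over budget: drop the set and stop capturing the statistic.
            self.seen = None;
            self.bytes_used = 0;
            return;
        }
        seen.insert(value.to_vec());
        self.bytes_used += cost;
    }

    /// `Some(n)` if every value was tracked, `None` if tracking was abandoned.
    pub fn distinct_count(&self) -> Option<u64> {
        self.seen.as_ref().map(|s| s.len() as u64)
    }
}
```

Returning `None` after the budget is exceeded lets the writer leave distinct_count unset rather than record a misleading partial count. An approximate sketch such as HyperLogLog would bound memory differently, but then the written value would no longer be exact.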

Additional context

Labels: enhancement, parquet
