[Parquet][C++] Logical types with sort order UNKNOWN are missing null_count statistics #46205

Open
paleolimbot opened this issue Apr 22, 2025 · 3 comments

Comments

@paleolimbot
Member

Describe the enhancement requested

Once variant and geometry are added, the C++ Parquet implementation will have several logical types with a sort order of UNKNOWN. The current statistics implementation does not calculate a null count or add it to the column metadata when the sort order is unknown, so null_count will be missing for geometry, geography, and variant. For geometry in particular, the null count is needed to push down a query rectangle effectively; without it there is no mechanism to detect row groups that are entirely null.
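To make the pushdown concern concrete, here is a minimal sketch of a row group skip check. `Box`, `ChunkStats`, `Intersects`, and `CanSkip` are hypothetical names invented for illustration and are not part of the Arrow/Parquet C++ API; the sketch assumes `num_values` counts nulls as well as non-null values.

```cpp
#include <cstdint>
#include <optional>

// Hypothetical types for illustration; not the Arrow/Parquet C++ API.
struct Box {
  double xmin, ymin, xmax, ymax;
};

struct ChunkStats {
  int64_t num_values;                 // total values in the chunk, nulls included
  std::optional<int64_t> null_count;  // absent when the writer did not record it
  std::optional<Box> bbox;            // absent when no non-null geometry was seen
};

inline bool Intersects(const Box& a, const Box& b) {
  return a.xmin <= b.xmax && b.xmin <= a.xmax &&
         a.ymin <= b.ymax && b.ymin <= a.ymax;
}

// Returns true only when the chunk provably contains no row matching the query.
inline bool CanSkip(const ChunkStats& stats, const Box& query) {
  // An all-null chunk can only be recognized if null_count was written.
  if (stats.null_count.has_value() && *stats.null_count == stats.num_values) {
    return true;
  }
  // A bounding box that misses the query rectangle also rules the chunk out.
  if (stats.bbox.has_value() && !Intersects(*stats.bbox, query)) {
    return true;
  }
  // Otherwise the reader has to assume the chunk may contain matches.
  return false;
}
```

Under these assumptions, a chunk with neither a null count nor a bounding box cannot be skipped, which is the all-null case described above.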

I'm not sure what the best way to implement this is. For geometry specifically, we could keep track of the null count in the GeoStatistics, but this wouldn't help with variant. I'm also not sure whether the null count + statistics should be written at the page level for these types.

Noted by @wgtmac in #45459

Component(s)

Parquet, C++

@wgtmac
Member

wgtmac commented Apr 23, 2025

The current parquet::Statistics implementation is tied to TypeDefinedOrder. We will also add an IEEE754TotalOrder, as proposed by apache/parquet-format#221. Perhaps we can refactor the parquet::Statistics to be aware of column order? For TypeDefinedOrder and IEEE754TotalOrder, parquet::Statistics would collect all fields. For UndefinedOrder, the stats would write empty min and max but keep the other fields, including null_count.
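A rough sketch of what an order-aware collector could look like is below. `ColumnOrderKind` and `OrderAwareStats` are hypothetical names for illustration only and do not reflect the actual parquet::Statistics interface; the sketch uses `int64_t` values and models a null slot as `std::nullopt`. An IEEE754TotalOrder case would presumably follow the same branch as the type-defined one, with a different comparison.

```cpp
#include <algorithm>
#include <cstdint>
#include <optional>

// Hypothetical names for illustration; not the parquet::Statistics interface.
enum class ColumnOrderKind { kTypeDefined, kUndefined };

class OrderAwareStats {
 public:
  explicit OrderAwareStats(ColumnOrderKind order) : order_(order) {}

  // std::nullopt stands in for a null slot.
  void Update(const std::optional<int64_t>& value) {
    if (!value.has_value()) {
      ++null_count_;  // order-independent, so always counted
      return;
    }
    if (order_ == ColumnOrderKind::kUndefined) {
      return;  // no meaningful comparison, so min/max stay unset
    }
    min_ = min_.has_value() ? std::min(*min_, *value) : *value;
    max_ = max_.has_value() ? std::max(*max_, *value) : *value;
  }

  int64_t null_count() const { return null_count_; }
  const std::optional<int64_t>& min() const { return min_; }
  const std::optional<int64_t>& max() const { return max_; }

 private:
  ColumnOrderKind order_;
  int64_t null_count_ = 0;
  std::optional<int64_t> min_;
  std::optional<int64_t> max_;
};
```

With an undefined order the collector still reports a null count while leaving min and max empty, which matches the behavior suggested in this comment.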

cc @mapleFU

@mapleFU
Member

mapleFU commented Apr 23, 2025

> Perhaps we can refactor the parquet::Statistics to be aware of column order

Statistics has different parts: null_count does not need to be aware of the order, but min, max, and related fields should...

@paleolimbot
Member Author

No rush on my end (and no offense taken if either of you would rather take this on!), but I started #46275 to wrap my head around the issue. Happy to take pretty much any angle and run with it (or review if somebody else would like to take it on!)

3 participants