Once variant and geometry are added, the C++ Parquet implementation will have several logical types with a sort order of UNKNOWN. The current statistics implementation does not calculate a null count or add it to the column metadata when the sort order is unknown, so this piece of information will be missing for geometry, geography, and variant. For geometry in particular, the null count is needed to push down a query rectangle effectively; without it there is no mechanism to detect row groups that are completely null.
I'm not sure what the best way to implement this is. For geometry specifically, we could keep track of the null count in the GeoStatistics, but this wouldn't help with variant. I'm also not sure whether the null count and statistics should be written at the page level for these types.
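To make the pushdown concern concrete, here is a minimal sketch of the check a reader could perform if the null count were available. `SketchChunkMetadata` and `IsAllNull` are hypothetical names for illustration, not part of the Arrow/Parquet C++ API, and the sketch assumes a flat (non-repeated) column whose value count includes nulls.

```cpp
#include <cassert>
#include <cstdint>
#include <optional>

// Hypothetical, simplified view of per-column-chunk metadata.
// Assumption: num_values counts every value in the chunk, nulls included.
struct SketchChunkMetadata {
  int64_t num_values = 0;
  // Empty when the writer did not record a null count, which is the
  // situation described above for types with an unknown sort order.
  std::optional<int64_t> null_count;
};

// Returns true only when the metadata proves that every value is null.
// Without a recorded null count nothing can be proven, so the chunk
// must still be read.
inline bool IsAllNull(const SketchChunkMetadata& meta) {
  return meta.null_count.has_value() && *meta.null_count == meta.num_values;
}
```

With a recorded null count equal to the value count, a reader could skip the chunk for any query rectangle; with the null count missing, the same chunk has to be scanned.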
The current parquet::Statistics implementation is tied to TypeDefinedOrder. We will also add an IEEE754TotalOrder, as proposed by apache/parquet-format#221. Perhaps we can refactor parquet::Statistics to be aware of the column order? For TypeDefinedOrder and IEEE754TotalOrder, parquet::Statistics would collect all fields. For UndefinedOrder, the statistics would write empty min and max values but keep the other fields, including null_count.
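A rough sketch of what order-aware collection might look like is below. Every name in it (`SketchColumnOrder`, `SketchStatistics`, `SketchStatisticsCollector`) is hypothetical and does not reflect the real parquet::Statistics interface. For simplicity it treats values as byte strings and uses plain lexicographic comparison as a stand-in for both defined orders.

```cpp
#include <cassert>
#include <cstdint>
#include <optional>
#include <string>

// Hypothetical stand-in for the column orders discussed above.
enum class SketchColumnOrder {
  kTypeDefinedOrder,
  kIeee754TotalOrder,
  kUndefinedOrder
};

// Hypothetical, simplified statistics record.
struct SketchStatistics {
  std::optional<std::string> min;  // stays empty for an undefined order
  std::optional<std::string> max;  // stays empty for an undefined order
  int64_t null_count = 0;
  int64_t num_values = 0;          // non-null values seen so far
};

class SketchStatisticsCollector {
 public:
  explicit SketchStatisticsCollector(SketchColumnOrder order) : order_(order) {}

  // Records one value; std::nullopt represents a null.
  void Update(const std::optional<std::string>& value) {
    if (!value.has_value()) {
      ++stats_.null_count;  // counted regardless of the column order
      return;
    }
    ++stats_.num_values;
    if (order_ == SketchColumnOrder::kUndefinedOrder) {
      return;  // no meaningful comparison exists, so skip min/max
    }
    // Assumption: lexicographic comparison stands in for the real order.
    if (!stats_.min.has_value() || *value < *stats_.min) stats_.min = *value;
    if (!stats_.max.has_value() || *value > *stats_.max) stats_.max = *value;
  }

  const SketchStatistics& stats() const { return stats_; }

 private:
  SketchColumnOrder order_;
  SketchStatistics stats_;
};
```

The point of the design is that the null count no longer depends on whether min and max can be computed, so geometry, geography, and variant columns would all get it from the same code path.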
No rush on my end (and no offense taken if either of you would rather take this on!), but I started #46275 to wrap my head around the issue. Happy to take pretty much any angle and run with it (or review if somebody else would like to take it on!)
Noted by @wgtmac in #45459
Component(s)
Parquet, C++