-
Notifications
You must be signed in to change notification settings - Fork 3.8k
GH-46205: [C++][Parquet][WIP] Read/Write null count statistics for UNKNOWN sort order #46275
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
GH-46205: [C++][Parquet][WIP] Read/Write null count statistics for UNKNOWN sort order #46275
Conversation
|
page_statistics_ = MakeStatistics<ParquetType>(descr_, allocator_); | ||
chunk_statistics_ = MakeStatistics<ParquetType>(descr_, allocator_); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The actual version of this should maybe modify the if (SortOrder::UNKNOWN != descr_->sort_order()) {
check just above. Perhaps there needs to be a descr_->can_write_statistics()
to separate the sortedness from whether or not we can write anything?
bool is_geometry = | ||
descr_->logical_type() != nullptr && descr_->logical_type()->is_geometry(); | ||
if (!column_metadata_->__isset.statistics || | ||
descr_->sort_order() == SortOrder::UNKNOWN) { | ||
(descr_->sort_order() == SortOrder::UNKNOWN && !is_geometry)) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Again, maybe we need a descr_->can_read_statistics
? There's a HasCorrectStatistics()
, too, and maybe the check needs to be there. I wonder whether the types currently marked as unsorted had null counts written reliably by other implementations or whether we have to ignore those?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
According to the spec, only INT96
primitive type and INTERVAL
logical type have undefined
column order (except complex types like map
, list
, variant
, and geo types).
- INT96: use
BINARY_AS_SIGNED_INTEGER_COMPARATOR
: https://github.com/apache/parquet-java/blob/5f079b98e63c814535e8709ab5c6fb672c2aedc5/parquet-column/src/main/java/org/apache/parquet/schema/PrimitiveType.java#L366 - INTERVAL: use
UNSIGNED_LEXICOGRAPHICAL_BINARY_COMPARATOR
:https://github.com/apache/parquet-java/blob/5f079b98e63c814535e8709ab5c6fb672c2aedc5/parquet-column/src/main/java/org/apache/parquet/schema/PrimitiveType.java#L406
However, I think parquet-java does not cleanly implement the column order semantics.
template <typename DType> | ||
class UnsortedTypedStatisticsImpl : public TypedStatistics<DType> { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not sure this is the answer here...I just needed a TypedStatistics<>
to make this work in the ColumnWriter
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Perhaps we can reuse the existing typed one and just ignore the stats on write? (Seems inefficient but may be more compact?)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, the Java impl did this because the spec advises that min/max values are undefined and should not be used in this case. If we go with this approach, perhaps we need to disable page index (at least the column index) to reduce file size.
|
||
int64_t null_count() const override { return null_count_; } | ||
|
||
int64_t distinct_count() const override { return num_values_; } |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should we throw or return -1?
template <typename DType> | ||
class UnsortedTypedStatisticsImpl : public TypedStatistics<DType> { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, the Java impl did this because the spec advises that min/max values are undefined and should not be used in this case. If we go with this approach, perhaps we need to disable page index (at least the column index) to reduce file size.
bool is_geometry = | ||
descr_->logical_type() != nullptr && descr_->logical_type()->is_geometry(); | ||
if (!column_metadata_->__isset.statistics || | ||
descr_->sort_order() == SortOrder::UNKNOWN) { | ||
(descr_->sort_order() == SortOrder::UNKNOWN && !is_geometry)) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
According to the spec, only INT96
primitive type and INTERVAL
logical type have undefined
column order (except complex types like map
, list
, variant
, and geo types).
- INT96: use
BINARY_AS_SIGNED_INTEGER_COMPARATOR
: https://github.com/apache/parquet-java/blob/5f079b98e63c814535e8709ab5c6fb672c2aedc5/parquet-column/src/main/java/org/apache/parquet/schema/PrimitiveType.java#L366 - INTERVAL: use
UNSIGNED_LEXICOGRAPHICAL_BINARY_COMPARATOR
:https://github.com/apache/parquet-java/blob/5f079b98e63c814535e8709ab5c6fb672c2aedc5/parquet-column/src/main/java/org/apache/parquet/schema/PrimitiveType.java#L406
However, I think parquet-java does not cleanly implement the column order semantics.
// Unknown sort order has incorrect stats | ||
if (SortOrder::UNKNOWN == sort_order) { | ||
// Unknown sort order has incorrect stats if the min or the max are specified | ||
if (SortOrder::UNKNOWN == sort_order && (statistics.has_min || statistics.has_max)) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should we just check geometry & geography types here?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
UnsortedTypedStatisticsImpl
general LGTM. But can we use same way to create UnsortedTypedStatisticsImpl
? I found some code checks logical_type, other case just ignore and checks min-max
Besides, I think for other types without min-max (maybe like json or other logical types?) This is also usable?
|
||
EncodedStatistics Encode() override { | ||
EncodedStatistics out; | ||
out.set_null_count(null_count_); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do we need all_null_value
here?
Thanks for these! I'm mostly away from keyboard this week but will circle back to clean this up on Monday. |
Rationale for this change
Minimum and maximum values are not useful in the context of an unsorted converted or loigcal type; however, null counts are! Geometry is one example of an unsorted type where this type of information might be useful; however, there are others as well!
What changes are included in this PR?
Early work-in-progress explorations to find the relevant pieces of code.
Are these changes tested?
They will be once a general direction is decided on!
Are there any user-facing changes?
Possibly!