GH-46205: [C++][Parquet][WIP] Read/Write null count statistics for UNKNOWN sort order #46275
page_statistics_ = MakeStatistics<ParquetType>(descr_, allocator_);
chunk_statistics_ = MakeStatistics<ParquetType>(descr_, allocator_);
The actual version of this should maybe modify the `if (SortOrder::UNKNOWN != descr_->sort_order()) {` check just above. Perhaps there needs to be a `descr_->can_write_statistics()` to separate the sortedness from whether or not we can write anything?
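One way to picture the separation suggested in this comment is the minimal sketch below. It does not use the real parquet-cpp classes: `ColumnDescriptorSketch`, `StatisticsPolicy`, `statistics_enabled`, and `statistics_policy()` are hypothetical names invented for illustration, and `can_write_statistics()` is only the accessor proposed above, not an existing API. The idea is that "may we write any statistics?" and "is min/max meaningful?" become two separate questions.

```cpp
#include <cassert>

// Hypothetical stand-ins for illustration; not the real parquet-cpp types.
enum class SortOrder { SIGNED, UNSIGNED, UNKNOWN };

struct StatisticsPolicy {
  bool write_null_count;  // null counts do not depend on any ordering
  bool write_min_max;     // min/max only make sense with a defined sort order
};

struct ColumnDescriptorSketch {
  SortOrder sort_order;
  bool statistics_enabled;  // assumed writer-side switch

  // Accessor proposed in the review comment (hypothetical): answers only
  // "can anything be written?", independent of sortedness.
  bool can_write_statistics() const { return statistics_enabled; }

  // Sortedness is consulted separately, and only for min/max.
  StatisticsPolicy statistics_policy() const {
    const bool any = can_write_statistics();
    return StatisticsPolicy{any, any && sort_order != SortOrder::UNKNOWN};
  }
};

// Small usage check: an unsorted column keeps null counts but drops min/max.
inline void CheckWritePolicy() {
  const ColumnDescriptorSketch geometry{SortOrder::UNKNOWN, true};
  assert(geometry.can_write_statistics());
  assert(geometry.statistics_policy().write_null_count);
  assert(!geometry.statistics_policy().write_min_max);
}
```

With this shape the existing `SortOrder::UNKNOWN` check would no longer decide whether a statistics object is created at all, only whether min/max are populated.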
+ bool is_geometry =
+     descr_->logical_type() != nullptr && descr_->logical_type()->is_geometry();
  if (!column_metadata_->__isset.statistics ||
-     descr_->sort_order() == SortOrder::UNKNOWN) {
+     (descr_->sort_order() == SortOrder::UNKNOWN && !is_geometry)) {
Again, maybe we need a `descr_->can_read_statistics`? There's a `HasCorrectStatistics()`, too, and maybe the check needs to be there. I wonder whether the types currently marked as unsorted had null counts written reliably by other implementations or whether we have to ignore those?
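The read-side question raised here can be sketched in the same spirit. Everything in the block is an assumption made for illustration: `RawStatistics`, `ReadableStatistics`, and `ReadStatistics` are invented names, and `trust_unsorted_null_counts` is a hypothetical knob standing in for whatever decision is reached about null counts written by other implementations. It is not the existing reader logic.

```cpp
#include <cassert>
#include <cstdint>
#include <optional>

// Hypothetical stand-ins for illustration; not the real parquet-cpp types.
enum class SortOrder { SIGNED, UNSIGNED, UNKNOWN };

struct RawStatistics {  // loosely mirrors the serialized optional fields
  bool has_null_count = false;
  int64_t null_count = 0;
  bool has_min_max = false;
};

struct ReadableStatistics {
  std::optional<int64_t> null_count;
  bool min_max_usable = false;
};

// Decide which statistics fields a reader should expose.
// trust_unsorted_null_counts is an assumed policy flag: false reproduces
// "ignore all statistics for UNKNOWN sort order".
inline std::optional<ReadableStatistics> ReadStatistics(
    const std::optional<RawStatistics>& raw, SortOrder order,
    bool trust_unsorted_null_counts) {
  if (!raw.has_value()) return std::nullopt;  // nothing was written
  const bool sorted = order != SortOrder::UNKNOWN;
  if (!sorted && !trust_unsorted_null_counts) return std::nullopt;
  ReadableStatistics out;
  if (raw->has_null_count) out.null_count = raw->null_count;
  out.min_max_usable = sorted && raw->has_min_max;  // never for UNKNOWN
  return out;
}

// Small usage check: an unsorted column exposes null_count only.
inline void CheckReadPolicy() {
  RawStatistics raw;
  raw.has_null_count = true;
  raw.null_count = 3;
  raw.has_min_max = true;
  const auto stats = ReadStatistics(raw, SortOrder::UNKNOWN, true);
  assert(stats.has_value());
  assert(stats->null_count.has_value() && *stats->null_count == 3);
  assert(!stats->min_max_usable);
}
```

Keeping the trust decision as a single flag makes it easy to stay conservative by default if null counts from older writers turn out to be unreliable.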
template <typename DType>
class UnsortedTypedStatisticsImpl : public TypedStatistics<DType> {
I'm not sure this is the answer here... I just needed a `TypedStatistics<>` to make this work in the ColumnWriter.
Perhaps we can reuse the existing typed one and just ignore the stats on write? (Seems inefficient but may be more compact?)
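A rough sketch of that alternative, assuming the statistics are computed as usual and min/max are simply dropped at encode time when the sort order is unknown. `EncodedStatisticsSketch`, `StripMinMax`, and `EncodeForWrite` are hypothetical names for illustration and do not correspond to existing parquet-cpp functions.

```cpp
#include <cassert>
#include <cstdint>
#include <string>

// Hypothetical, simplified stand-in for an encoded statistics record.
struct EncodedStatisticsSketch {
  bool has_min = false;
  bool has_max = false;
  bool has_null_count = false;
  std::string min;
  std::string max;
  int64_t null_count = 0;
};

// Drop the ordering-dependent fields and keep everything else.
inline EncodedStatisticsSketch StripMinMax(EncodedStatisticsSketch stats) {
  stats.has_min = false;
  stats.has_max = false;
  stats.min.clear();
  stats.max.clear();
  return stats;
}

// The existing typed statistics still do the min/max work (the inefficiency
// mentioned above); the result is discarded only when serializing.
inline EncodedStatisticsSketch EncodeForWrite(
    const EncodedStatisticsSketch& computed, bool sort_order_known) {
  return sort_order_known ? computed : StripMinMax(computed);
}

// Small usage check: null_count survives, min/max do not.
inline void CheckEncodeForWrite() {
  EncodedStatisticsSketch computed;
  computed.has_min = computed.has_max = computed.has_null_count = true;
  computed.min = "a";
  computed.max = "z";
  computed.null_count = 5;
  const EncodedStatisticsSketch written = EncodeForWrite(computed, false);
  assert(!written.has_min && !written.has_max);
  assert(written.min.empty() && written.max.empty());
  assert(written.has_null_count && written.null_count == 5);
}
```

The trade-off is the one noted in the comment: no new statistics class is needed, at the cost of computing min/max values that are never written.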
Rationale for this change
Minimum and maximum values are not useful in the context of an unsorted converted or logical type; however, null counts are! Geometry is one example of an unsorted type where this type of information might be useful, and there are others as well!
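As an illustration of why a null count is useful even without min/max, the sketch below prunes row groups for `IS NULL` / `IS NOT NULL` predicates. It assumes a flat (non-nested) column where `row_count` is the total number of rows in the row group; `NullPredicate` and `CanSkipRowGroup` are hypothetical names, not part of any existing API.

```cpp
#include <cassert>
#include <cstdint>
#include <optional>

enum class NullPredicate { IS_NULL, IS_NOT_NULL };

// Returns true when the null count alone proves that no row in the row
// group can match the predicate. A missing null count proves nothing.
inline bool CanSkipRowGroup(std::optional<int64_t> null_count,
                            int64_t row_count, NullPredicate pred) {
  if (!null_count.has_value()) return false;
  assert(*null_count >= 0 && *null_count <= row_count);
  switch (pred) {
    case NullPredicate::IS_NULL:
      return *null_count == 0;  // no nulls, so IS NULL matches nothing
    case NullPredicate::IS_NOT_NULL:
      return *null_count == row_count;  // all nulls, nothing is non-null
  }
  return false;
}
```

For a geometry column with 100 rows and a null count of 100, `CanSkipRowGroup(100, 100, NullPredicate::IS_NOT_NULL)` is true, so a reader could skip the row group without decoding any values.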
What changes are included in this PR?
Early work-in-progress explorations to find the relevant pieces of code.
Are these changes tested?
They will be once a general direction is decided on!
Are there any user-facing changes?
Possibly!