GH-46205: [C++][Parquet][WIP] Read/Write null count statistics for UNKNOWN sort order #46275

Draft: wants to merge 2 commits into base: main

@paleolimbot (Member) commented Apr 30, 2025

Rationale for this change

Minimum and maximum values are not useful for a converted or logical type without a defined sort order; however, null counts are! Geometry is one example of an unsorted type where null counts might be useful, and there are others as well.

What changes are included in this PR?

Early work-in-progress explorations to find the relevant pieces of code.

Are these changes tested?

They will be once a general direction is decided on!

Are there any user-facing changes?

Possibly!

⚠️ GitHub issue #46205 has been automatically assigned in GitHub to PR creator.

Comment on lines +1256 to +1257
page_statistics_ = MakeStatistics<ParquetType>(descr_, allocator_);
chunk_statistics_ = MakeStatistics<ParquetType>(descr_, allocator_);
Member Author

The actual version of this should maybe modify the if (SortOrder::UNKNOWN != descr_->sort_order()) { check just above. Perhaps there needs to be a descr_->can_write_statistics() to separate the sortedness from whether or not we can write anything?
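
To make the idea concrete, here is a minimal sketch of separating "is this type sorted?" from "can we write anything at all?". Everything in it is an assumption for illustration: `WritableStats`, `WritableStatsFor`, and the local `SortOrder` enum are hypothetical stand-ins, not the existing Parquet C++ API, and `can_write_statistics()` does not exist yet.

```cpp
#include <cassert>

// Hypothetical stand-in for the Parquet sort order enum (not the real header).
enum class SortOrder { SIGNED, UNSIGNED, UNKNOWN };

// Which pieces of statistics a writer would be allowed to emit.
struct WritableStats {
  bool null_count;
  bool min_max;
};

// Sketch: null counts depend only on whether statistics are enabled,
// while min/max additionally require a defined sort order.
inline WritableStats WritableStatsFor(SortOrder order, bool statistics_enabled) {
  WritableStats out;
  out.null_count = statistics_enabled;
  out.min_max = statistics_enabled && order != SortOrder::UNKNOWN;
  return out;
}
```

With a split like this, the existing `SortOrder::UNKNOWN` check would only gate min/max, and the statistics objects could still be created for unsorted columns.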

@github-actions bot added the "awaiting changes" label and removed the "awaiting committer review" label on Apr 30, 2025
Comment on lines +310 to +313
+ bool is_geometry =
+     descr_->logical_type() != nullptr && descr_->logical_type()->is_geometry();
  if (!column_metadata_->__isset.statistics ||
-     descr_->sort_order() == SortOrder::UNKNOWN) {
+     (descr_->sort_order() == SortOrder::UNKNOWN && !is_geometry)) {
Member Author

Again, maybe we need a descr_->can_read_statistics? There's a HasCorrectStatistics(), too, and maybe the check needs to be there. I wonder whether the types currently marked as unsorted had null counts written reliably by other implementations or whether we have to ignore those?
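
A rough sketch of what a read-side check could look like, assuming we end up with a switch for whether null counts from unsorted columns are trusted. `StoredStats`, `ReadableStats`, and `trust_null_count_when_unsorted` are hypothetical names for illustration only; the real check would live wherever `HasCorrectStatistics()` or a new `can_read_statistics()` ends up.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical stand-in for the Parquet sort order enum (not the real header).
enum class SortOrder { SIGNED, UNSIGNED, UNKNOWN };

// Hypothetical mirror of what the file metadata says is present.
struct StoredStats {
  bool has_null_count = false;
  int64_t null_count = 0;
  bool has_min_max = false;
};

// Sketch: return only the statistics a reader should expose.
// Sorted columns pass through unchanged; unsorted columns never expose
// min/max and expose the null count only if the caller trusts it.
inline StoredStats ReadableStats(const StoredStats& stored, SortOrder order,
                                 bool trust_null_count_when_unsorted) {
  if (order != SortOrder::UNKNOWN) return stored;
  StoredStats out;  // min/max stay absent for unsorted columns
  if (trust_null_count_when_unsorted && stored.has_null_count) {
    out.has_null_count = true;
    out.null_count = stored.null_count;
  }
  return out;
}
```

The boolean is where the open question lands: if other implementations did not write null counts reliably for unsorted types, it would have to default to false or depend on the writer version.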

Comment on lines +966 to +967
template <typename DType>
class UnsortedTypedStatisticsImpl : public TypedStatistics<DType> {
Member Author

I'm not sure this is the answer here...I just needed a TypedStatistics<> to make this work in the ColumnWriter
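
For reference, the behaviour I needed boils down to something like the following standalone sketch. It is not the `UnsortedTypedStatisticsImpl` from the diff and does not implement the real `TypedStatistics<>` interface; `NullCountOnlyStatistics` and its methods are hypothetical and only show page-level counts being merged into chunk-level counts without any min/max tracking.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical accumulator that tracks counts but never min/max.
class NullCountOnlyStatistics {
 public:
  // Record a batch of values: how many were non-null and how many were null.
  void Update(int64_t num_non_null, int64_t num_null) {
    num_values_ += num_non_null;
    null_count_ += num_null;
  }

  // Fold another accumulator (e.g. a finished page) into this one (e.g. the chunk).
  void Merge(const NullCountOnlyStatistics& other) {
    num_values_ += other.num_values_;
    null_count_ += other.null_count_;
  }

  // Clear the counts, as a writer would do when starting a new page.
  void Reset() {
    num_values_ = 0;
    null_count_ = 0;
  }

  // There is no ordering, so min/max are never available.
  bool HasMinMax() const { return false; }

  int64_t null_count() const { return null_count_; }
  int64_t num_values() const { return num_values_; }

 private:
  int64_t num_values_ = 0;
  int64_t null_count_ = 0;
};
```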

Member Author

Perhaps we can reuse the existing typed one and just ignore the stats on write? (Seems inefficient but may be more compact?)
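
A sketch of that alternative, assuming an int64 column for simplicity. `Int64StatsSketch` and `EncodedStatsSketch` are hypothetical and stand in for the existing typed statistics and their encoded form; the point is only that min/max are still computed on every update and then dropped at encode time when the sort order is unknown.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>

// Hypothetical stand-in for the Parquet sort order enum (not the real header).
enum class SortOrder { SIGNED, UNSIGNED, UNKNOWN };

// Hypothetical encoded form: what would actually be written to the file.
struct EncodedStatsSketch {
  bool has_min_max = false;
  int64_t min = 0;
  int64_t max = 0;
  bool has_null_count = false;
  int64_t null_count = 0;
};

// Hypothetical typed statistics that always track min/max and null count.
class Int64StatsSketch {
 public:
  void Update(int64_t value) {
    if (!has_values_) {
      min_ = max_ = value;
      has_values_ = true;
    } else {
      min_ = std::min(min_, value);
      max_ = std::max(max_, value);
    }
  }

  void UpdateNull() { ++null_count_; }

  // Always emit the null count; emit min/max only for a defined sort order.
  EncodedStatsSketch Encode(SortOrder order) const {
    EncodedStatsSketch out;
    out.has_null_count = true;
    out.null_count = null_count_;
    if (order != SortOrder::UNKNOWN && has_values_) {
      out.has_min_max = true;
      out.min = min_;
      out.max = max_;
    }
    return out;
  }

 private:
  bool has_values_ = false;
  int64_t min_ = 0;
  int64_t max_ = 0;
  int64_t null_count_ = 0;
};
```

The cost is the wasted comparisons for unsorted columns; the benefit is no second statistics class to maintain.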
