Skip to content

GH-46205: [C++][Parquet][WIP] Read/Write null count statistics for UNKNOWN sort order #46275

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Draft
wants to merge 2 commits into
base: main
Choose a base branch
from

Conversation

paleolimbot
Copy link
Member

@paleolimbot paleolimbot commented Apr 30, 2025

Rationale for this change

Minimum and maximum values are not useful in the context of an unsorted converted or loigcal type; however, null counts are! Geometry is one example of an unsorted type where this type of information might be useful; however, there are others as well!

What changes are included in this PR?

Early work-in-progress explorations to find the relevant pieces of code.

Are these changes tested?

They will be once a general direction is decided on!

Are there any user-facing changes?

Possibly!

Copy link

⚠️ GitHub issue #46205 has been automatically assigned in GitHub to PR creator.

Comment on lines +1256 to +1257
page_statistics_ = MakeStatistics<ParquetType>(descr_, allocator_);
chunk_statistics_ = MakeStatistics<ParquetType>(descr_, allocator_);
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The actual version of this should maybe modify the if (SortOrder::UNKNOWN != descr_->sort_order()) { check just above. Perhaps there needs to be a descr_->can_write_statistics() to separate the sortedness from whether or not we can write anything?

@github-actions github-actions bot added awaiting changes Awaiting changes and removed awaiting committer review Awaiting committer review labels Apr 30, 2025
Comment on lines +310 to +313
bool is_geometry =
descr_->logical_type() != nullptr && descr_->logical_type()->is_geometry();
if (!column_metadata_->__isset.statistics ||
descr_->sort_order() == SortOrder::UNKNOWN) {
(descr_->sort_order() == SortOrder::UNKNOWN && !is_geometry)) {
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Again, maybe we need a descr_->can_read_statistics? There's a HasCorrectStatistics(), too, and maybe the check needs to be there. I wonder whether the types currently marked as unsorted had null counts written reliably by other implementations or whether we have to ignore those?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

According to the spec, only INT96 primitive type and INTERVAL logical type have undefined column order (except complex types like map, list, variant, and geo types).

However, I think parquet-java does not cleanly implement the column order semantics.

Comment on lines +966 to +967
template <typename DType>
class UnsortedTypedStatisticsImpl : public TypedStatistics<DType> {
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure this is the answer here...I just needed a TypedStatistics<> to make this work in the ColumnWriter

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Perhaps we can reuse the existing typed one and just ignore the stats on write? (Seems inefficient but may be more compact?)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, the Java impl did this because the spec advises that min/max values are undefined and should not be used in this case. If we go with this approach, perhaps we need to disable page index (at least the column index) to reduce file size.


int64_t null_count() const override { return null_count_; }

int64_t distinct_count() const override { return num_values_; }
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should we throw or return -1?

Comment on lines +966 to +967
template <typename DType>
class UnsortedTypedStatisticsImpl : public TypedStatistics<DType> {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, the Java impl did this because the spec advises that min/max values are undefined and should not be used in this case. If we go with this approach, perhaps we need to disable page index (at least the column index) to reduce file size.

Comment on lines +310 to +313
bool is_geometry =
descr_->logical_type() != nullptr && descr_->logical_type()->is_geometry();
if (!column_metadata_->__isset.statistics ||
descr_->sort_order() == SortOrder::UNKNOWN) {
(descr_->sort_order() == SortOrder::UNKNOWN && !is_geometry)) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

According to the spec, only INT96 primitive type and INTERVAL logical type have undefined column order (except complex types like map, list, variant, and geo types).

However, I think parquet-java does not cleanly implement the column order semantics.

// Unknown sort order has incorrect stats
if (SortOrder::UNKNOWN == sort_order) {
// Unknown sort order has incorrect stats if the min or the max are specified
if (SortOrder::UNKNOWN == sort_order && (statistics.has_min || statistics.has_max)) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should we just check geometry & geography types here?

Copy link
Member

@mapleFU mapleFU left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

UnsortedTypedStatisticsImpl general LGTM. But can we use same way to create UnsortedTypedStatisticsImpl? I found some code checks logical_type, other case just ignore and checks min-max

Besides, I think for other types without min-max (maybe like json or other logical types?) This is also usable?


EncodedStatistics Encode() override {
EncodedStatistics out;
out.set_null_count(null_count_);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we need all_null_value here?

@paleolimbot
Copy link
Member Author

Thanks for these! I'm mostly away from keyboard this week but will circle back to clean this up on Monday.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants