
GH-47030: [C++][Parquet] Add setting to limit the number of rows written per page #47090


Open
wants to merge 1 commit into
base: main
from

Conversation

wgtmac
Member

@wgtmac wgtmac commented Jul 12, 2025

Rationale for this change

Currently, only the page size is limited. We also need to limit the number of rows per page.

What changes are included in this PR?

Add parquet::WriterProperties::max_rows_per_page(int64_t max_rows) to limit the number of rows per data page.
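
To illustrate how a row limit interacts with the existing byte limit, here is a minimal, self-contained sketch. It is not the Arrow implementation: `SplitIntoPages` and all of its parameters are hypothetical names, and it assumes fixed-width rows and a writer that closes a page exactly when either limit is reached.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <vector>

// Hypothetical stand-in for a writer's page-flushing decision (not Arrow code).
// Returns the row count of each page produced when `total_rows` rows of
// `bytes_per_row` bytes each are written under a byte limit and a row limit.
std::vector<int64_t> SplitIntoPages(int64_t total_rows, int64_t bytes_per_row,
                                    int64_t max_page_bytes,
                                    int64_t max_rows_per_page) {
  assert(bytes_per_row > 0 && max_page_bytes > 0 && max_rows_per_page > 0);
  // Rows that fit under the byte limit; a page always holds at least one row.
  const int64_t rows_by_bytes =
      std::max<int64_t>(1, max_page_bytes / bytes_per_row);
  // The stricter of the two limits decides when a page is closed.
  const int64_t rows_per_page = std::min(rows_by_bytes, max_rows_per_page);

  std::vector<int64_t> pages;
  for (int64_t remaining = total_rows; remaining > 0;
       remaining -= rows_per_page) {
    pages.push_back(std::min(remaining, rows_per_page));
  }
  return pages;
}
```

Under these assumptions, with a 1 MiB byte limit and 4-byte rows, a 20,000-row cap splits 50,000 rows into pages of 20,000, 20,000, and 10,000 rows, while passing `INT64_MAX` as the row cap leaves only the byte limit in effect and yields a single page.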

Are these changes tested?

TODO

Are there any user-facing changes?

Yes, users are allowed to set this config value.


⚠️ GitHub issue #40730 has been automatically assigned in GitHub to PR creator.

@wgtmac
Member Author

wgtmac commented Jul 12, 2025

Please check if this is the right direction. @pitrou @mapleFU @adamreeve

BTW, some existing test cases will break if I switch the default value to limit 20,000 rows per page. I'm not sure it is a good idea to use 20,000 as the default value, since it may surprise downstream users.

@wgtmac wgtmac changed the title GH-40730: [C++][Parquet] Add setting to limit the number of rows written per page GH-47030: [C++][Parquet] Add setting to limit the number of rows written per page Jul 12, 2025

⚠️ GitHub issue #47030 has been automatically assigned in GitHub to PR creator.

Contributor

@adamreeve adamreeve left a comment

This approach looks correct to me, thanks @wgtmac.

I'm not sure it is a good idea to use 20,000 as the default value, since it may surprise downstream users.

A default of 100k would still change behaviour though, and most of the time result in smaller pages being written. I think it probably makes sense to use 20k to align with Java and Rust, but it could be done as a separate PR if there are a lot of test changes needed.

I don't imagine this should break any downstream code, but we'd definitely want to call it out in the release notes as something for users to be aware of.

@github-actions github-actions bot added awaiting committer review Awaiting committer review and removed awaiting review Awaiting review labels Jul 14, 2025
@adamreeve
Contributor

I should also mention that #47032 touches the same part of the code. It looks like the fix from that PR can easily be ported to this new code though.

@mapleFU mapleFU self-requested a review July 15, 2025 07:42
@@ -293,6 +295,7 @@ class PARQUET_EXPORT WriterProperties {
write_batch_size_(DEFAULT_WRITE_BATCH_SIZE),
max_row_group_length_(DEFAULT_MAX_ROW_GROUP_LENGTH),
pagesize_(kDefaultDataPageSize),
max_rows_per_page_(kDefaultMaxRowsPerPage),
Member

So if we don't want to limit this, setting this to int64_t max is ok?

@@ -155,6 +155,8 @@ class PARQUET_EXPORT ReaderProperties {
ReaderProperties PARQUET_EXPORT default_reader_properties();

static constexpr int64_t kDefaultDataPageSize = 1024 * 1024;
/// FIXME: Switching the default value to 20000 will break unit tests.
static constexpr int64_t kDefaultMaxRowsPerPage = 1000000;
Member

So this doesn't follow the same naming style as the other DEFAULT_... constants?

}

bool pages_change_on_record_boundaries, int64_t max_rows_per_page,
const std::function<int64_t()>& curr_page_buffered_rows) {
Member

Like Action, can we use a template function rather than std::function<>? std::function is hard for the compiler to optimize, and it might be empty (nullptr).
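
The trade-off raised here can be sketched as follows. This is only an illustration, not the PR's code: `RowsLeftErased` and `RowsLeft` are hypothetical helper names, and the subtraction stands in for whatever the real function does with the buffered row count.

```cpp
#include <cassert>
#include <cstdint>
#include <functional>

// Hypothetical helper taking a type-erased callback. The call goes through
// std::function's indirection, and an empty std::function would throw
// std::bad_function_call when invoked, so it is checked first.
inline int64_t RowsLeftErased(int64_t max_rows_per_page,
                              const std::function<int64_t()>& buffered_rows) {
  assert(static_cast<bool>(buffered_rows));
  return max_rows_per_page - buffered_rows();
}

// Hypothetical helper taking the callable as a template parameter. The
// concrete callable type is known at compile time, which makes inlining
// straightforward, and a lambda argument has no empty state to check.
template <typename BufferedRowsFn>
int64_t RowsLeft(int64_t max_rows_per_page, BufferedRowsFn&& buffered_rows) {
  return max_rows_per_page - buffered_rows();
}
```

Both versions accept the same lambda at the call site, for example `RowsLeft(20, [&] { return buffered; })`, so switching between them does not change how callers are written.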

@pitrou
Member

pitrou commented Jul 15, 2025

Should this PR be set to draft until it's ready?


4 participants