-
Notifications
You must be signed in to change notification settings - Fork 3.8k
GH-47030: [C++][Parquet] Add setting to limit the number of rows written per page #47090
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
|
Please check if this is the right direction. @pitrou @mapleFU @adamreeve BTW, some existing test cases will break if I switch the default value to limit 20,000 rows per page. I'm not sure if it is a good idea to use 20,000 as the default value to surprise the downstream. |
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This approach looks correct to me thanks @wgtmac.
I'm not sure if it is a good idea to use 20,000 as the default value to surprise the downstream.
A default of 100k would still change behaviour though, and most of the time result in smaller pages being written. I think it probably makes sense to use 20k to align with Java and Rust, but it could be done as a separate PR if there are a lot of test changes needed.
I don't imagine this should break any downstream code, but we'd definitely want to call it out in the release notes as something for users to be aware of.
I should also mention that #47032 touches the same part of the code. It looks like the fix from that PR can easily be ported to this new code though. |
…s written per page
@@ -293,6 +295,7 @@ class PARQUET_EXPORT WriterProperties { | |||
write_batch_size_(DEFAULT_WRITE_BATCH_SIZE), | |||
max_row_group_length_(DEFAULT_MAX_ROW_GROUP_LENGTH), | |||
pagesize_(kDefaultDataPageSize), | |||
max_rows_per_page_(kDefaultMaxRowsPerPage), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So if we don't want to limit this, setting this to int64_t max is ok?
@@ -155,6 +155,8 @@ class PARQUET_EXPORT ReaderProperties { | |||
ReaderProperties PARQUET_EXPORT default_reader_properties(); | |||
|
|||
static constexpr int64_t kDefaultDataPageSize = 1024 * 1024; | |||
/// FIXME: Switch the default value to 20000 will break UTs. | |||
static constexpr int64_t kDefaultMaxRowsPerPage = 1000000; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
(So it don't follow the same style with other DEFAULT_...
?
} | ||
|
||
bool pages_change_on_record_boundaries, int64_t max_rows_per_page, | ||
const std::function<int64_t()>& curr_page_buffered_rows) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Like Action
, can we use template function rather than std::function<>
, which std::function
is hard to optimize by compiler, and might be nullptr
?
Should this PR be set to draft until it's ready? |
Rationale for this change
Currently only page size is limited. We need to limit number of rows per page too.
What changes are included in this PR?
Add
parquet::WriterProperties::max_rows_per_page(int64_t max_rows)
to limit number of rows per data page.Are these changes tested?
TODO
Are there any user-facing changes?
Yes, users are allowed to set this config value.