
GH-46971: [C++][Parquet] Use temporary buffers when decrypting Parquet data pages #46972


Open
adamreeve wants to merge 4 commits into main from temp_decryption_buffers

Conversation

@adamreeve (Contributor) commented Jul 2, 2025

Rationale for this change

Reduces memory usage required when reading wide, encrypted Parquet files.

What changes are included in this PR?

Changes SerializedPageReader so that it no longer holds a long-lived decryption buffer; instead, a buffer is allocated only when needed and can be freed once the page has been decompressed.
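
For illustration, here is a minimal sketch of the idea (not the actual diff in this PR): the decryption buffer is allocated per page rather than held as a member of the reader. Only the Arrow buffer APIs (`arrow::AllocateResizableBuffer`, `arrow::MemoryPool`) are real; `PageDecryptor` and its methods are hypothetical stand-ins.

```cpp
#include <cstdint>
#include <memory>
#include <utility>

#include "arrow/buffer.h"
#include "arrow/memory_pool.h"
#include "arrow/result.h"
#include "arrow/status.h"

// Hypothetical decryptor interface, standing in for the real Parquet
// decryptor; only the Arrow buffer calls below are actual Arrow APIs.
class PageDecryptor {
 public:
  virtual ~PageDecryptor() = default;
  virtual int32_t PlaintextLength(int32_t ciphertext_len) const = 0;
  virtual int32_t Decrypt(const uint8_t* ciphertext, int32_t ciphertext_len,
                          uint8_t* plaintext) = 0;
};

arrow::Result<std::shared_ptr<arrow::Buffer>> DecryptPage(
    const uint8_t* ciphertext, int32_t ciphertext_len,
    PageDecryptor* decryptor, arrow::MemoryPool* pool) {
  // Allocate a buffer scoped to this call rather than reusing a long-lived
  // member buffer on the page reader, so the memory can be returned to the
  // pool as soon as the decompressed page no longer needs it.
  ARROW_ASSIGN_OR_RAISE(
      std::unique_ptr<arrow::ResizableBuffer> plaintext,
      arrow::AllocateResizableBuffer(
          decryptor->PlaintextLength(ciphertext_len), pool));
  const int32_t actual_len = decryptor->Decrypt(
      ciphertext, ciphertext_len, plaintext->mutable_data());
  ARROW_RETURN_NOT_OK(plaintext->Resize(actual_len, /*shrink_to_fit=*/false));
  return std::shared_ptr<arrow::Buffer>(std::move(plaintext));
}
```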

Are these changes tested?

This is only a performance improvement and doesn't change any behaviour, so it should be covered by existing tests.
The memory improvement has been verified manually (see #46971).

Are there any user-facing changes?

Are performance improvements considered user-facing?

github-actions bot commented Jul 2, 2025

⚠️ GitHub issue #46971 has been automatically assigned in GitHub to PR creator.

@adamreeve adamreeve changed the title GH-46971: [C++][Parquet] Use temporary decryption buffers in Parquet SerialiedPageReader GH-46971: [C++][Parquet] Use temporary buffers when decrypting Parquet data pages Jul 2, 2025
@adamreeve adamreeve force-pushed the temp_decryption_buffers branch from a1a6dd7 to 2fb6895 Compare July 2, 2025 04:42
@pitrou (Member) commented Jul 2, 2025

Do we want to do the same for the decompression buffer? It would also be easier to benchmark, as compressed Parquet files are much more common than encrypted Parquet files.

@adamreeve (Contributor, Author)

I was assuming the decompression buffers remain referenced by the returned record batches so doing the same for those wouldn't help, but I haven't verified that that's true. I'll test that too.

@pitrou (Member) commented Jul 2, 2025

I was assuming the decompression buffers remain referenced by the returned record batches so doing the same for those wouldn't help, but I haven't verified that that's true.

Only in the zero-copy cases, which are quite limited (fixed-width type, PLAIN encoding, no nulls, no encryption). And even then, we probably don't always do zero-copy.

@adamreeve adamreeve force-pushed the temp_decryption_buffers branch from 2fb6895 to 7d639fd Compare July 7, 2025 03:59
@adamreeve (Contributor, Author)

Right, OK, that makes sense. I'm testing with plain-encoded float columns so I hadn't noticed that, but yes, it might be a good idea to also change the decompression buffers then.

I did some rough benchmarks with /usr/bin/time -v and got the following results for my test case (the scenario described in #46971, but reading all row groups), taking the best of three runs:

| | System allocator | mimalloc | jemalloc |
|---|---|---|---|
| Baseline time (s) | 6.92 | 6.89 | 6.68 |
| Time with temp decrypt buffers (s) | 8.23 | 6.00 | 6.42 |
| Time with temp decrypt and decompress buffers (s) | 6.50 | 6.08 | 6.59 |

| | System allocator | mimalloc | jemalloc |
|---|---|---|---|
| Baseline max RSS (MB) | 1,556 | 1,550 | 1,128 |
| Max RSS with temp decrypt buffers (MB) | 894 | 891 | 627 |
| Max RSS with temp decrypt and decompress buffers (MB) | 1,823 | 890 | 629 |

The behaviour with mimalloc and jemalloc looks good, but the results with the system allocator are quite concerning. The max RSS decreases when using temporary decryption buffers, but actually increases quite significantly when also using temporary decompression buffers. I'm not sure why that would be; maybe this causes more memory fragmentation? (C++ heap memory management is not something I know a lot about...) There is also a noticeable slow-down in the temporary decryption buffer case with the system allocator.

Maybe this is acceptable, given most users will be using mimalloc?

I also tested with unencrypted data:

| | System allocator | mimalloc | jemalloc |
|---|---|---|---|
| Baseline time (s) | 4.07 | 4.09 | 3.93 |
| Time with temp decompress buffers (s) | 4.08 | 4.25 | 3.98 |

| | System allocator | mimalloc | jemalloc |
|---|---|---|---|
| Baseline max RSS (MB) | 884 | 895 | 627 |
| Max RSS with temp decompress buffers (MB) | 954 | 913 | 660 |

Based on these benchmarks alone, maybe only the decryption buffers should be temporary. But I've only tested with plain float data. I'll look into testing with more data types and encodings.

@pitrou (Member) commented Jul 7, 2025

@ursabot please benchmark

@ursabot commented Jul 7, 2025

Benchmark runs are scheduled for commit 7d639fd. Watch https://buildkite.com/apache-arrow and https://conbench.ursa.dev for updates. A comment will be posted here when the runs are complete.

@pitrou (Member) commented Jul 7, 2025

By the way, how did you get or generate your test files? A 1 MiB page size sounds rather large.

@adamreeve (Contributor, Author)

I'm using ParquetSharp but this is a wrapper of the C++ Parquet library, and 1 MiB is the default there:

static constexpr int64_t kDefaultDataPageSize = 1024 * 1024;
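
For context, the page size is configurable when writing; a usage sketch of the C++ parquet::WriterProperties builder (the 64 KiB value here is just an example, not a recommendation):

```cpp
#include <memory>

#include "parquet/properties.h"

// Example: override the 1 MiB default data page size when writing.
std::shared_ptr<parquet::WriterProperties> MakeWriterProperties() {
  parquet::WriterProperties::Builder builder;
  builder.data_pagesize(64 * 1024);  // bytes, instead of kDefaultDataPageSize
  return builder.build();
}
```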

@pitrou (Member) commented Jul 7, 2025

I'm using ParquetSharp but this is a wrapper of the C++ Parquet library, and 1 MiB is the default there:

Hmm, I was under the impression that another factor limited the actual page size produced by Parquet C++, but I can't find again what it is. @wgtmac Could you enlighten me here?

@wgtmac (Member) commented Jul 7, 2025

@pitrou Did you mean CDC?

AddDataPage();

@pitrou (Member) commented Jul 7, 2025

@wgtmac No, I mean other parameters.

@adamreeve (Contributor, Author)

1024 * 1024 is also the default max row group size. Maybe for some integer or dictionary encoded data this limit can be hit before the page size?

@adamreeve (Contributor, Author) commented Jul 7, 2025

Ah I think you're thinking of the write_batch_size parameter that's used by the Arrow API. This is a number of rows and defaults to 1024. I used the column writer based API rather than the Arrow API though.

@adamreeve (Contributor, Author)

Hmm, actually it looks like it shouldn't be specific to the Arrow API; I'll check what's happening there.

@pitrou (Member) commented Jul 7, 2025

I had already started a discussion on the Parquet dev ML about this: https://lists.apache.org/thread/vsxmbvnx9gy5414cfo25mnwcj17h1xyp

I do think we should revisit this default page size constant, even if in some cases other factors make it smaller.

@ursabot

Thanks for your patience. Conbench analyzed the 4 benchmarking runs that have been run so far on PR commit 7d639fd.

There were 50 benchmark results indicating a performance regression:

The full Conbench report has more details.

@pitrou (Member) commented Jul 7, 2025

There doesn't seem to be any related regression on the benchmarks.
I've also run this PR locally on a couple Parquet files I have lying around, and could not see any concerning performance drop.

@adamreeve (Contributor, Author)

I looked a bit closer at the write_batch_size parameter. It doesn't actually control how many values are written to a page, just how many are written at once before checking whether the page has reached the configured size limit and, if so, writing out the page. From that mailing list thread, it sounds like other implementations have a byte-based limit and a separate row-count limit, but it doesn't look like there's a row limit in the C++ implementation.
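
To make that concrete, here is a rough sketch of my reading of the write loop. It is illustrative only, not the real ColumnWriter code:

```cpp
#include <algorithm>
#include <cstdint>

// Rough sketch of the batching logic described above; SketchColumnWriter is
// an illustrative stand-in, not the real parquet::ColumnWriter.
class SketchColumnWriter {
 public:
  SketchColumnWriter(int64_t write_batch_size, int64_t data_page_size)
      : write_batch_size_(write_batch_size), data_page_size_(data_page_size) {}

  // Write num_values fixed-width (4-byte) values to the column chunk.
  void WriteColumn(int64_t num_values) {
    int64_t remaining = num_values;
    while (remaining > 0) {
      // Values are encoded write_batch_size at a time...
      const int64_t batch = std::min(write_batch_size_, remaining);
      current_page_bytes_ += batch * 4;  // pretend each value encodes to 4 bytes
      remaining -= batch;
      // ...and only between batches is the accumulated page size compared
      // against the byte limit, so a page can overshoot data_page_size_ by
      // up to one batch of encoded data, and there is no row-count cap.
      if (current_page_bytes_ >= data_page_size_) {
        FlushPage();
      }
    }
    if (current_page_bytes_ > 0) FlushPage();
  }

 private:
  void FlushPage() { current_page_bytes_ = 0; }

  int64_t write_batch_size_;
  int64_t data_page_size_;
  int64_t current_page_bytes_ = 0;
};
```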

@wgtmac (Member) commented Jul 8, 2025

FTR, we have max_row_group_length to control the number of rows in a single row group.
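
For example, a usage sketch of that existing property (the values are arbitrary):

```cpp
#include <memory>

#include "parquet/properties.h"

// Example: cap row groups at 100,000 rows; the data page size limit
// itself remains byte-based (1 MiB by default).
std::shared_ptr<parquet::WriterProperties> MakeProps() {
  parquet::WriterProperties::Builder builder;
  builder.max_row_group_length(100000);
  return builder.build();
}
```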

@pitrou (Member) commented Jul 8, 2025

Ok, I've opened #47030

@adamreeve (Contributor, Author)

I benchmarked unencrypted int32 data with nulls, using a non-plain encoding (delta binary packed); otherwise the data layout is the same as in my previous tests. Making the decompression buffers temporary decreases the max RSS with the system allocator, which is a bit surprising to me, but there is a slight increase in RSS and time taken with mimalloc and jemalloc.

| | System allocator | mimalloc | jemalloc |
|---|---|---|---|
| Baseline time (s) | 10.82 | 11.02 | 10.62 |
| Time with temp decompress buffers (s) | 10.93 | 11.53 | 10.96 |

| | System allocator | mimalloc | jemalloc |
|---|---|---|---|
| Baseline max RSS (MB) | 1,235 | 1,047 | 891 |
| Max RSS with temp decompress buffers (MB) | 1,065 | 1,085 | 902 |

I also looked at the memory allocations with massif: the peak heap size is exactly the same and is still dominated by the decompression buffers. Although my earlier comment about them being referenced by the record batches isn't correct, the page buffers are still held in memory by the column reader, and batches of data are then decoded incrementally into Arrow arrays.

So I don't think there's much reason to make the decompression buffers temporary, and performance is generally a bit better if only the decryption buffers are temporary. I'm going to revert this PR back to only changing the decryption buffers.

@adamreeve adamreeve force-pushed the temp_decryption_buffers branch from 7d639fd to 91477fb Compare July 9, 2025 04:57
@adamreeve adamreeve marked this pull request as ready for review July 9, 2025 05:33
@adamreeve adamreeve requested a review from wgtmac as a code owner July 9, 2025 05:33
@wgtmac (Member) commented Jul 10, 2025

Ah I think you're thinking of the write_batch_size parameter that's used by the Arrow API. This is a number of rows and defaults to 1024. I used the column writer based API rather than the Arrow API though.

I just realized that a large properties_->write_batch_size() makes it difficult to precisely split data pages based on properties_->data_pagesize(). To implement #47030, we have to adjust the batch size to satisfy the new properties_->max_rows_per_data_page(). Perhaps we need to slightly change the meaning of properties_->write_batch_size() to be the maximum number of values in a batch to write to a ColumnWriter. Does it make sense? @adamreeve @pitrou
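
To illustrate the proposed reinterpretation (hypothetical, since properties_->max_rows_per_data_page() from #47030 does not exist yet): write_batch_size would become an upper bound on each batch, clamped by how many rows the current page can still accept.

```cpp
#include <algorithm>
#include <cstdint>

// Hypothetical sketch of the batch-size clamping described above;
// max_rows_per_data_page is the property proposed in GH-47030 and the
// other names are illustrative, not real Parquet C++ API.
int64_t NextBatchSize(int64_t remaining_values, int64_t write_batch_size,
                      int64_t rows_in_current_page,
                      int64_t max_rows_per_data_page) {
  // write_batch_size becomes an upper bound on the batch handed to the
  // ColumnWriter, further limited by the rows the current page may still take.
  const int64_t page_room = max_rows_per_data_page - rows_in_current_page;
  return std::min({remaining_values, write_batch_size, page_room});
}
```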

@pitrou (Member) commented Jul 10, 2025

Perhaps we need to slightly change the meaning of properties_->write_batch_size() to be the maximum number of values in a batch to write to a ColumnWriter. Does it make sense? @adamreeve @pitrou

It definitely makes sense to me. I think that's how the Rust and Java implementations use it.

(also, shouldn't it be discussed in #47030 ?)
