Support file_size_bytes option #100


Merged
merged 2 commits into from
Apr 7, 2025

Conversation

aykut-bozkurt
Member

@aykut-bozkurt aykut-bozkurt commented Jan 22, 2025

`COPY TO` parquet now supports a new option, `file_size_bytes`, which lets you generate parquet files with a target size of `file_size_bytes`.

When a parquet file exceeds the target size, it is flushed and a new parquet file is generated under a parent directory. (The parent directory is the path without the parquet extension.)

e.g.

```sql
COPY (select 'hellooooo' || i from generate_series(1, 1000000) i) to '/tmp/test.parquet' with (file_size_bytes '1MB');
```

```bash
> ls -alh /tmp/test.parquet/
1.4M data_0.parquet
1.4M data_1.parquet
1.4M data_2.parquet
1.4M data_3.parquet
114K data_4.parquet
```
Closes #107.

@aykut-bozkurt aykut-bozkurt force-pushed the aykut/cache-object-stores branch from 9d90243 to a493016 Compare January 24, 2025 14:46
Base automatically changed from aykut/cache-object-stores to main January 30, 2025 07:22
@aykut-bozkurt aykut-bozkurt force-pushed the aykut/file-size-bytes branch 2 times, most recently from 6428599 to a857cb6 Compare January 30, 2025 08:22

codecov bot commented Jan 30, 2025

Codecov Report

Attention: Patch coverage is 95.71734% with 20 lines in your changes missing coverage. Please review.

Project coverage is 92.90%. Comparing base (b626eb4) to head (59da25c).
Report is 1 commit behind head on main.

Files with missing lines Patch % Lines
...c/parquet_copy_hook/copy_to_split_dest_receiver.rs 92.81% 12 Missing ⚠️
src/parquet_copy_hook/copy_utils.rs 89.55% 7 Missing ⚠️
src/parquet_copy_hook/copy_to_dest_receiver.rs 98.41% 1 Missing ⚠️
Additional details and impacted files
```
@@            Coverage Diff             @@
##             main     #100      +/-   ##
==========================================
+ Coverage   92.43%   92.90%   +0.46%     
==========================================
  Files          85       86       +1     
  Lines       11288    11650     +362     
==========================================
+ Hits        10434    10823     +389     
+ Misses        854      827      -27     
```


```
@@ -1194,8 +1194,6 @@ mod tests {
    results
});

Spi::run("TRUNCATE dog_owners;").unwrap();
```
Member Author

Fix a wrong flow that was revealed by this PR.

@aykut-bozkurt aykut-bozkurt force-pushed the aykut/file-size-bytes branch 5 times, most recently from a707629 to 8a4d5ec Compare March 11, 2025 09:18
@aykut-bozkurt aykut-bozkurt force-pushed the aykut/file-size-bytes branch from 8a4d5ec to 59715d6 Compare March 14, 2025 12:03
```rust
{
    parquet_dest.copy_options.row_group_size
} else {
    RECORD_BATCH_SIZE
```
Collaborator

will this end up being the row group size? 1024 seems low

Member Author

`parquet_dest.copy_options.row_group_size` will be 122880 by default. If the user explicitly specifies a size lower than `RECORD_BATCH_SIZE`, we make sure to process at least `RECORD_BATCH_SIZE` rows per batch for performance reasons.

Collaborator

got it, could use a comment
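As a side note, the floor discussed in this thread amounts to taking the maximum of the user's row group size and the record batch size. A minimal standalone Rust sketch (the function name is illustrative and the `1024` value comes from the review comment above, not from pg_parquet's actual code):

```rust
// Assumed value, per the review comment "1024 seems low".
const RECORD_BATCH_SIZE: usize = 1024;

// Hypothetical helper: pick the batch size to process. Even if the user
// configures a row_group_size smaller than RECORD_BATCH_SIZE, we still
// process at least RECORD_BATCH_SIZE rows per batch for performance.
fn effective_batch_size(row_group_size: usize) -> usize {
    row_group_size.max(RECORD_BATCH_SIZE)
}

fn main() {
    // Default row group size mentioned in the thread passes through unchanged.
    assert_eq!(effective_batch_size(122880), 122880);
    // An explicitly tiny row group size gets floored to the batch size.
    assert_eq!(effective_batch_size(100), 1024);
    println!("ok");
}
```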

README.md Outdated
@@ -248,6 +248,7 @@ Supported authorization methods' priority order is shown below:
## Copy Options
`pg_parquet` supports the following options in the `COPY TO` command:
- `format parquet`: you need to specify this option to read or write Parquet files which does not end with `.parquet[.<compression>]` extension,
- `file_size_bytes <int>`: the total byte size per Parquet file. When set, the parquet files, with target size, are created under parent directory (named the same as file name without file extension). By default, when not specified, a single file is generated without creating a parent folder.
Collaborator

super minor: this int looks a bit wrong, so maybe int64

```rust
// append child id to final part of uri
let file_id = self.current_child_id;

let child_uri = parent_folder.join(format!("data_{file_id}{file_extension}"));
```
Collaborator

I don't think it's common to use anything other than .parquet as the extension.

The way this works in DuckDB is that the filename becomes a directory name (even if it contains `.parquet`), and `.parquet` is always appended.
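For reference, the snippet under review builds each child path by joining `data_{file_id}{file_extension}` onto the parent folder. A minimal standalone Rust sketch of that naming scheme (the helper name is illustrative, not pg_parquet's API):

```rust
use std::path::PathBuf;

// Hypothetical helper mirroring the reviewed snippet: child files are named
// data_<id><extension> inside the parent folder derived from the COPY target.
fn child_uri(parent_folder: &str, file_id: u64, file_extension: &str) -> PathBuf {
    PathBuf::from(parent_folder).join(format!("data_{file_id}{file_extension}"))
}

fn main() {
    let p = child_uri("/tmp/test.parquet", 0, ".parquet");
    // Matches the `ls` output shown in the PR description.
    assert_eq!(p.to_str().unwrap(), "/tmp/test.parquet/data_0.parquet");
    println!("ok");
}
```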

README.md Outdated
@@ -248,6 +248,7 @@ Supported authorization methods' priority order is shown below:
## Copy Options
`pg_parquet` supports the following options in the `COPY TO` command:
- `format parquet`: you need to specify this option to read or write Parquet files which does not end with `.parquet[.<compression>]` extension,
- `file_size_bytes <int>`: the total byte size per Parquet file. When set, the parquet files, with target size, are created under parent directory (named the same as file name without file extension). By default, when not specified, a single file is generated without creating a parent folder.
Collaborator

is there a function in postgres we could use to parse something like '512MB' as well? Kind of hard to remember how many bytes that is.

Member Author

No, but we could parse units from the string.
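Parsing units from the string, as suggested in the reply above, could look like the following minimal Rust sketch (an entirely hypothetical helper, not code from this PR; only binary B/KB/MB/GB units are handled):

```rust
// Hypothetical helper: parse sizes like "1MB" or "512" into a byte count.
// Returns None for malformed input or unknown units.
fn parse_file_size_bytes(input: &str) -> Option<i64> {
    let trimmed = input.trim();
    // Split the leading digits from the trailing unit suffix.
    let split = trimmed
        .find(|c: char| !c.is_ascii_digit())
        .unwrap_or(trimmed.len());
    let (num, unit) = trimmed.split_at(split);
    let value: i64 = num.parse().ok()?;
    let multiplier: i64 = match unit.trim().to_ascii_uppercase().as_str() {
        "" | "B" => 1,
        "KB" => 1024,
        "MB" => 1024 * 1024,
        "GB" => 1024 * 1024 * 1024,
        _ => return None,
    };
    value.checked_mul(multiplier)
}

fn main() {
    assert_eq!(parse_file_size_bytes("1MB"), Some(1_048_576));
    assert_eq!(parse_file_size_bytes("512"), Some(512));
    assert!(parse_file_size_bytes("abc").is_none());
    println!("ok");
}
```

This would let users write `file_size_bytes '1MB'` instead of remembering `1048576`, which is the form the merged example in the PR description ends up using.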

COPY TO parquet now supports a new option, called `file_size_bytes`, which lets you
generate parquet files with target size = `file_size_bytes`.

When a parquet file exceeds the target size, it will be flushed and a new parquet file
will be generated under a parent directory. (parent directory will be the path without
the parquet extension)

e.g.

```sql
COPY (select 'hellooooo' || i from generate_series(1, 1000000) i) to '/tmp/test.parquet' with (file_size_bytes 1048576);
```

```bash
> ls -alh /tmp/test/
1.4M data_0.parquet
1.4M data_1.parquet
1.4M data_2.parquet
1.4M data_3.parquet
114K data_4.parquet
```
@aykut-bozkurt aykut-bozkurt force-pushed the aykut/file-size-bytes branch from 462c1a0 to 59da25c Compare April 7, 2025 10:04
@aykut-bozkurt aykut-bozkurt requested a review from marcoslot April 7, 2025 10:13
@aykut-bozkurt aykut-bozkurt merged commit f8c3d62 into main Apr 7, 2025
6 checks passed
@aykut-bozkurt aykut-bozkurt deleted the aykut/file-size-bytes branch April 7, 2025 12:54
Successfully merging this pull request may close these issues.

Support file splitting