Support file_size_bytes option #100


Merged
merged 2 commits into from
Apr 7, 2025

Conversation

aykut-bozkurt
Member

@aykut-bozkurt aykut-bozkurt commented Jan 22, 2025

`COPY TO` parquet now supports a new option, `file_size_bytes`, which lets you generate parquet files with a target size of `file_size_bytes`.

When a parquet file exceeds the target size, it is flushed and a new parquet file is generated under a parent directory. (The parent directory is the path without the parquet extension.)

e.g.

```sql
COPY (select 'hellooooo' || i from generate_series(1, 1000000) i) to '/tmp/test.parquet' with (file_size_bytes '1MB');
```

```bash
> ls -alh /tmp/test.parquet/
1.4M data_0.parquet
1.4M data_1.parquet
1.4M data_2.parquet
1.4M data_3.parquet
114K data_4.parquet
```
Closes #107.

@aykut-bozkurt aykut-bozkurt force-pushed the aykut/cache-object-stores branch from 9d90243 to a493016 Compare January 24, 2025 14:46
Base automatically changed from aykut/cache-object-stores to main January 30, 2025 07:22
@aykut-bozkurt aykut-bozkurt force-pushed the aykut/file-size-bytes branch 2 times, most recently from 6428599 to a857cb6 Compare January 30, 2025 08:22

codecov bot commented Jan 30, 2025

Codecov Report

Attention: Patch coverage is 95.71734% with 20 lines in your changes missing coverage. Please review.

Project coverage is 92.90%. Comparing base (b626eb4) to head (59da25c).
Report is 1 commit behind head on main.

Files with missing lines Patch % Lines
...c/parquet_copy_hook/copy_to_split_dest_receiver.rs 92.81% 12 Missing ⚠️
src/parquet_copy_hook/copy_utils.rs 89.55% 7 Missing ⚠️
src/parquet_copy_hook/copy_to_dest_receiver.rs 98.41% 1 Missing ⚠️
Additional details and impacted files
```
@@            Coverage Diff             @@
##             main     #100      +/-   ##
==========================================
+ Coverage   92.43%   92.90%   +0.46%     
==========================================
  Files          85       86       +1     
  Lines       11288    11650     +362     
==========================================
+ Hits        10434    10823     +389     
+ Misses        854      827      -27     
```


```
@@ -1194,8 +1194,6 @@ mod tests {
    results
});

Spi::run("TRUNCATE dog_owners;").unwrap();
```
Member Author

Fix a wrong flow that was revealed by this PR.

@aykut-bozkurt aykut-bozkurt force-pushed the aykut/file-size-bytes branch 5 times, most recently from a707629 to 8a4d5ec Compare March 11, 2025 09:18
@aykut-bozkurt aykut-bozkurt force-pushed the aykut/file-size-bytes branch from 8a4d5ec to 59715d6 Compare March 14, 2025 12:03
```rust
{
    parquet_dest.copy_options.row_group_size
} else {
    RECORD_BATCH_SIZE
```
Collaborator

will this end up being the row group size? 1024 seems low

Member Author

`parquet_dest.copy_options.row_group_size` will be 122880 by default. If the user explicitly specifies a size lower than `RECORD_BATCH_SIZE`, we make sure to process at least `RECORD_BATCH_SIZE` rows per batch for performance reasons.

Collaborator

got it, could use a comment
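As a side note, the floor discussed in this thread amounts to taking the maximum of the user's row group size and the record batch size. A minimal standalone Rust sketch (the function name is illustrative and the `1024` value comes from the review comment above, not from pg_parquet's actual code):

```rust
// Assumed value, per the review comment "1024 seems low".
const RECORD_BATCH_SIZE: usize = 1024;

// Hypothetical helper: pick the batch size to process. Even if the user
// configures a row_group_size smaller than RECORD_BATCH_SIZE, we still
// process at least RECORD_BATCH_SIZE rows per batch for performance.
fn effective_batch_size(row_group_size: usize) -> usize {
    row_group_size.max(RECORD_BATCH_SIZE)
}

fn main() {
    // Default row group size mentioned in the thread passes through unchanged.
    assert_eq!(effective_batch_size(122880), 122880);
    // An explicitly tiny row group size gets floored to the batch size.
    assert_eq!(effective_batch_size(100), 1024);
    println!("ok");
}
```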

README.md Outdated
@@ -248,6 +248,7 @@ Supported authorization methods' priority order is shown below:
## Copy Options
`pg_parquet` supports the following options in the `COPY TO` command:
- `format parquet`: you need to specify this option to read or write Parquet files which does not end with `.parquet[.<compression>]` extension,
- `file_size_bytes <int>`: the total byte size per Parquet file. When set, the parquet files, with target size, are created under parent directory (named the same as file name without file extension). By default, when not specified, a single file is generated without creating a parent folder.
Collaborator

super minor: this int looks a bit wrong, so maybe int64

```rust
// append child id to final part of uri
let file_id = self.current_child_id;

let child_uri = parent_folder.join(format!("data_{file_id}{file_extension}"));
```
Collaborator

I don't think it's common to use anything other than .parquet as the extension.

The way this works in DuckDB is that the filename becomes a directory name (even if it contains `.parquet`), and `.parquet` is always appended.
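For reference, the snippet under review builds each child path by joining `data_{file_id}{file_extension}` onto the parent folder. A minimal standalone Rust sketch of that naming scheme (the helper name is illustrative, not pg_parquet's API):

```rust
use std::path::PathBuf;

// Hypothetical helper mirroring the reviewed snippet: child files are named
// data_<id><extension> inside the parent folder derived from the COPY target.
fn child_uri(parent_folder: &str, file_id: u64, file_extension: &str) -> PathBuf {
    PathBuf::from(parent_folder).join(format!("data_{file_id}{file_extension}"))
}

fn main() {
    let p = child_uri("/tmp/test.parquet", 0, ".parquet");
    // Matches the `ls` output shown in the PR description.
    assert_eq!(p.to_str().unwrap(), "/tmp/test.parquet/data_0.parquet");
    println!("ok");
}
```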

README.md Outdated
@@ -248,6 +248,7 @@ Supported authorization methods' priority order is shown below:
## Copy Options
`pg_parquet` supports the following options in the `COPY TO` command:
- `format parquet`: you need to specify this option to read or write Parquet files which does not end with `.parquet[.<compression>]` extension,
- `file_size_bytes <int>`: the total byte size per Parquet file. When set, the parquet files, with target size, are created under parent directory (named the same as file name without file extension). By default, when not specified, a single file is generated without creating a parent folder.
Collaborator

is there a function in postgres we could use to parse something like '512MB' as well? Kind of hard to remember how many bytes that is.

Member Author

No, but we could parse units from the string.
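Parsing units from the string, as suggested in the reply above, could look like the following minimal Rust sketch (an entirely hypothetical helper, not code from this PR; only binary B/KB/MB/GB units are handled):

```rust
// Hypothetical helper: parse sizes like "1MB" or "512" into a byte count.
// Returns None for malformed input or unknown units.
fn parse_file_size_bytes(input: &str) -> Option<i64> {
    let trimmed = input.trim();
    // Split the leading digits from the trailing unit suffix.
    let split = trimmed
        .find(|c: char| !c.is_ascii_digit())
        .unwrap_or(trimmed.len());
    let (num, unit) = trimmed.split_at(split);
    let value: i64 = num.parse().ok()?;
    let multiplier: i64 = match unit.trim().to_ascii_uppercase().as_str() {
        "" | "B" => 1,
        "KB" => 1024,
        "MB" => 1024 * 1024,
        "GB" => 1024 * 1024 * 1024,
        _ => return None,
    };
    value.checked_mul(multiplier)
}

fn main() {
    assert_eq!(parse_file_size_bytes("1MB"), Some(1_048_576));
    assert_eq!(parse_file_size_bytes("512"), Some(512));
    assert!(parse_file_size_bytes("abc").is_none());
    println!("ok");
}
```

This would let users write `file_size_bytes '1MB'` instead of remembering `1048576`, which is the form the merged example in the PR description ends up using.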

COPY TO parquet now supports a new option, called `file_size_bytes`, which lets you
generate parquet files with target size = `file_size_bytes`.

When a parquet file exceeds the target size, it will be flushed and a new parquet file
will be generated under a parent directory. (parent directory will be the path without
the parquet extension)

e.g.

```sql
COPY (select 'hellooooo' || i from generate_series(1, 1000000) i) to '/tmp/test.parquet' with (file_size_bytes 1048576);
```

```bash
> ls -alh /tmp/test/
1.4M data_0.parquet
1.4M data_1.parquet
1.4M data_2.parquet
1.4M data_3.parquet
114K data_4.parquet
```
@aykut-bozkurt aykut-bozkurt force-pushed the aykut/file-size-bytes branch from 462c1a0 to 59da25c Compare April 7, 2025 10:04
@aykut-bozkurt aykut-bozkurt requested a review from marcoslot April 7, 2025 10:13
@aykut-bozkurt aykut-bozkurt merged commit f8c3d62 into main Apr 7, 2025
6 checks passed
@aykut-bozkurt aykut-bozkurt deleted the aykut/file-size-bytes branch April 7, 2025 12:54
Successfully merging this pull request may close these issues.

Support file splitting