Support file_size_bytes option #100

Open
wants to merge 1 commit into
base: main

aykut-bozkurt
Collaborator

COPY TO parquet now supports a new option, `file_size_bytes`, which lets you generate parquet files with a target size of `file_size_bytes` bytes.

When a parquet file exceeds the target size, it is flushed and a new parquet file is created under a parent directory. The parent directory is the destination path without the `.parquet` extension.

For example:

```sql
COPY (select 'hellooooo' || i from generate_series(1, 1000000) i) to '/tmp/test.parquet' with (file_size_bytes 1048576);
```

```bash
> ls -alh /tmp/test/
1.4M data_0.parquet
1.4M data_1.parquet
1.4M data_2.parquet
1.4M data_3.parquet
114K data_4.parquet
```
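
The naming and rotation rule described above can be illustrated with a minimal Rust sketch. This is not the code from this PR: `child_path` and `SplitTracker` are hypothetical names introduced only for illustration, and the sketch assumes the size check runs after each write. That assumption would be one plausible reason a finished file can end up larger than the target, as in the listing above.

```rust
use std::path::{Path, PathBuf};

/// Hypothetical helper (not from the PR): drop the `.parquet` extension to get
/// the parent directory, then name the child file `data_<index>.parquet`.
fn child_path(dest: &Path, index: usize) -> PathBuf {
    dest.with_extension("")
        .join(format!("data_{}.parquet", index))
}

/// Hypothetical tracker that decides when to start a new child file.
struct SplitTracker {
    dest: PathBuf,
    target_bytes: u64,
    current_index: usize,
    current_bytes: u64,
}

impl SplitTracker {
    fn new(dest: impl Into<PathBuf>, target_bytes: u64) -> Self {
        SplitTracker {
            dest: dest.into(),
            target_bytes,
            current_index: 0,
            current_bytes: 0,
        }
    }

    /// Path of the child file currently being written.
    fn current_path(&self) -> PathBuf {
        child_path(&self.dest, self.current_index)
    }

    /// Record `bytes` written to the current file. Assumption: the check runs
    /// after the write, so a file is closed only once it exceeds the target.
    /// Returns the path of the finished file when a rotation happens.
    fn record(&mut self, bytes: u64) -> Option<PathBuf> {
        self.current_bytes += bytes;
        if self.current_bytes > self.target_bytes {
            let finished = self.current_path();
            self.current_index += 1;
            self.current_bytes = 0;
            Some(finished)
        } else {
            None
        }
    }
}
```

With a destination of `/tmp/test.parquet`, the first child file in this sketch is `/tmp/test/data_0.parquet`, and `record` returns that path once the accumulated size passes the target.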

aykut-bozkurt force-pushed the aykut/cache-object-stores branch from 9d90243 to a493016 on January 24, 2025 14:46
Base automatically changed from aykut/cache-object-stores to main on January 30, 2025 07:22
aykut-bozkurt force-pushed the aykut/file-size-bytes branch 2 times, most recently from 6428599 to a857cb6 on January 30, 2025 08:22
codecov bot commented Jan 30, 2025

Codecov Report

Attention: Patch coverage is 96.62338% with 13 lines in your changes missing coverage. Please review.

Project coverage is 92.45%. Comparing base (3ff46d5) to head (2c06724).

| Files with missing lines | Patch % | Lines |
| --- | --- | --- |
| ...c/parquet_copy_hook/copy_to_split_dest_receiver.rs | 93.29% | 12 Missing ⚠️ |
| src/parquet_copy_hook/copy_to_dest_receiver.rs | 98.07% | 1 Missing ⚠️ |

Additional details and impacted files
@@            Coverage Diff             @@
##             main     #100      +/-   ##
==========================================
+ Coverage   91.88%   92.45%   +0.56%     
==========================================
  Files          77       78       +1     
  Lines       10320    10640     +320     
==========================================
+ Hits         9483     9837     +354     
+ Misses        837      803      -34     


@@ -1194,8 +1194,6 @@ mod tests {
results
});

Spi::run("TRUNCATE dog_owners;").unwrap();

Fixes an incorrect flow that this PR revealed.
