Support file_size_bytes option #100
Conversation
Codecov Report

```
@@            Coverage Diff             @@
##             main     #100      +/-   ##
==========================================
+ Coverage   92.43%   92.90%   +0.46%
==========================================
  Files          85       86       +1
  Lines       11288    11650     +362
==========================================
+ Hits        10434    10823     +389
+ Misses        854      827      -27
```
```diff
@@ -1194,8 +1194,6 @@ mod tests {
         results
     });

     Spi::run("TRUNCATE dog_owners;").unwrap();
```
fix a wrong flow that was revealed by the PR
```rust
{
    parquet_dest.copy_options.row_group_size
} else {
    RECORD_BATCH_SIZE
```
will this end up being the row group size? 1024 seems low
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
`parquet_dest.copy_options.row_group_size` will be 122880 by default. In case it is explicitly specified by the user as a size lower than `RECORD_BATCH_SIZE`, we make sure to process at least `RECORD_BATCH_SIZE` rows at a time, for performance reasons.
got it, could use a comment
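For illustration, here is the fragment above with the kind of comment being asked for; the `if` condition is reconstructed from the snippet and the reply, so treat it as an assumption rather than the actual code:

```rust
// Batch rows at the requested row group size (122880 by default), but never
// below RECORD_BATCH_SIZE: a user-specified row_group_size smaller than
// RECORD_BATCH_SIZE is rounded up, since flushing tiny batches hurts performance.
let batch_size = if parquet_dest.copy_options.row_group_size > RECORD_BATCH_SIZE {
    parquet_dest.copy_options.row_group_size
} else {
    RECORD_BATCH_SIZE
};
```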
README.md (Outdated)

```diff
@@ -248,6 +248,7 @@ Supported authorization methods' priority order is shown below:
 ## Copy Options
 `pg_parquet` supports the following options in the `COPY TO` command:
 - `format parquet`: you need to specify this option to read or write Parquet files which does not end with `.parquet[.<compression>]` extension,
+- `file_size_bytes <int>`: the total byte size per Parquet file. When set, the parquet files, with target size, are created under parent directory (named the same as file name without file extension). By default, when not specified, a single file is generated without creating a parent folder.
```
super minor: this int looks a bit wrong, so maybe int64
```rust
// append child id to final part of uri
let file_id = self.current_child_id;

let child_uri = parent_folder.join(format!("data_{file_id}{file_extension}"));
```
I don't think it's common to use anything other than .parquet as the extension. The way this works in DuckDB is that the filename becomes a directory name (even if it contains .parquet), and .parquet is always appended.
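A tiny sketch of the DuckDB-style naming described above, under the assumption that the target path itself becomes the parent directory and `.parquet` is always appended to the child files (function name is hypothetical):

```rust
use std::path::{Path, PathBuf};

// DuckDB-style: '/tmp/test.parquet' stays a directory name as-is, and each
// child file always gets a '.parquet' extension appended.
fn child_path(target: &Path, child_id: u64) -> PathBuf {
    target.join(format!("data_{child_id}.parquet"))
}

fn main() {
    let p = child_path(Path::new("/tmp/test.parquet"), 0);
    assert_eq!(p, PathBuf::from("/tmp/test.parquet/data_0.parquet"));
}
```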
README.md (Outdated)

```diff
@@ -248,6 +248,7 @@ Supported authorization methods' priority order is shown below:
 ## Copy Options
 `pg_parquet` supports the following options in the `COPY TO` command:
 - `format parquet`: you need to specify this option to read or write Parquet files which does not end with `.parquet[.<compression>]` extension,
+- `file_size_bytes <int>`: the total byte size per Parquet file. When set, the parquet files, with target size, are created under parent directory (named the same as file name without file extension). By default, when not specified, a single file is generated without creating a parent folder.
```
is there a function in postgres we could use to parse something like '512MB' as well? Kind of hard to remember how many bytes that is.
No, but we could parse units from the string.
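For illustration, a minimal sketch of parsing such a size string in Rust; the suffix set and semantics are assumptions, and real code would want stricter validation:

```rust
// Parse human-readable sizes like "512MB" into a byte count.
// Suffixes are checked longest-first so "MB" is not mistaken for "B".
fn parse_size_bytes(input: &str) -> Option<i64> {
    let s = input.trim();
    let units: [(&str, i64); 5] =
        [("TB", 1 << 40), ("GB", 1 << 30), ("MB", 1 << 20), ("KB", 1 << 10), ("B", 1)];
    for (suffix, multiplier) in units {
        if let Some(number) = s.strip_suffix(suffix) {
            return number.trim().parse::<i64>().ok()?.checked_mul(multiplier);
        }
    }
    s.parse().ok() // bare number: treat as bytes
}

fn main() {
    assert_eq!(parse_size_bytes("512MB"), Some(512 * 1024 * 1024));
    assert_eq!(parse_size_bytes("1048576"), Some(1_048_576));
}
```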
COPY TO parquet now supports a new option, called `file_size_bytes`, which lets you generate parquet files with target size = `file_size_bytes`. When a parquet file exceeds the target size, it is flushed and a new parquet file is generated under a parent directory (the parent directory is the path without the parquet extension).

e.g.

```sql
COPY (select 'hellooooo' || i from generate_series(1, 1000000) i) to '/tmp/test.parquet' with (file_size_bytes 1048576);
```

```bash
> ls -alh /tmp/test.parquet/
1.4M data_0.parquet
1.4M data_1.parquet
1.4M data_2.parquet
1.4M data_3.parquet
114K data_4.parquet
```

Closes #107.
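For reference, a minimal sketch of the rotation logic the description outlines; the struct, fields, and method are assumptions, not pg_parquet's actual internals:

```rust
// Rotate to a new child parquet file once the current one reaches the
// target size given by the file_size_bytes COPY option.
struct ChunkedParquetWriter {
    file_size_bytes: i64,  // target size per child file
    bytes_written: i64,    // bytes flushed to the current child so far
    current_child_id: u64, // suffix used in data_<id>.parquet
}

impl ChunkedParquetWriter {
    fn maybe_rotate(&mut self) -> Option<String> {
        if self.bytes_written < self.file_size_bytes {
            return None; // current child is still below the target size
        }
        // close the current child and start data_<id+1>.parquet fresh
        self.current_child_id += 1;
        self.bytes_written = 0;
        Some(format!("data_{}.parquet", self.current_child_id))
    }
}

fn main() {
    let mut w = ChunkedParquetWriter {
        file_size_bytes: 1_048_576,
        bytes_written: 1_100_000,
        current_child_id: 0,
    };
    assert_eq!(w.maybe_rotate(), Some("data_1.parquet".to_string()));
}
```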