SNOW-2409156: Add write_parquet function equivalent to write_arrow #3916
Which Jira issue is this PR addressing? Make sure that there is an accompanying issue to your PR.
Fixes SNOW-2409156 (SNOW-2409156: Add write_parquet function equivalent to write_arrow #3888)
Fill out the following pre-review checklist:
Please describe how your code solves the related issue.
The ability to create tables from parquet files is already included in the `write_arrow` function. I have refactored `write_arrow` into `write_arrow` and `write_parquet`, where `write_parquet` receives a generator that yields parquet file paths. Before I continue, I would like to get clarification from the maintainers on the following points, which I noticed while touching and modifying the code.
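For context, this is roughly the shape I have in mind for the new function. This is only a sketch: the parameter names, defaults, and docstring are my own placeholders, not the final API.

```python
from typing import Iterator, Optional

from snowflake.connector import SnowflakeConnection


def write_parquet(
    conn: SnowflakeConnection,
    parquet_files: Iterator[str],  # generator yielding paths of local parquet files
    table_name: str,
    database: Optional[str] = None,
    schema: Optional[str] = None,
    compression: str = "auto",             # see point 4 below
    write_files_in_parallel: bool = True,  # see point 3 below
):
    """Stage the given parquet files with PUT and load them into table_name via COPY INTO.

    Sketch of the function refactored out of write_arrow; names and defaults
    are placeholders and up for discussion.
    """
    ...
```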
1. Current behavior: delete each parquet file right after its chunk is uploaded
The code currently does the following (pseudocode):

```
init stage, file format, etc.
for each arrow chunk:
    write parquet file using pyarrow
    PUT parquet file to the stage
    delete parquet file
infer schema from the staged parquet files
COPY INTO the table from the stage
...
```
To continue supporting the delete-parquet-after-each-upload behavior I had to add the generator approach, which complicates the implementation (see the sketch below).

Question: Is it okay to drop that behavior and assume the host has enough disk space to store all parquet files at once? My answer would be: yes, it is sensible to assume the host has more disk space than the memory required to hold the in-memory arrow chunks (which are decompressed).
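Concretely, the workaround looks roughly like this. A minimal sketch with a hypothetical helper name `_parquet_chunks`, not the exact code in the PR: the generator deletes each temporary file as soon as the consumer asks for the next path, i.e. after that file's PUT has finished.

```python
import os
from typing import Iterator

import pyarrow as pa
import pyarrow.parquet as pq


def _parquet_chunks(chunks: Iterator[pa.Table], tmp_dir: str) -> Iterator[str]:
    """Write each arrow chunk to a parquet file and remove the file once the
    consumer has moved on to the next chunk (i.e. after its PUT completed)."""
    for i, chunk in enumerate(chunks):
        path = os.path.join(tmp_dir, f"chunk_{i}.parquet")
        pq.write_table(chunk, path)
        yield path  # the consumer PUTs this file ...
        # ... and by the time it asks for the next path, the previous file
        # has been uploaded and can be deleted.
        os.remove(path)
```

If the delete-after-each-upload behavior can be dropped, this indirection goes away: `write_arrow` can simply write all chunks into a temporary directory and hand the directory (or the list of paths) to `write_parquet`.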
2. `_is_internal=True` setting when passing queries to the SnowflakeCursor

As a user of the library I found this extremely inconvenient: it means the queries are hidden from the Query History UI, and I had to run manual SQL queries against the QUERY_HISTORY table instead. Can we set `_is_internal=False` for the queries we send in the `write_parquet` function? A minimal illustration of the flag is below.
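This is the pattern I am referring to. A sketch only; the helper name and the exact SQL are placeholders, and the real code issues several such statements (stage, file format, PUT, COPY INTO):

```python
from snowflake.connector.cursor import SnowflakeCursor


def _create_temp_stage(cursor: SnowflakeCursor, stage_name: str) -> None:
    create_stage_sql = f'CREATE TEMPORARY STAGE "{stage_name}"'
    # _is_internal=True hides the statement from the Query History UI;
    # I would like write_parquet to pass _is_internal=False instead so the
    # generated statements stay visible to the user.
    cursor.execute(create_stage_sql, _is_internal=False)
```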
3. `write_files_in_parallel` argument
I added this argument because in real-world workloads, using PUT with a glob pattern (e.g. `PUT folder/*.parquet`) was orders of magnitude faster than:
```
for file in files:
    PUT file
```
To keep the implementation simple and avoid too many edge cases, I have limited `write_files_in_parallel` to folders which do not contain any nested folders. I error out early and tell users to use `write_files_in_parallel=False` to fall back to the for-loop approach; both code paths are sketched below.

Is my understanding of PUT correct? Or is there a simple way to call PUT with the behavior "all parquet files in the top-level folder and in all nested subfolders"?
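To make the comparison concrete, this is roughly what the two code paths look like. A sketch with placeholder names and illustrative PUT options; as far as I understand, the PUT wildcard only matches files within a single directory, which is why the parallel path is restricted to flat folders.

```python
import glob

from snowflake.connector.cursor import SnowflakeCursor


def _upload_parquet_files(
    cursor: SnowflakeCursor, folder: str, stage: str, write_files_in_parallel: bool
) -> None:
    if write_files_in_parallel:
        # One PUT with a glob: the connector uploads all matching files using
        # multiple threads, which matches the speed difference described above.
        cursor.execute(f"PUT file://{folder}/*.parquet @{stage} PARALLEL = 8 AUTO_COMPRESS = FALSE")
    else:
        # Fallback: one PUT per file; works with nested folders as well.
        for path in glob.glob(f"{folder}/**/*.parquet", recursive=True):
            cursor.execute(f"PUT file://{path} @{stage} AUTO_COMPRESS = FALSE")
```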
4. `compression` argument

I am confused about this argument in the `write_arrow` call. Its docstring describes it, but `compression` is not passed to pyarrow's `write_table` function (which defaults to `snappy`). It is passed into the subsequent stage / file format / COPY INTO calls, though, after a translation through a map (`compression_map = {"gzip": "auto", "snappy": "snappy", "none": "none"}`).

Usually, I would expect that we shouldn't set any compression on the stage, the file format, or the COPY INTO, as the parquet files are already snappy-compressed.
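To illustrate my reading of the current flow; this is my reconstruction with toy values, not a quote from the code, and the exact SQL differs:

```python
import pyarrow as pa
import pyarrow.parquet as pq

chunk = pa.table({"a": [1, 2, 3]})
path = "/tmp/chunk_0.parquet"
compression = "gzip"  # current default of the write_arrow argument

# 1. The parquet file is written without an explicit compression argument,
#    so pyarrow's default applies (snappy):
pq.write_table(chunk, path)

# 2. The user-facing compression value is only translated for the SQL side:
compression_map = {"gzip": "auto", "snappy": "snappy", "none": "none"}
sql_compression = compression_map[compression]  # "gzip" -> "auto"

# 3. ... and then ends up in the stage / file format / COPY INTO statements,
#    roughly: FILE_FORMAT = (TYPE = PARQUET COMPRESSION = AUTO)
```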
My question: I have changed the default in `write_parquet` to `auto`. Should I change the default in `write_arrow` to `auto` as well? Currently it seems to be `gzip`, which is mapped to `auto` through the `compression_map`.

Thank you!