SNOW-2409156: Add write_parquet function equivalent to write_arrow #3916
Which Jira issue is this PR addressing? Make sure that there is an accompanying issue to your PR.
Fixes SNOW-2409156 (SNOW-2409156: Add write_parquet function equivalent to write_arrow #3888)
Fill out the following pre-review checklist:
Please describe how your code solves the related issue.
The ability to create tables from parquet files is already included in the `write_arrow` function. I have refactored `write_arrow` into `write_arrow` and `write_parquet`, where `write_parquet` receives a generator that yields parquet file paths. Before I continue, I would like to get clarification from the maintainers on the following points, which I noticed while touching and modifying the code.
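For context, this is roughly the shape I have in mind for the new function. This is only a sketch: the parameter names, defaults, and docstring are my own placeholders, not the final API.

```python
from typing import Iterator, Optional

from snowflake.connector import SnowflakeConnection


def write_parquet(
    conn: SnowflakeConnection,
    parquet_files: Iterator[str],  # generator yielding paths of local parquet files
    table_name: str,
    database: Optional[str] = None,
    schema: Optional[str] = None,
    compression: str = "auto",             # see point 4 below
    write_files_in_parallel: bool = True,  # see point 3 below
):
    """Stage the given parquet files with PUT and load them into table_name via COPY INTO.

    Sketch of the function refactored out of write_arrow; names and defaults
    are placeholders and up for discussion.
    """
    ...
```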
1. Current behavior: delete each parquet file right after its chunk is uploaded
The code currently does the following (pseudocode):

```
init stage, file format, etc.
for each arrow chunk:
    write parquet file using pyarrow
    PUT parquet file to the stage
    delete parquet file
infer schema from the staged parquet files
COPY INTO the table from the stage
...
```
To continue supporting the delete-parquet-after-each-upload behavior I had to add the generator approach, which complicates the implementation (see the sketch below).

Question: Is it okay to drop that behavior and assume the host has enough disk space to store all parquet files at once? My answer would be: yes, it is sensible to assume the host has more disk space than the memory required to hold the in-memory arrow chunks (which are decompressed).
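Concretely, the workaround looks roughly like this. A minimal sketch with a hypothetical helper name `_parquet_chunks`, not the exact code in the PR: the generator deletes each temporary file as soon as the consumer asks for the next path, i.e. after that file's PUT has finished.

```python
import os
from typing import Iterator

import pyarrow as pa
import pyarrow.parquet as pq


def _parquet_chunks(chunks: Iterator[pa.Table], tmp_dir: str) -> Iterator[str]:
    """Write each arrow chunk to a parquet file and remove the file once the
    consumer has moved on to the next chunk (i.e. after its PUT completed)."""
    for i, chunk in enumerate(chunks):
        path = os.path.join(tmp_dir, f"chunk_{i}.parquet")
        pq.write_table(chunk, path)
        yield path  # the consumer PUTs this file ...
        # ... and by the time it asks for the next path, the previous file
        # has been uploaded and can be deleted.
        os.remove(path)
```

If the delete-after-each-upload behavior can be dropped, this indirection goes away: `write_arrow` can simply write all chunks into a temporary directory and hand the directory (or the list of paths) to `write_parquet`.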
2. `_is_internal=True` setting when passing queries to the SnowflakeCursor

As a user of the library I found this extremely inconvenient: it means the queries are hidden from the Query History UI, and I had to run manual SQL queries against the QUERY_HISTORY table instead. Can we set `_is_internal=False` for the queries we send in the `write_parquet` function? A minimal illustration of the flag is below.
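This is the pattern I am referring to. A sketch only; the helper name and the exact SQL are placeholders, and the real code issues several such statements (stage, file format, PUT, COPY INTO):

```python
from snowflake.connector.cursor import SnowflakeCursor


def _create_temp_stage(cursor: SnowflakeCursor, stage_name: str) -> None:
    create_stage_sql = f'CREATE TEMPORARY STAGE "{stage_name}"'
    # _is_internal=True hides the statement from the Query History UI;
    # I would like write_parquet to pass _is_internal=False instead so the
    # generated statements stay visible to the user.
    cursor.execute(create_stage_sql, _is_internal=False)
```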
3. `write_files_in_parallel` argument
I added this argument because in real-world workloads, using PUT with a glob pattern (e.g. `PUT folder/*.parquet`) was orders of magnitude faster than:
```
for file in files:
    PUT file
```
To keep the implementation simple and avoid too many edge cases, I have limited `write_files_in_parallel` to folders which do not contain any nested folders. I error out early and tell users to use `write_files_in_parallel=False` to fall back to the for-loop approach; both code paths are sketched below.

Is my understanding of PUT correct? Or is there a simple way to call PUT with the behavior "all parquet files in the top-level folder and in all nested subfolders"?
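To make the comparison concrete, this is roughly what the two code paths look like. A sketch with placeholder names and illustrative PUT options; as far as I understand, the PUT wildcard only matches files within a single directory, which is why the parallel path is restricted to flat folders.

```python
import glob

from snowflake.connector.cursor import SnowflakeCursor


def _upload_parquet_files(
    cursor: SnowflakeCursor, folder: str, stage: str, write_files_in_parallel: bool
) -> None:
    if write_files_in_parallel:
        # One PUT with a glob: the connector uploads all matching files using
        # multiple threads, which matches the speed difference described above.
        cursor.execute(f"PUT file://{folder}/*.parquet @{stage} PARALLEL = 8 AUTO_COMPRESS = FALSE")
    else:
        # Fallback: one PUT per file; works with nested folders as well.
        for path in glob.glob(f"{folder}/**/*.parquet", recursive=True):
            cursor.execute(f"PUT file://{path} @{stage} AUTO_COMPRESS = FALSE")
```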
4. `compression` argument

I am confused about this argument in the `write_arrow` call. Its docstring describes it, but `compression` is not passed to pyarrow's `write_table` function (which defaults to `snappy`). It is passed into the subsequent stage / file format / COPY INTO calls, though, after a translation through a map (`compression_map = {"gzip": "auto", "snappy": "snappy", "none": "none"}`).

Usually, I would expect that we shouldn't set any compression on the stage, the file format, or the COPY INTO, as the parquet files are already snappy-compressed.
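To illustrate my reading of the current flow; this is my reconstruction with toy values, not a quote from the code, and the exact SQL differs:

```python
import pyarrow as pa
import pyarrow.parquet as pq

chunk = pa.table({"a": [1, 2, 3]})
path = "/tmp/chunk_0.parquet"
compression = "gzip"  # current default of the write_arrow argument

# 1. The parquet file is written without an explicit compression argument,
#    so pyarrow's default applies (snappy):
pq.write_table(chunk, path)

# 2. The user-facing compression value is only translated for the SQL side:
compression_map = {"gzip": "auto", "snappy": "snappy", "none": "none"}
sql_compression = compression_map[compression]  # "gzip" -> "auto"

# 3. ... and then ends up in the stage / file format / COPY INTO statements,
#    roughly: FILE_FORMAT = (TYPE = PARQUET COMPRESSION = AUTO)
```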
My question: I have changed the default in `write_parquet` to `auto`. Should I change the default in `write_arrow` to `auto` as well? Currently it seems to be `gzip`, which is mapped to `auto` through the `compression_map`.

Thank you!