Conversation


@nicornk nicornk commented Oct 20, 2025

  1. Which Jira issue is this PR addressing? Make sure that there is an accompanying issue to your PR.

    Fixes SNOW-2409156: Add write_parquet function equivalent to write_arrow #3888

  2. Fill out the following pre-review checklist:

    • I am adding a new automated test(s) to verify correctness of my new code
      • If this test skips Local Testing mode, I'm requesting review from @snowflakedb/local-testing
    • I am adding new logging messages
    • I am adding a new telemetry message
    • I am adding new credentials
    • I am adding a new dependency
    • If this is a new feature/behavior, I'm adding the Local Testing parity changes.
    • I acknowledge that I have ensured my changes to be thread-safe. Follow the link for more information: Thread-safe Developer Guidelines
    • If adding any arguments to public Snowpark APIs or creating new public Snowpark APIs, I acknowledge that I have ensured my changes include AST support. Follow the link for more information: AST Support Guidelines
  3. Please describe how your code solves the related issue.

The ability to create tables from parquet files is already included in the write_arrow function. I have refactored write_arrow into write_arrow and write_parquet, where write_parquet receives a generator that yields parquet file paths.
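
To make "a generator that yields parquet file paths" concrete, here is a minimal, illustrative sketch (the helper name and chunking logic are my own, not part of the PR):

```python
from typing import Iterator

import pyarrow as pa
import pyarrow.parquet as pq

# Illustrative only: write each arrow batch to a local parquet file and yield
# its path; write_parquet would consume such a generator.
def parquet_chunks(table: pa.Table, tmp_dir: str, rows_per_chunk: int) -> Iterator[str]:
    for i, batch in enumerate(table.to_batches(max_chunksize=rows_per_chunk)):
        path = f"{tmp_dir}/chunk_{i}.parquet"
        pq.write_table(pa.Table.from_batches([batch]), path)
        yield path
```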

Before I continue, I would like to get clarification from the maintainers on the following things I noticed while modifying the code:

1. Current behavior: delete each parquet file after its chunk is uploaded

The code currently does the following (pseudocode):

init stage, file format, etc.
for each arrow chunk:
    write a parquet file using pyarrow
    PUT the parquet file to the stage
    delete the parquet file
infer the schema from the staged parquet files
COPY INTO the table from the stage
...

To continue supporting the delete-after-upload behavior for each parquet file, I had to add the generator approach, which complicates the implementation.

Question: Is it okay to drop that behavior and assume the host has enough disk space to store all parquet files at once? My answer would be: yes, it's reasonable to assume the host has more disk space than the memory required to hold the in-memory arrow chunks (which are uncompressed in memory).
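
For illustration, the complication looks roughly like this (a sketch with hypothetical names, not the actual implementation): the generator has to interleave writing, uploading, and deleting so that at most one chunk exists on disk at a time.

```python
import os
from typing import Iterable, Iterator

import pyarrow as pa
import pyarrow.parquet as pq

# Sketch only: write one chunk at a time, let the caller PUT it, then delete it
# immediately, so at most one parquet file exists on disk at any moment.
def staged_parquet_files(batches: Iterable[pa.RecordBatch], tmp_dir: str) -> Iterator[str]:
    for i, batch in enumerate(batches):
        path = os.path.join(tmp_dir, f"chunk_{i}.parquet")
        pq.write_table(pa.Table.from_batches([batch]), path)
        try:
            yield path          # the caller runs PUT on this file
        finally:
            os.remove(path)     # deleted right after the upload
```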

2. _is_internal=True when passing queries to the SnowflakeCursor

As a user of the library I found this extremely inconvenient: it means the queries are hidden from the Query History UI, and I had to run manual SQL queries against the QUERY_HISTORY table instead. Can we set _is_internal=False for the queries we send in the write_parquet function?
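
For reference, this is the pattern I mean (a sketch; the SQL text and connection parameters are placeholders, and _is_internal is the connector's private keyword):

```python
import snowflake.connector

conn = snowflake.connector.connect(account="...", user="...", password="...")  # placeholders
create_stage_sql = 'CREATE TEMPORARY STAGE "MY_STAGE"'  # illustrative SQL

# Queries executed with _is_internal=True are the ones that do not show up in
# the Query History UI; without the flag they are visible like any other query.
conn.cursor().execute(create_stage_sql, _is_internal=True)   # hidden
conn.cursor().execute(create_stage_sql)                      # visible
```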

3. write_files_in_parallel argument
I added this argument because in real-world workloads, using PUT with a glob string (e.g. PUT folder/*.parquet) was orders of magnitude faster than uploading the files one by one:
for file in files:
    PUT file

To keep the implementation simple and avoid too many edge cases, I have limited write_files_in_parallel to folders that do not contain nested folders. I error early and tell users to set write_files_in_parallel=False to fall back to the per-file loop (see the sketch below).
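
For comparison, the two upload strategies look roughly like this (stage name, folder, and PARALLEL value are illustrative):

```python
import os

import snowflake.connector

conn = snowflake.connector.connect(account="...", user="...", password="...")  # placeholders
cursor = conn.cursor()

# write_files_in_parallel=True: a single PUT with a glob string.
cursor.execute('PUT file:///tmp/export/*.parquet @"MY_STAGE" PARALLEL=8')

# write_files_in_parallel=False: one PUT per file, which was far slower in my workloads.
for name in os.listdir("/tmp/export"):
    if name.endswith(".parquet"):
        cursor.execute(f'PUT file:///tmp/export/{name} @"MY_STAGE"')
```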

Is my understanding of PUT correct? Or is there a simple way to call PUT so that it uploads all parquet files in the top-level folder and in all nested subdirectories?

4. compression argument

I am confused about this argument in the write_arrow call. This is the docstring:

compression: The compression used on the Parquet files, can only be gzip, or snappy. Gzip gives a
                better compression, while snappy is faster. Use whichever is more appropriate (Default value = 'gzip').

However, compression is not passed to pyarrow's write_table function (which defaults to snappy).

It is passed into the stage / file format / COPY INTO calls, though, after being translated through a map: compression_map = {"gzip": "auto", "snappy": "snappy", "none": "none"}.
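
To illustrate the mismatch (a sketch; the exact statements in write_arrow may differ):

```python
import pyarrow as pa
import pyarrow.parquet as pq

table = pa.table({"id": [1, 2, 3]})

# No compression argument is forwarded here, so pyarrow's default (snappy)
# is what actually ends up in the parquet file:
pq.write_table(table, "/tmp/chunk_0.parquet")
# i.e. the same as: pq.write_table(table, "/tmp/chunk_0.parquet", compression="snappy")

# The `compression` argument only affects the SQL side (stage / file format /
# COPY INTO), translated through the map:
compression_map = {"gzip": "auto", "snappy": "snappy", "none": "none"}
print(compression_map["gzip"])  # "auto" - the current write_arrow default
```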

Usually, I would expect that we shouldn't set any compression on the stage, the file format, or the COPY INTO at all, since the parquet files are already snappy-compressed.

My question: I have changed the default in write_parquet to auto. Should I change the default in write_arrow to auto as well? Currently, it seems to be gzip, which is mapped to auto through the compression_map.

Thank you!


github-actions bot commented Oct 20, 2025

All contributors have signed the CLA ✍️ ✅
Posted by the CLA Assistant Lite bot.


nicornk commented Oct 20, 2025

I have read the CLA Document and I hereby sign the CLA


nicornk commented Oct 20, 2025

recheck
