
SNOW-1703685: High Memory Usage during PUT Query execution for Large GZIP compressed CSV files #922

Open
kartikgupta2607 opened this issue Sep 30, 2024 · 1 comment
Assignees: sfc-gh-dszmolka
Labels: enhancement (The issue is a request for improvement or a new feature), status-triage_done (Initial triage done, will be further handled by the driver team)

Comments

@kartikgupta2607

Please answer these questions before submitting your issue.
In order to accurately debug the issue this information is required. Thanks!

  1. What version of NodeJS driver are you using?
    1.9.3

  2. What operating system and processor architecture are you using?
    MacOS arm64

  3. What version of NodeJS are you using?
    (node --version and npm --version)
    node: 18.12.1, npm: 8.19.2

  4. What are the component versions in the environment (npm list)?
    NA

  5. Server version:
    8.9.1

  6. What did you do?

Issue Summary

While executing a PUT query to stage a large, compressed CSV file from the local file system to a Snowflake stage (S3 in my case), the memory usage of the snowflake-sdk grows sharply with the size of the file. During execution, the SDK performs several operations:

  1. Compression (if the file is not already compressed),
  2. SHA-256 Digest Calculation,
  3. AES Encryption,
  4. Upload to S3 (or other remote storage).

While these steps are necessary, the SDK's memory footprint scales with the file size, which appears to be due to the following:

  • Digest Calculation:

    • The SDK calculates the SHA-256 digest of the file by reading the entire file into memory (Ref code).

    • For large files, this leads to high memory consumption, which can cause memory-related issues or crashes.

    • Suggestion: Instead of loading the entire file into memory, the digest can be computed incrementally, updating the hash with each chunk as it is read from a stream. This keeps the memory footprint small and independent of the file size. (Crypto module Ref - hash.update() can be called many times with new data as it is streamed.) A sketch of this approach is shown after this list.

  • File Upload:

    • When the SDK attempts to upload the encrypted file to the remote storage provider (S3, GCS, Azure), it reads the entire file into memory synchronously (using readFileSync), which again leads to excessive memory consumption for large files. [Ref Code - S3, GCS, Azure]
    • Suggestion: The SDK should use streams (createReadStream) during the upload instead of reading the entire file into memory. Streaming the file to the storage provider would significantly reduce the memory overhead, especially for large files; see the second sketch after this list.
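
For illustration, here is a minimal sketch of the incremental-digest idea using only Node's built-in crypto and fs modules. The function name and the base64 output encoding are my assumptions for this example, not the driver's actual code:

```js
const crypto = require('crypto');
const fs = require('fs');

// Compute the SHA-256 digest of a file without buffering it in memory:
// the hash is updated chunk by chunk as the read stream emits data.
function hashFileSha256(filePath) {
  return new Promise((resolve, reject) => {
    const hash = crypto.createHash('sha256');
    const stream = fs.createReadStream(filePath);
    stream.on('data', (chunk) => hash.update(chunk));
    stream.on('error', reject);
    // 'base64' is assumed here; use whatever encoding the driver expects.
    stream.on('end', () => resolve(hash.digest('base64')));
  });
}
```

And here is a sketch of a streaming upload to S3 using @aws-sdk/lib-storage. The bucket, key, and helper name are placeholders; in the driver these values would come from the stage information returned for the PUT command, and the GCS and Azure SDKs offer similar stream-based APIs:

```js
const fs = require('fs');
const { S3Client } = require('@aws-sdk/client-s3');
const { Upload } = require('@aws-sdk/lib-storage');

// Stream the (already encrypted) local file to S3 instead of readFileSync.
async function uploadFileStreaming(s3Client, bucket, key, filePath) {
  const upload = new Upload({
    client: s3Client,
    params: {
      Bucket: bucket,
      Key: key,
      Body: fs.createReadStream(filePath), // streamed, never fully buffered
    },
  });
  await upload.done();
}

// Example usage (region/bucket/key/path are illustrative):
// await uploadFileStreaming(new S3Client({ region: 'us-east-1' }),
//   'my-stage-bucket', 'stage/prefix/data.csv.gz', '/tmp/data.csv.gz');
```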

Steps to Reproduce:

  • Prepare a large, compressed CSV file (e.g., several GB in size) [Example script to generate the necessary data file - file-gzip.txt].
  • Use the following script to execute a PUT query that uploads the file to a Snowflake stage (S3 in my case) [Script - Execute_PUT.txt]; a minimal equivalent script is sketched below.
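
Since the attached script is not inlined here, a minimal equivalent using the snowflake-sdk API might look like this (account, credentials, stage name, and file path are placeholders):

```js
const snowflake = require('snowflake-sdk');

// Minimal PUT reproduction script; all connection values are placeholders.
const connection = snowflake.createConnection({
  account: '<account_identifier>',
  username: '<user>',
  password: '<password>',
  warehouse: '<warehouse>',
  database: '<database>',
  schema: '<schema>',
});

connection.connect((err) => {
  if (err) throw err;
  connection.execute({
    // The file is already gzip-compressed, so AUTO_COMPRESS is disabled.
    sqlText: "PUT file:///tmp/large_file.csv.gz @~/test_stage AUTO_COMPRESS=FALSE",
    complete: (execErr, stmt, rows) => {
      if (execErr) throw execErr;
      console.log('PUT finished:', rows);
    },
  });
});
```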

While the query executes, monitor memory usage with Node.js process memory logging (process.memoryUsage()), clinic doctor, or any other memory profiling tool; a minimal logging sketch is shown below.
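
For example, a simple in-process logger (the interval and formatting are arbitrary choices for this sketch) makes the growth visible while the PUT runs:

```js
// Log resident set size and heap usage every 5 seconds during the PUT.
const timer = setInterval(() => {
  const { rss, heapUsed, external } = process.memoryUsage();
  const mb = (n) => (n / 1024 / 1024).toFixed(1) + ' MB';
  console.log(`rss=${mb(rss)} heapUsed=${mb(heapUsed)} external=${mb(external)}`);
}, 5000);
// Call clearInterval(timer) once the PUT statement completes.
```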

  7. What did you expect to see?
  • Ideally, the SDK should minimise memory consumption by using a streaming approach for both the digest calculation and the file upload steps. This would make handling large files far more efficient.

  8. Can you set logging to DEBUG and collect the logs?
    No

  9. What is your Snowflake account identifier, if any? (Optional)

@kartikgupta2607 kartikgupta2607 added the bug Something isn't working label Sep 30, 2024
@github-actions github-actions bot changed the title High Memory Usage during PUT Query execution for Large GZIP compressed CSV files SNOW-1703685: High Memory Usage during PUT Query execution for Large GZIP compressed CSV files Sep 30, 2024
@sfc-gh-dszmolka sfc-gh-dszmolka self-assigned this Sep 30, 2024
@sfc-gh-dszmolka sfc-gh-dszmolka added status-triage Issue is under initial triage and removed bug Something isn't working labels Sep 30, 2024
@sfc-gh-dszmolka sfc-gh-dszmolka added enhancement The issue is a request for improvement or a new feature status-triage_done Initial triage done, will be further handled by the driver team and removed status-triage Issue is under initial triage labels Sep 30, 2024
@sfc-gh-dszmolka
Collaborator

Thank you for raising this enhancement request with us; we'll consider it for the future roadmap (with no timeline commitment). Really appreciate the details and suggestions you provided!
