
SNOW-1703685: High Memory Usage during PUT Query execution for Large GZIP compressed CSV files #922

Open
kartikgupta2607 opened this issue Sep 30, 2024 · 1 comment
Assignees: sfc-gh-dszmolka
Labels: enhancement (The issue is a request for improvement or a new feature), status-triage_done (Initial triage done, will be further handled by the driver team)

Comments

@kartikgupta2607

Please answer these questions before submitting your issue.
In order to accurately debug the issue this information is required. Thanks!

  1. What version of NodeJS driver are you using?
    1.9.3

  2. What operating system and processor architecture are you using?
    MacOS arm64

  3. What version of NodeJS are you using?
    (node --version and npm --version)
    node: 18.12.1, npm: 8.19.2

  4. What are the component versions in the environment (npm list)?
    NA

  5. Server version:
    8.9.1

  6. What did you do?

Issue Summary

While executing a PUT query to stage a large, compressed CSV file from the local file system to a Snowflake stage (S3 in my case), the memory usage of the snowflake-sdk grows sharply with the size of the file. During execution, the SDK performs several operations:

  1. Compression (if the file is not already compressed),
  2. SHA-256 Digest Calculation,
  3. AES Encryption,
  4. Upload to S3 (or other remote storage).

While these steps are necessary, the SDK's memory footprint scales with the file size, which appears to be due to the following:

  • Digest Calculation:

    • The SDK calculates the SHA-256 digest of the file by reading the entire file into memory (Ref code).

    • For large files, this leads to high memory consumption, which can cause memory-related issues or crashes.

    • Suggestion: Instead of loading the entire file into memory, the digest can be computed incrementally, updating the hash with each chunk as it is read from a stream. This keeps the memory footprint small and independent of the file size. (Crypto module Ref - hash.update() can be called many times with new data as it is streamed.) A sketch of this approach is shown after this list.

  • File Upload:

    • When the SDK attempts to upload the encrypted file to the remote storage provider (S3, GCS, Azure), it reads the entire file into memory synchronously (using readFileSync), which again leads to excessive memory consumption for large files. [Ref Code - S3, GCS, Azure]
    • Suggestion: The SDK should use streams (createReadStream) during the upload instead of reading the entire file into memory. Streaming the file to the storage provider would significantly reduce the memory overhead, especially for large files; see the second sketch after this list.
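
For illustration, here is a minimal sketch of the incremental-digest idea using only Node's built-in crypto and fs modules. The function name and the base64 output encoding are my assumptions for this example, not the driver's actual code:

```js
const crypto = require('crypto');
const fs = require('fs');

// Compute the SHA-256 digest of a file without buffering it in memory:
// the hash is updated chunk by chunk as the read stream emits data.
function hashFileSha256(filePath) {
  return new Promise((resolve, reject) => {
    const hash = crypto.createHash('sha256');
    const stream = fs.createReadStream(filePath);
    stream.on('data', (chunk) => hash.update(chunk));
    stream.on('error', reject);
    // 'base64' is assumed here; use whatever encoding the driver expects.
    stream.on('end', () => resolve(hash.digest('base64')));
  });
}
```

And here is a sketch of a streaming upload to S3 using @aws-sdk/lib-storage. The bucket, key, and helper name are placeholders; in the driver these values would come from the stage information returned for the PUT command, and the GCS and Azure SDKs offer similar stream-based APIs:

```js
const fs = require('fs');
const { S3Client } = require('@aws-sdk/client-s3');
const { Upload } = require('@aws-sdk/lib-storage');

// Stream the (already encrypted) local file to S3 instead of readFileSync.
async function uploadFileStreaming(s3Client, bucket, key, filePath) {
  const upload = new Upload({
    client: s3Client,
    params: {
      Bucket: bucket,
      Key: key,
      Body: fs.createReadStream(filePath), // streamed, never fully buffered
    },
  });
  await upload.done();
}

// Example usage (region/bucket/key/path are illustrative):
// await uploadFileStreaming(new S3Client({ region: 'us-east-1' }),
//   'my-stage-bucket', 'stage/prefix/data.csv.gz', '/tmp/data.csv.gz');
```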

Steps to Reproduce:

  • Prepare a large, compressed CSV file (e.g., several GB in size) [Example script to generate the necessary data file - file-gzip.txt].
  • Use the following script to execute a PUT query that uploads the file to a Snowflake stage (S3 in my case) [Script - Execute_PUT.txt]; a minimal equivalent script is sketched below.
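
Since the attached script is not inlined here, a minimal equivalent using the snowflake-sdk API might look like this (account, credentials, stage name, and file path are placeholders):

```js
const snowflake = require('snowflake-sdk');

// Minimal PUT reproduction script; all connection values are placeholders.
const connection = snowflake.createConnection({
  account: '<account_identifier>',
  username: '<user>',
  password: '<password>',
  warehouse: '<warehouse>',
  database: '<database>',
  schema: '<schema>',
});

connection.connect((err) => {
  if (err) throw err;
  connection.execute({
    // The file is already gzip-compressed, so AUTO_COMPRESS is disabled.
    sqlText: "PUT file:///tmp/large_file.csv.gz @~/test_stage AUTO_COMPRESS=FALSE",
    complete: (execErr, stmt, rows) => {
      if (execErr) throw execErr;
      console.log('PUT finished:', rows);
    },
  });
});
```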

While the query executes, monitor memory usage with Node.js process memory logging (process.memoryUsage()), clinic doctor, or any other memory profiling tool; a minimal logging sketch is shown below.
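
For example, a simple in-process logger (the interval and formatting are arbitrary choices for this sketch) makes the growth visible while the PUT runs:

```js
// Log resident set size and heap usage every 5 seconds during the PUT.
const timer = setInterval(() => {
  const { rss, heapUsed, external } = process.memoryUsage();
  const mb = (n) => (n / 1024 / 1024).toFixed(1) + ' MB';
  console.log(`rss=${mb(rss)} heapUsed=${mb(heapUsed)} external=${mb(external)}`);
}, 5000);
// Call clearInterval(timer) once the PUT statement completes.
```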

  7. What did you expect to see?
  • Ideally, the SDK should minimise memory consumption by using a streaming approach for both the digest calculation and the file upload steps. This would make handling large files far more efficient.

  8. Can you set logging to DEBUG and collect the logs?
    No

  9. What is your Snowflake account identifier, if any? (Optional)

@kartikgupta2607 kartikgupta2607 added the bug Something isn't working label Sep 30, 2024
@github-actions github-actions bot changed the title High Memory Usage during PUT Query execution for Large GZIP compressed CSV files SNOW-1703685: High Memory Usage during PUT Query execution for Large GZIP compressed CSV files Sep 30, 2024
@sfc-gh-dszmolka sfc-gh-dszmolka self-assigned this Sep 30, 2024
@sfc-gh-dszmolka sfc-gh-dszmolka added status-triage Issue is under initial triage and removed bug Something isn't working labels Sep 30, 2024
@sfc-gh-dszmolka sfc-gh-dszmolka added enhancement The issue is a request for improvement or a new feature status-triage_done Initial triage done, will be further handled by the driver team and removed status-triage Issue is under initial triage labels Sep 30, 2024
@sfc-gh-dszmolka
Collaborator

Thank you for raising this enhancement request with us; we'll consider it for the future roadmap (with no timeline commitment). Really appreciate the details and suggestions you provided!
