Investigate and optimize memory handling during large file uploads to Redshift Spectrum #1210

@Rafalz13

Description

We previously attempted to manage memory usage during large file uploads with `del` and `gc.collect()`, but the issue persisted. This suggests the problem is caused by lingering references to DataFrames, possibly within the `df_to_redshift_spectrum` function. The issue was observed primarily with very large files; smaller batches (a few GB) loaded without problems. There is also a risk that the same issue will recur if we need to reload the large file. The current hypothesis is that repeated chunk rotations fill up memory because references are not properly released, either within pandas itself or in our own implementation.
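To illustrate why `del` plus `gc.collect()` can fail here, consider a minimal sketch (the `Chunk` class and `upload_chunk` helper are hypothetical stand-ins, not the project's code): if any helper quietly retains a reference to each chunk, deleting the caller's name frees nothing.

```python
import gc
import weakref

# Hypothetical stand-in for a DataFrame chunk, so the sketch runs without pandas.
class Chunk:
    def __init__(self, rows):
        self.rows = rows

cache = []  # a lingering reference, e.g. a module-level list inside an upload helper

def upload_chunk(chunk):
    # Bug pattern: the helper quietly keeps a reference to every chunk it sees.
    cache.append(chunk)

chunk = Chunk(list(range(1000)))
ref = weakref.ref(chunk)  # lets us observe whether the object is actually freed
upload_chunk(chunk)

del chunk
gc.collect()
# The chunk is NOT freed: `cache` still holds it, so del/gc.collect() is a no-op.
assert ref() is not None

cache.clear()
gc.collect()
# Once the hidden reference is dropped, the object is collected.
assert ref() is None
```

If `df_to_redshift_spectrum` (or anything it calls) keeps chunks alive this way, memory will grow with every chunk rotation regardless of explicit deletion at the call site.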

Goal

Identify the root cause of memory leaks during chunked DataFrame processing and implement fixes to ensure stable handling of large file uploads.
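One way to locate the root cause is to track allocated memory across chunk iterations with `tracemalloc`; steady growth after each chunk is the leak signature. A minimal sketch (the `leaked` cache and `process_chunk` are simulated, standing in for the real upload path):

```python
import tracemalloc

leaked = []  # simulated hidden cache that retains every chunk

def process_chunk(chunk):
    leaked.append(chunk)  # simulated bug: the chunk is never released

tracemalloc.start()
sizes = []
for _ in range(5):
    process_chunk(list(range(50_000)))
    # Traced size after each iteration; a healthy pipeline stays roughly flat.
    sizes.append(tracemalloc.get_traced_memory()[0])
tracemalloc.stop()

# Monotonic growth across iterations points at references that are never freed.
assert all(b > a for a, b in zip(sizes, sizes[1:]))
```

Running the same measurement around the real chunk loop, and inspecting `tracemalloc.take_snapshot().statistics("lineno")` at peak, should show which line is accumulating the DataFrame references.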

Metadata

Assignees

No one assigned

    Labels

    bug (Something isn't working), python (Pull requests that update Python code)

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests