fix(breadbox): workaround for memory problem reading parquet files. #122

pgm · 2024-11-05T20:25:03Z

Very small change to work around an issue with pd.read_parquet(): memory utilization explodes when trying to read a large-ish parquet file.

snwessel · 2024-11-05T21:13:53Z

breadbox/breadbox/io/data_validation.py

+    # to > 30GB and would take down breadbox. Reading it in as a table, and then
+    # processing it column by column seems to avoid this problem. Not 100% sure that this
+    # results in identical behavior, but it seems valid for numerical matrices
+    table = pq.read_table(filename)


If this fixes the issue, I think it looks good and I feel good about you merging it (especially since it's needed for the release).

But I am curious... I found this GitHub issue suggesting the problem is the pyarrow engine that pandas uses. Could you use a simpler solution of just doing:

pandas.read_parquet(filename, engine='fastparquet')

pgm · 2024-11-05

The memory leak described there looks fairly different from how I'd characterize what I've observed. However, that's a good point: the issue may stem from pyarrow and not pandas.

I tried

pandas.read_parquet(filename, engine='fastparquet')

and yes, that works fine. We'll have to add fastparquet as a dependency, but that sounds fine.

I'll switch the code to use that instead.

snwessel

Looks good to me! I had one question/comment, but we can revisit it later if it's helpful to just merge this in now.


workaround for memory problem reading parquet files.

c20e4cb

pgm changed the title ~~workaround for memory problem reading parquet files.~~ fix(breadbox): workaround for memory problem reading parquet files. Nov 5, 2024

snwessel reviewed Nov 5, 2024


snwessel approved these changes Nov 5, 2024


pgm added 2 commits November 5, 2024 17:05

Switched to using fastparquet

c8b4f5c

fixed type check error

fa48279

pgm merged commit 066a806 into master Nov 6, 2024
15 checks passed

pgm added a commit that referenced this pull request Nov 7, 2024

fix(breadbox): workaround for memory problem reading parquet files. (#…

9575926

…122) * workaround for memory problem reading parquet files. * Switched to using fastparquet * fixed type check error

pgm added a commit that referenced this pull request Nov 7, 2024

fix(breadbox): workaround for memory problem reading parquet files. (#…

4d19774

…122) * workaround for memory problem reading parquet files. * Switched to using fastparquet * fixed type check error
