fix(breadbox): workaround for memory problem reading parquet files. #122
Conversation
# to > 30GB and would take down breadbox. Reading it in as a table, and then
# processing it column by column seems to avoid this problem. Not 100% sure that this
# results in identical behavior, but it seems valid for numerical matrices
table = pq.read_table(filename)
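For context, a minimal sketch of what this column-by-column approach might look like (the function name and the exact DataFrame assembly here are illustrative assumptions, not the code from this PR):

```python
import pandas as pd
import pyarrow.parquet as pq

def read_matrix_parquet(filename: str) -> pd.DataFrame:
    # Read the file as a pyarrow Table first, rather than calling
    # pd.read_parquet() on the whole file, which is where memory blew up.
    table = pq.read_table(filename)
    # Convert one column (a ChunkedArray) at a time instead of converting
    # the entire table to pandas in a single call.
    columns = {name: table.column(name).to_pandas() for name in table.column_names}
    return pd.DataFrame(columns)
```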
If this fixes the issue, I think it looks good and I feel good about you merging it (especially since it's needed for the release).
But I am curious... I found this GitHub issue suggesting the problem is in the pyarrow engine that pandas uses by default. I'd be curious whether you could use a simpler solution of just doing:
pandas.read_parquet(filename, engine='fastparquet')
The memory leak described there looks fairly different from how I'd characterize what I've observed. However, that's a good point that the issue may stem from pyarrow rather than pandas.
I tried
pandas.read_parquet(filename, engine='fastparquet')
and yes, that works fine. We'll have to add fastparquet as a dependency, but that sounds fine.
I'll switch the code to use that instead.
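With that switch, the read would presumably reduce to something like the sketch below (the filename is assumed for illustration; fastparquet itself is the new dependency, installable with e.g. pip install fastparquet):

```python
import pandas as pd

# Use the fastparquet engine instead of the default pyarrow engine, which
# avoids the memory blow-up observed when reading large parquet files.
df = pd.read_parquet(filename, engine="fastparquet")
```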
snwessel left a comment
Looks good to me! I had one question/comment, but we can revisit it later if it's helpful to just merge this in now.
fix(breadbox): workaround for memory problem reading parquet files. (#122)
* workaround for memory problem reading parquet files.
* Switched to using fastparquet
* fixed type check error
Very small change to work around an issue with pd.read_parquet() (memory utilization exploding when trying to read a large-ish parquet file).