fix(breadbox): workaround for memory problem reading parquet files. #122
Conversation
# to > 30GB and would take down breadbox. Reading it in as a table, and then
# processing it column by column seems to avoid this problem. Not 100% sure that this
# results in identical behavior, but it seems valid for numerical matrices
table = pq.read_table(filename)
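For context, a minimal sketch of the column-by-column approach the diff describes, assuming pq is pyarrow.parquet and the input is a numeric matrix (filename is a placeholder path, not from the actual change):

import pyarrow.parquet as pq
import pandas as pd

filename = "matrix.parquet"  # hypothetical path

# Read the file as an Arrow table instead of going through
# pd.read_parquet(), which was ballooning memory to > 30GB.
table = pq.read_table(filename)

# Convert one column at a time so only a single column's worth of
# pandas/numpy data is materialized at each step.
columns = {}
for name in table.column_names:
    columns[name] = table.column(name).to_numpy()

df = pd.DataFrame(columns)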
If this fixes the issue, I think it looks good and I feel good about you merging it (especially since it's needed for the release).
But I am curious... I found a GitHub issue suggesting the problem is in the pyarrow engine that pandas uses. I'd be curious whether the simpler solution of just doing:
pandas.read_parquet(filename, engine='fastparquet')
would work.
The memory leak described there looks fairly different from how I'd characterize what I've observed. However, that's a good point that the issue may stem from pyarrow rather than pandas.
I tried
pandas.read_parquet(filename, engine='fastparquet')
and yes, that works fine. We'll have to add fastparquet as a dependency, but that seems reasonable.
I'll switch the code to use that instead.
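A minimal sketch of what that switch might look like (filename is a placeholder; fastparquet must be installed as a separate package alongside pandas):

import pandas as pd

filename = "matrix.parquet"  # hypothetical path

# Bypass the default pyarrow engine, the suspected source of the
# memory blow-up, and let fastparquet do the reading instead.
df = pd.read_parquet(filename, engine="fastparquet")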
Looks good to me! I had one question/comment, but we can revisit it later if it's helpful to just merge this in now.
…122)
* workaround for memory problem reading parquet files.
* Switched to using fastparquet
* fixed type check error
Very small change to work around an issue with pd.read_parquet() (memory utilization exploding when trying to read a large-ish parquet file).