Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fix(breadbox): workaround for memory problem reading parquet files. #122

Merged
merged 3 commits into from
Nov 6, 2024

Conversation

pgm
Copy link
Contributor

@pgm pgm commented Nov 5, 2024

Very small change to work around issue with pd.read_parquet(). (memory utilization exploding when trying to read a large-ish parquet file)

@pgm pgm changed the title workaround for memory problem reading parquet files. fix(breadbox): workaround for memory problem reading parquet files. Nov 5, 2024
# to > 30GB and would take down breadbox. Reading it in as a table, and then
# processing it column by column seems to avoid this problem. Not 100% sure that this
# results in identical behavior, but it seems valid for numerical matrices
table = pq.read_table(filename)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If this fixes the issue, I think it looks good and I feel good about you merging it (especially since it's needed for the release).

But I am curious... I found this github issue, suggesting the problem is the pyarrow engine that pandas uses. I'd be curious if you could use a simpler solution of just doing:

pandas.read_parquet(engine='fastparquet')

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The memory leak described there looks fairly different then how I'd characterize what I've observed. However, that's a good point about maybe the issue stems from pyarrow and not pandas.

I tried

pandas.read_parquet(engine='fastparquet')

and yes, that works fine. We'll have to add fastparquet as a dependency, but that sounds fine.

I'll switch the code to use that instead.

Copy link
Member

@snwessel snwessel left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good to me! I had one question/comment, but we can revisit it later if it's helpful to just merge this in now

@pgm pgm merged commit 066a806 into master Nov 6, 2024
15 checks passed
pgm added a commit that referenced this pull request Nov 7, 2024
…122)

* workaround for memory problem reading parquet files.

* Switched to using fastparquet

* fixed type check error
pgm added a commit that referenced this pull request Nov 7, 2024
…122)

* workaround for memory problem reading parquet files.

* Switched to using fastparquet

* fixed type check error
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants