Skip to content

Conversation

@pgm
Copy link
Contributor

@pgm pgm commented Nov 5, 2024

Very small change to work around issue with pd.read_parquet(). (memory utilization exploding when trying to read a large-ish parquet file)

@pgm pgm changed the title workaround for memory problem reading parquet files. fix(breadbox): workaround for memory problem reading parquet files. Nov 5, 2024
# to > 30GB and would take down breadbox. Reading it in as a table, and then
# processing it column by column seems to avoid this problem. Not 100% sure that this
# results in identical behavior, but it seems valid for numerical matrices
table = pq.read_table(filename)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If this fixes the issue, I think it looks good and I feel good about you merging it (especially since it's needed for the release).

But I am curious... I found this github issue, suggesting the problem is the pyarrow engine that pandas uses. I'd be curious if you could use a simpler solution of just doing:

pandas.read_parquet(engine='fastparquet')

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The memory leak described there looks fairly different then how I'd characterize what I've observed. However, that's a good point about maybe the issue stems from pyarrow and not pandas.

I tried

pandas.read_parquet(engine='fastparquet')

and yes, that works fine. We'll have to add fastparquet as a dependency, but that sounds fine.

I'll switch the code to use that instead.

Copy link
Member

@snwessel snwessel left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good to me! I had one question/comment, but we can revisit it later if it's helpful to just merge this in now

@pgm pgm merged commit 066a806 into master Nov 6, 2024
15 checks passed
pgm added a commit that referenced this pull request Nov 7, 2024
…122)

* workaround for memory problem reading parquet files.

* Switched to using fastparquet

* fixed type check error
pgm added a commit that referenced this pull request Nov 7, 2024
…122)

* workaround for memory problem reading parquet files.

* Switched to using fastparquet

* fixed type check error
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants