Reduce memory usage in delta format code paths #723

PatrickJin-db
Collaborator

@PatrickJin-db PatrickJin-db commented May 13, 2025

Reduces peak memory consumption when using delta-kernel-rs by converting each record batch into a pandas DataFrame and concatenating the results, rather than building a single pyarrow Table from all the batches and converting that whole table at once. The pyarrow Table usually remains in memory while it is being converted to pandas, so converting small batches one at a time keeps less Arrow data alive alongside its pandas copy than converting the entire table.

This change is gated behind the convert_in_batches param.

Tested with an integration test.

Also added unit tests for #737 covering batch conversion combined with a limit in the parquet format code path.

@PatrickJin-db PatrickJin-db force-pushed the PatrickJin-db/kernel-explicit-iterator-to_pandas-args branch from 0a141b3 to ade4ee3 Compare May 13, 2025 00:03
@PatrickJin-db PatrickJin-db reopened this May 13, 2025
@PatrickJin-db PatrickJin-db marked this pull request as draft May 13, 2025 00:04
@PatrickJin-db PatrickJin-db changed the title from "Patrick jin db/kernel explicit iterator to pandas args" to "Reduce memory usage in delta format code path" May 13, 2025
@PatrickJin-db PatrickJin-db force-pushed the PatrickJin-db/kernel-explicit-iterator-to_pandas-args branch 2 times, most recently from d2748d0 to d99d480 Compare May 23, 2025 00:07
@PatrickJin-db PatrickJin-db marked this pull request as ready for review May 23, 2025 00:09
@PatrickJin-db PatrickJin-db changed the title from "Reduce memory usage in delta format code path" to "Reduce memory usage in delta format code paths" May 23, 2025
@PatrickJin-db PatrickJin-db requested a review from linzhou-db May 24, 2025 02:52
@PatrickJin-db PatrickJin-db force-pushed the PatrickJin-db/kernel-explicit-iterator-to_pandas-args branch from d99d480 to 09e1a38 Compare May 24, 2025 02:58
@PatrickJin-db PatrickJin-db force-pushed the PatrickJin-db/kernel-explicit-iterator-to_pandas-args branch 2 times, most recently from 72f7bfc to bc0cc4b Compare May 29, 2025 16:56
Collaborator

@linzhou-db linzhou-db left a comment

do we know the number of batches?

@@ -130,8 +129,10 @@ def __to_pandas_kernel(self):
 schema = scan.execute(interface).schema
 return pd.DataFrame(columns=schema.names)

-table = pa.Table.from_batches(scan.execute(interface))
-result = table.to_pandas()
+batches = scan.execute(interface)
Collaborator

isn't there a parameter controlling the batch behavior?

Collaborator Author

Added

@PatrickJin-db
Collaborator Author

> do we know the number of batches?

delta-kernel-rs uses a batch size of 1024 rows and the parquet format code path uses a batch size of 65536 by default. Should this be added to the docstring?
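
As a rough worked example of what those batch sizes imply, the sketch below estimates the batch count for a result set. The row count is made up, and the estimate assumes every batch except the last is full; real scans may emit smaller batches (for example at file boundaries), so the actual count can be higher.

```python
import math

KERNEL_BATCH_ROWS = 1024     # delta-kernel-rs batch size quoted above
PARQUET_BATCH_ROWS = 65536   # parquet format code path default quoted above


def estimated_batch_count(total_rows, batch_rows):
    """Estimate the number of batches, assuming all but the last are full."""
    return math.ceil(total_rows / batch_rows)


rows = 1_000_000  # hypothetical table size
kernel_batches = estimated_batch_count(rows, KERNEL_BATCH_ROWS)
parquet_batches = estimated_batch_count(rows, PARQUET_BATCH_ROWS)
print(kernel_batches, parquet_batches)  # 977 16
```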

@PatrickJin-db PatrickJin-db requested a review from linzhou-db May 29, 2025 17:36
Collaborator

@linzhou-db linzhou-db left a comment

otherwise looks good!

result = pd.DataFrame(columns=schema.names)
elif self._convert_in_batches:
Collaborator

do you think it's worth adding some logging on the batch behavior for debugging/verification purposes?

Collaborator Author

added a print statement for number of batches
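
For comparison with a print statement, here is a hedged sketch of counting batches through the standard `logging` module. The logger name and the `count_batches` wrapper are hypothetical and do not reflect what was merged.

```python
import logging

# Hypothetical logger name, chosen only for this example.
logger = logging.getLogger("delta_sharing_example")


def count_batches(batches, label):
    """Yield batches unchanged and log how many were seen once exhausted."""
    count = 0
    for batch in batches:
        count += 1
        yield batch
    # Runs only after the consumer has iterated through every batch.
    logger.debug("%s: converted %d batches", label, count)
```

Wrapping the batch iterator this way keeps the count out of the conversion logic, and the message can be silenced or enabled through the usual logging configuration instead of editing the code.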

@PatrickJin-db PatrickJin-db force-pushed the PatrickJin-db/kernel-explicit-iterator-to_pandas-args branch 2 times, most recently from 1e40d45 to 492b0b9 Compare May 29, 2025 20:42
@PatrickJin-db PatrickJin-db force-pushed the PatrickJin-db/kernel-explicit-iterator-to_pandas-args branch from 492b0b9 to 6679f11 Compare May 29, 2025 20:49
@PatrickJin-db PatrickJin-db merged commit afb25b1 into delta-io:main May 29, 2025
5 of 6 checks passed
PatrickJin-db added a commit to PatrickJin-db/delta-sharing that referenced this pull request May 29, 2025
PatrickJin-db added a commit to PatrickJin-db/delta-sharing that referenced this pull request May 29, 2025
PatrickJin-db added a commit that referenced this pull request May 29, 2025
* fix python lint and reformat scripts (#668)

* Reduce memory usage in parquet format code path (#737)

* refactor: Resolve lint errors in python release script. (#660)

* Reduce memory usage in delta format code paths (#723)

* Update Python connector version to 1.3.3

---------

Co-authored-by: Kyle Chui <[email protected]>