Reduce memory usage in delta format code paths #723

PatrickJin-db · 2025-05-13T00:02:42Z

Reduces memory consumption when using delta-kernel-rs by converting each batch into a pandas DataFrame and concatenating the results, rather than building a single pyarrow Table from all the batches and converting that table at once. The pyarrow Table usually remains in memory during the conversion to pandas, so peak memory is lower when smaller batches are converted one at a time instead of the entire table.

This change is gated behind the convert_in_batches param.

Tested with an integration test.

Also added some unit tests for #737 involving parquet format batch convert + limit.

linzhou-db

do we know the number of batches?

linzhou-db · 2025-05-29T02:52:18Z

python/delta_sharing/reader.py

@@ -130,8 +129,10 @@ def __to_pandas_kernel(self):
            schema = scan.execute(interface).schema
            return pd.DataFrame(columns=schema.names)

-        table = pa.Table.from_batches(scan.execute(interface))
-        result = table.to_pandas()
+        batches = scan.execute(interface)


isn't there a parameter controlling the batch behavior?

PatrickJin-db · 2025-05-29T17:27:26Z

> do we know the number of batches?

delta-kernel-rs uses a batch size of 1024 rows and the parquet format code path uses a batch size of 65536 by default. Should this be added to the docstring?
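
As a rough illustration of what those defaults imply, the sketch below computes the smallest number of batches a result could be split into. It assumes every batch holds at most the quoted number of rows; actual batches may be smaller, so the real count can be higher. The function name and the row count are hypothetical.

```python
import math

# Default batch sizes quoted in this thread (rows per batch).
KERNEL_BATCH_ROWS = 1024     # delta-kernel-rs
PARQUET_BATCH_ROWS = 65536   # parquet format code path


def min_batches(total_rows: int, batch_rows: int) -> int:
    """Lower bound on the batch count if each batch holds at most batch_rows rows."""
    return math.ceil(total_rows / batch_rows)


rows = 1_000_000  # hypothetical table size
kernel_batches = min_batches(rows, KERNEL_BATCH_ROWS)    # 977
parquet_batches = min_batches(rows, PARQUET_BATCH_ROWS)  # 16
```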

linzhou-db

otherwise looks good!

linzhou-db · 2025-05-29T20:11:27Z

python/delta_sharing/reader.py

                result = pd.DataFrame(columns=schema.names)
+            elif self._convert_in_batches:


do you think it's worth adding some logging of the batch behavior for debugging/verification purposes?

added a print statement for number of batches
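
If a print statement turns out to be too noisy, one alternative is to count batches with the standard `logging` module. This is only a sketch under that assumption, not what the PR does; the wrapper `log_batch_count` and the logger name are hypothetical.

```python
import logging
from typing import Iterable, Iterator, TypeVar

T = TypeVar("T")
logger = logging.getLogger("batch_debug")  # hypothetical logger name


def log_batch_count(batches: Iterable[T], label: str) -> Iterator[T]:
    """Yield each batch unchanged, then log how many batches were seen."""
    count = 0
    for batch in batches:
        count += 1
        yield batch
    # Runs only once the caller has consumed the whole iterable.
    logger.debug("%s: converted %d batches", label, count)


# Usage with placeholder items standing in for record batches.
consumed = list(log_batch_count(["b0", "b1", "b2"], "kernel"))
```

Because the wrapper is a generator, it adds no extra pass over the data and keeps no references to earlier batches.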

* fix python lint and reformat scripts (#668)
* Reduce memory usage in parquet format code path (#737)
* refactor: Resolve lint errors in python release script. (#660)
* Reduce memory usage in delta format code paths (#723)
* Update Python connector version to 1.3.3

---------

Co-authored-by: Kyle Chui <[email protected]>

PatrickJin-db force-pushed the PatrickJin-db/kernel-explicit-iterator-to_pandas-args branch from 0a141b3 to ade4ee3 Compare May 13, 2025 00:03

PatrickJin-db closed this May 13, 2025

PatrickJin-db reopened this May 13, 2025

PatrickJin-db marked this pull request as draft May 13, 2025 00:04

PatrickJin-db changed the title ~~Patrick jin db/kernel explicit iterator to pandas args~~ Reduce memory usage in delta format code path May 13, 2025

PatrickJin-db force-pushed the PatrickJin-db/kernel-explicit-iterator-to_pandas-args branch 2 times, most recently from d2748d0 to d99d480 Compare May 23, 2025 00:07

PatrickJin-db marked this pull request as ready for review May 23, 2025 00:09

PatrickJin-db changed the title ~~Reduce memory usage in delta format code path~~ Reduce memory usage in delta format code paths May 23, 2025

PatrickJin-db requested a review from linzhou-db May 24, 2025 02:52

PatrickJin-db force-pushed the PatrickJin-db/kernel-explicit-iterator-to_pandas-args branch from d99d480 to 09e1a38 Compare May 24, 2025 02:58

explicit batch iteration in kernel path

d664901

PatrickJin-db force-pushed the PatrickJin-db/kernel-explicit-iterator-to_pandas-args branch 2 times, most recently from 72f7bfc to bc0cc4b Compare May 29, 2025 16:56

linzhou-db reviewed May 29, 2025

PatrickJin-db requested a review from linzhou-db May 29, 2025 17:36

linzhou-db approved these changes May 29, 2025

PatrickJin-db force-pushed the PatrickJin-db/kernel-explicit-iterator-to_pandas-args branch 2 times, most recently from 1e40d45 to 492b0b9 Compare May 29, 2025 20:42

add args to to_pandas, update cdf path, gate behind convert_in_batches

6679f11

PatrickJin-db force-pushed the PatrickJin-db/kernel-explicit-iterator-to_pandas-args branch from 492b0b9 to 6679f11 Compare May 29, 2025 20:49

PatrickJin-db merged commit afb25b1 into delta-io:main May 29, 2025
5 of 6 checks passed

PatrickJin-db added a commit to PatrickJin-db/delta-sharing that referenced this pull request May 29, 2025

Reduce memory usage in delta format code paths (delta-io#723)

03f1665

PatrickJin-db mentioned this pull request May 29, 2025

Release Python Connector 1.3.3 #740

Merged

PatrickJin-db added a commit to PatrickJin-db/delta-sharing that referenced this pull request May 29, 2025

Reduce memory usage in delta format code paths (delta-io#723)

6c8e92a
