Reduce memory usage in delta format code paths #723
Conversation
do we know the number of batches?
@@ -130,8 +129,10 @@ def __to_pandas_kernel(self):
        schema = scan.execute(interface).schema
        return pd.DataFrame(columns=schema.names)

-   table = pa.Table.from_batches(scan.execute(interface))
-   result = table.to_pandas()
+   batches = scan.execute(interface)
isn't there a parameter controlling the batch behavior?
Added
delta-kernel-rs uses a batch size of 1024 rows and the parquet format code path uses a batch size of 65536 by default. Should this be added to the docstring?
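As a rough illustration of how these default batch sizes translate into a number of batches for a given table (the row count below is a hypothetical example, not from the PR):

```python
import math

def num_batches(num_rows: int, batch_size: int) -> int:
    """Number of record batches produced for num_rows at a given batch size."""
    return math.ceil(num_rows / batch_size)

# For a hypothetical 1,000,000-row table:
kernel_batches = num_batches(1_000_000, 1024)    # delta-kernel-rs default
parquet_batches = num_batches(1_000_000, 65536)  # parquet code path default
```

So the kernel path yields far more, much smaller batches than the parquet path, which is why peak memory during per-batch conversion differs between the two.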
otherwise looks good!
    result = pd.DataFrame(columns=schema.names)
elif self._convert_in_batches:
do you think it's worth adding some logging on the batch behavior for debugging/verification purposes?
Added a print statement for the number of batches.
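The suggested debug output could also go through the standard `logging` module rather than a print statement. A minimal sketch (the function and message here are hypothetical, not from the PR):

```python
import logging

logger = logging.getLogger(__name__)

def count_and_log(batches):
    """Materialize the batch iterator and log how many batches were
    produced, for debugging/verification of the batch behavior."""
    batch_list = list(batches)
    logger.debug("scan produced %d record batches", len(batch_list))
    return batch_list

# Usage with any iterable of record batches, e.g.:
# batch_list = count_and_log(scan.execute(interface))
dummy = count_and_log([object(), object()])
```

Note that materializing the iterator into a list trades a little memory for the ability to report the count up front; logging inside the conversion loop would avoid that.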
* fix python lint and reformat scripts (#668)
* Reduce memory usage in parquet format code path (#737)
* refactor: Resolve lint errors in python release script. (#660)
* Reduce memory usage in delta format code paths (#723)
* Update Python connector version to 1.3.3

Co-authored-by: Kyle Chui <[email protected]>
Reduces memory consumption when using delta-kernel-rs by converting each record batch into a pandas DataFrame and concatenating the results, rather than building a single pyarrow table from all the batches and converting that whole table at once. The pyarrow table typically remains in memory for the duration of the conversion to pandas, so converting smaller batches keeps peak memory lower than converting the entire table.
This change is gated behind the convert_in_batches param. Tested with an integration test.
Also added some unit tests for #737 covering the parquet format batch conversion combined with a limit.