Implement a Vec<RecordBatch> wrapper for pyarrow.Table convenience
#8790
Conversation
Historically the attitude of this crate has been to avoid "Table" constructs to push users towards streaming approaches.
I don't know what the stance of maintainers is towards including a Table construct for python integration.
FWIW, if you wanted to look at external crates, PyTable exists and probably does what you want (disclosure: it's my project). Alternatively, it might give you ideas for how to handle the Table here if you still want to do that. (It's a separate crate for these reasons.)
```rust
pub fn try_new(record_batches: Vec<RecordBatch>) -> Result<Self, ArrowError> {
    let schema = record_batches[0].schema();
```
An Arrow table can be empty with no batches. It would probably be more reliable to store both batches and a standalone schema.
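The suggestion above can be illustrated with a stdlib-only Python sketch. Everything here is a hypothetical stand-in (the real types are arrow-rs types and the constructor is Rust); it only shows why carrying the schema alongside the batches makes the zero-batch case well-defined, whereas indexing `record_batches[0]` does not:

```python
from dataclasses import dataclass
from typing import Any

@dataclass
class Batch:
    # Hypothetical stand-in for a RecordBatch: a schema plus some rows.
    schema: tuple[str, ...]
    rows: list[Any]

class Table:
    def __init__(self, schema: tuple[str, ...], batches: list[Batch]):
        # Store the schema separately from the batches, so an empty
        # table (zero batches) still knows its schema.
        self.schema = schema
        self.batches = batches

    @classmethod
    def try_new(cls, schema: tuple[str, ...], batches: list[Batch]) -> "Table":
        # Validate every batch instead of reading batches[0], which
        # would fail on an empty list.
        for b in batches:
            if b.schema != schema:
                raise ValueError("all batches must share the table schema")
        return cls(schema, batches)

# An empty table is fine: the schema survives without any batches.
empty = Table.try_new(("id", "name"), [])
```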
Thanks @kylebarron for your very quick review! ❤️
Yes, I'm also not too sure about it; that's why I just sketched out a rough implementation without tests so far. At least I personally think having such a wrapper could be nice, since it simplifies things a bit when you already have all the batches anyway. One concrete reason:

Slightly nicer Python workflow

In our very specific example, we have a Python class with a function such as this one:

```python
class ParquetFile:
    def read_row_group(self, index: int) -> pyarrow.RecordBatch: ...
```

In the issue I linked, this unfortunately breaks down for a specific Parquet file, since a particular row group isn't expressible as a single `RecordBatch`. The latter comes with some syntactic shortcomings in contexts where you want to apply `to_pylist()`:

```python
rg: pyarrow.RecordBatch | Iterator[pyarrow.RecordBatch] = ParquetFile(...).read_row_group(0)
python_objs: list[dict[str, Any]]
if isinstance(rg, pyarrow.RecordBatch):
    python_objs = rg.to_pylist()
else:
    python_objs = list(itertools.chain.from_iterable(batch.to_pylist() for batch in rg))
```

With a `pyarrow.Table` return type instead:

```python
rg: pyarrow.RecordBatch | pyarrow.Table = ParquetFile(...).read_row_group(0)
python_objs: list[dict[str, Any]] = rg.to_pylist()
```

And just for clarity: we unfortunately need the entire row group deserialized as Python objects, because the data ingestion pipelines that consume this expect access to the entire row group in bulk, so streaming approaches are sadly not usable.
Yes, in general, I much prefer the approach of
Rationale for this change
When dealing with Parquet files that have an exceedingly large amount of Binary or UTF8 data in one row group, there can be issues when returning a single `RecordBatch` because of index overflows (#7973).

In `pyarrow` this is usually solved by representing the data as a `pyarrow.Table` object whose columns are `ChunkedArray`s, which are basically just lists of Arrow arrays; alternatively, the `pyarrow.Table` can be seen as a representation of a list of `RecordBatch`es.

I'd like to build a function in PyO3 that returns a `pyarrow.Table`, very similar to pyarrow's read_row_group method. With that, we could have feature parity with `pyarrow` in circumstances of potential index overflows without resorting to type changes (such as reading the data as `LargeString` or `StringView` columns).

Currently, as far as I see, there is no way in `arrow-pyarrow` to export a `pyarrow.Table` directly. In particular, convenience methods from `Vec<RecordBatch>` seem to be missing. This PR implements a convenience wrapper that allows directly exporting a `pyarrow.Table`.

What changes are included in this PR?
A new struct `Table` is added to the `arrow-pyarrow` crate, which can be constructed from a `Vec<RecordBatch>` or from an `ArrowArrayStreamReader`.

It implements `FromPyArrow` and `IntoPyArrow`. `FromPyArrow` supports anything that either implements the Arrow stream protocol, is a `RecordBatchReader`, or has a `to_reader()` method returning one; `pyarrow.Table` does both of these things. `IntoPyArrow` will result in a `pyarrow.Table` on the Python side, constructed through `pyarrow.Table.from_batches(...)`.

Are these changes tested?
No, not yet. Please let me know whether you are in general fine with this PR, then I'll work on tests. So far I only tested it locally with very simple PyO3 dummy functions basically doing a round trip, and with them everything worked. The real tests of course would be much more sophisticated than that.
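For readers unfamiliar with the acceptance rule `FromPyArrow` uses here, a pure-Python mirror of the dispatch may help. `resolve_reader` and the dummy classes below are hypothetical illustrations (the real conversion happens in Rust); only the `__arrow_c_stream__` protocol method name and `to_reader()` come from the actual pyarrow API:

```python
from typing import Any

def resolve_reader(obj: Any) -> Any:
    # Mirrors the described FromPyArrow dispatch: either the object
    # itself speaks the Arrow PyCapsule stream protocol, or it offers
    # a to_reader() method returning something that does.
    if hasattr(obj, "__arrow_c_stream__"):
        return obj
    if hasattr(obj, "to_reader"):
        reader = obj.to_reader()
        if hasattr(reader, "__arrow_c_stream__"):
            return reader
    raise TypeError("cannot obtain an Arrow stream from this object")

# Dummy stand-ins for pyarrow objects:
class DummyReader:
    def __arrow_c_stream__(self, requested_schema=None): ...

class DummyTable:
    # Like pyarrow.Table, exposes to_reader(); a real pyarrow.Table
    # additionally implements __arrow_c_stream__ itself.
    def to_reader(self):
        return DummyReader()
```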
Are there any user-facing changes?
A new `Table` convenience wrapper is added!