
Conversation

@jonded94 (Contributor) commented Nov 5, 2025

Rationale for this change

When dealing with Parquet files that have an exceedingly large amount of Binary or UTF8 data in one row group, returning the data as a single RecordBatch can fail because of index overflows (#7973).

In pyarrow this is usually solved by representing the data as a pyarrow.Table object whose columns are ChunkedArrays, which are essentially just lists of Arrow Arrays; equivalently, a pyarrow.Table can be seen as a representation of a list of RecordBatches.

I'd like to build a function in PyO3 that returns a pyarrow.Table, very similar to pyarrow's read_row_group method. With that, we could reach feature parity with pyarrow in circumstances of potential index overflows, without resorting to type changes (such as reading the data as LargeString or StringView columns).
Currently, as far as I can see, there is no way in arrow-pyarrow to export a pyarrow.Table directly; in particular, convenience methods for Vec<RecordBatch> seem to be missing. This PR implements a convenience wrapper that allows exporting a pyarrow.Table directly.

What changes are included in this PR?

A new struct Table is added to the arrow-pyarrow crate. It can be constructed from Vec<RecordBatch> or from an ArrowArrayStreamReader, and it implements FromPyArrow and IntoPyArrow.

FromPyArrow supports anything that either implements the Arrow stream protocol / is a RecordBatchReader, or that has a to_reader() method returning such a reader; pyarrow.Table supports both. IntoPyArrow results in a pyarrow.Table on the Python side, constructed through pyarrow.Table.from_batches(...); see the sketch below.
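
To make the shape concrete, here is a minimal sketch of what such a wrapper could look like. The field names, module paths, and trait plumbing are illustrative reconstructions on my part, not the exact code in this PR:

use arrow_array::ffi_stream::ArrowArrayStreamReader;
use arrow_array::{RecordBatch, RecordBatchReader};
use arrow_pyarrow::{FromPyArrow, IntoPyArrow};
use arrow_schema::{ArrowError, SchemaRef};
use pyo3::exceptions::PyValueError;
use pyo3::prelude::*;

/// An ordered list of RecordBatches sharing one schema, mirroring
/// what a pyarrow.Table represents.
pub struct Table {
    record_batches: Vec<RecordBatch>,
    schema: SchemaRef,
}

impl FromPyArrow for Table {
    fn from_pyarrow_bound(value: &Bound<PyAny>) -> PyResult<Self> {
        // pyarrow.Table exposes to_reader(); otherwise treat the
        // object itself as an Arrow stream.
        let source = if value.hasattr("to_reader")? {
            value.call_method0("to_reader")?
        } else {
            value.clone()
        };
        let reader = ArrowArrayStreamReader::from_pyarrow_bound(&source)?;
        let schema = reader.schema();
        let record_batches = reader
            .collect::<Result<Vec<_>, ArrowError>>()
            .map_err(|e| PyValueError::new_err(e.to_string()))?;
        Ok(Self { record_batches, schema })
    }
}

impl IntoPyArrow for Table {
    fn into_pyarrow(self, py: Python) -> PyResult<PyObject> {
        // Export each batch individually, then assemble the table on
        // the Python side via pyarrow.Table.from_batches(...).
        let batches = self
            .record_batches
            .into_iter()
            .map(|rb| rb.into_pyarrow(py))
            .collect::<PyResult<Vec<_>>>()?;
        let table_cls = py.import("pyarrow")?.getattr("Table")?;
        Ok(table_cls.call_method1("from_batches", (batches,))?.unbind())
    }
}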

Are these changes tested?

No, not yet. Please let me know whether you are generally fine with this PR, then I'll work on tests. So far I have only tested it locally with very simple PyO3 dummy functions that basically do a round-trip, and with them everything worked:

use arrow_array::RecordBatch;
use arrow_pyarrow::{PyArrowType, Table}; // `Table` being the new wrapper from this PR
use pyo3::prelude::*;

/// Accepts a pyarrow.Table and hands it straight back.
#[pyfunction]
pub fn roundtrip_table(table: PyArrowType<Table>) -> PyArrowType<Table> {
    table
}

/// Builds a pyarrow.Table from a list of pyarrow.RecordBatches.
#[pyfunction]
pub fn build_table(record_batches: Vec<PyArrowType<RecordBatch>>) -> PyArrowType<Table> {
    PyArrowType(Table::try_new(record_batches.into_iter().map(|rb| rb.0).collect()).unwrap())
}

Running these from Python:

>>> import pyo3parquet
>>> import pyarrow
>>> table = pyarrow.Table.from_pylist([{"foo": 1}])
>>> pyo3parquet.roundtrip_table(table)
pyarrow.Table
foo: int64
----
foo: [[1]]
>>> pyo3parquet.build_table(table.to_batches())
pyarrow.Table
foo: int64
----
foo: [[1]]

The real tests of course would be much more sophisticated than just this.

Are there any user-facing changes?

A new Table convenience wrapper is added!

@kylebarron (Member) left a comment

Historically the attitude of this crate has been to avoid "Table" constructs to push users towards streaming approaches.

I don't know what the stance of maintainers is towards including a Table construct for python integration.

FWIW, if you wanted to look at external crates, PyTable exists and probably does what you want (disclosure: it's my project). That might alternatively give you ideas for how to handle the Table here if you still want to do that. (It's a separate crate for these reasons.)

}

pub fn try_new(record_batches: Vec<RecordBatch>) -> Result<Self, ArrowError> {
    let schema = record_batches[0].schema();
@kylebarron (Member) commented on this snippet:

An Arrow table can be empty with no batches. It would probably be more reliable to store both batches and a standalone schema.
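
A sketch of that suggestion, with the schema passed in explicitly so that a table with zero batches stays representable (names and error handling here are illustrative, not from the PR):

use arrow_array::RecordBatch;
use arrow_schema::{ArrowError, SchemaRef};

pub struct Table {
    record_batches: Vec<RecordBatch>,
    schema: SchemaRef,
}

impl Table {
    /// Takes the schema explicitly instead of reading it from the first
    /// batch, so `record_batches` may be empty; every batch is then
    /// validated against the given schema.
    pub fn try_new(
        schema: SchemaRef,
        record_batches: Vec<RecordBatch>,
    ) -> Result<Self, ArrowError> {
        for rb in &record_batches {
            if rb.schema() != schema {
                return Err(ArrowError::SchemaError(format!(
                    "batch schema {:?} does not match table schema {:?}",
                    rb.schema(),
                    schema
                )));
            }
        }
        Ok(Self { record_batches, schema })
    }
}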

@jonded94 (Contributor, Author) commented Nov 6, 2025

Thanks @kylebarron for your very quick review! ❤️

> Historically the attitude of this crate has been to avoid "Table" constructs to push users towards streaming approaches.
>
> I don't know what the stance of maintainers is towards including a Table construct for python integration.

Yes, I'm also not too sure about it; that's why I have only sketched out a rough implementation without tests so far. One reason I think this could be nice to have in arrow-pyarrow is that the documentation itself mentions that there is no equivalent concept to pyarrow.Table and that one has to use slight workarounds:

> PyArrow has the notion of chunked arrays and tables, but arrow-rs doesn't have these same concepts. A chunked table is instead represented with Vec<RecordBatch>. A pyarrow.Table can be imported to Rust by calling pyarrow.Table.to_reader() and then importing the reader as an ArrowArrayStreamReader.
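
On the Rust side, that documented workaround looks roughly like the following sketch (consume_table is a hypothetical function name, and the module paths are approximate):

use arrow::ffi_stream::ArrowArrayStreamReader;
use arrow::pyarrow::FromPyArrow;
use arrow_array::RecordBatch;
use arrow_schema::ArrowError;
use pyo3::exceptions::PyValueError;
use pyo3::prelude::*;

/// Called from Python as `consume_table(table.to_reader())`: the
/// "chunked table" arrives as a stream and becomes a Vec<RecordBatch>.
#[pyfunction]
fn consume_table(reader: &Bound<'_, PyAny>) -> PyResult<usize> {
    let reader = ArrowArrayStreamReader::from_pyarrow_bound(reader)?;
    let batches: Vec<RecordBatch> = reader
        .collect::<Result<_, ArrowError>>()
        .map_err(|e| PyValueError::new_err(e.to_string()))?;
    Ok(batches.iter().map(|rb| rb.num_rows()).sum())
}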

I personally think having such a wrapper could be nice, since it simplifies things when you already have a Vec<RecordBatch> on the Rust side anyway, or need to handle a pyarrow.Table on the Python side and want an easy way to produce one from Rust. The documentation could still state that streaming approaches are generally preferred and that the pyarrow.Table convenience wrapper should only be used in cases where users know what they're doing.

Slightly nicer Python workflow

In our very specific example, we have a Python class with a function such as this one:

class ParquetFile:
  def read_row_group(self, index: int) -> pyarrow.RecordBatch: ...

In the issue I linked, this unfortunately breaks down for a specific Parquet file, since a particular row group isn't expressible as a single RecordBatch without changing types somewhere. Either you change the underlying Arrow types from String to LargeString or StringView (a sketch of that route follows below), or you change the returned type from pyarrow.RecordBatch to, for example, Iterator[pyarrow.RecordBatch] (or RecordBatchReader or any other streaming-capable object).
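
For the type-changing route, the parquet crate can be given an Arrow schema hint at read time. A minimal sketch, assuming a single string column named "foo" and that ArrowReaderOptions::with_schema accepts the Utf8-to-LargeUtf8 override:

use std::fs::File;
use std::sync::Arc;
use arrow_schema::{DataType, Field, Schema};
use parquet::arrow::arrow_reader::{ArrowReaderOptions, ParquetRecordBatchReaderBuilder};

/// Forces Utf8 -> LargeUtf8 so that 64-bit offsets avoid the i32
/// overflow in huge row groups; returns the total row count.
fn read_as_large_utf8(file: File) -> parquet::errors::Result<u64> {
    let schema = Arc::new(Schema::new(vec![Field::new("foo", DataType::LargeUtf8, true)]));
    let options = ArrowReaderOptions::new().with_schema(schema);
    let reader = ParquetRecordBatchReaderBuilder::try_new_with_options(file, options)?.build()?;
    let mut rows = 0u64;
    for batch in reader {
        rows += batch?.num_rows() as u64;
    }
    Ok(rows)
}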

The streaming variant comes with some syntactic shortcomings in contexts where you want to apply .to_pylist() to whatever read_row_group(...) returns:

import itertools
from typing import Any, Iterator

import pyarrow

rg: pyarrow.RecordBatch | Iterator[pyarrow.RecordBatch] = ParquetFile(...).read_row_group(0)
python_objs: list[dict[str, Any]]
if isinstance(rg, pyarrow.RecordBatch):
  python_objs = rg.to_pylist()
else:
  python_objs = list(itertools.chain.from_iterable(batch.to_pylist() for batch in rg))

With pyarrow.Table, there already exists a type that simplifies this a lot on the Python side:

rg: pyarrow.RecordBatch | pyarrow.Table = ParquetFile(...).read_row_group(0)
python_objs: list[dict[str, Any]] = rg.to_pylist()

And just for clarity: we unfortunately need the entire row group deserialized as Python objects, because the data ingestion pipelines consuming this expect access to the whole row group in bulk, so streaming approaches are sadly not usable.

> FWIW, if you wanted to look at external crates, PyTable exists and probably does what you want (disclosure: it's my project). That might alternatively give you ideas for how to handle the Table here if you still want to do that. (It's a separate crate for these reasons.)

Yes, in general I much prefer arro3's approach of being totally pyarrow-agnostic. In our case, unfortunately, we're currently still pretty hardcoded against pyarrow specifics and just use arrow-rs as a means to reduce memory load compared to reading and writing Parquet datasets with pyarrow directly.
