
Conversation

@jonded94 (Contributor) commented Nov 5, 2025

Rationale for this change

When dealing with Parquet files that have an exceedingly large amount of Binary or UTF8 data in one row group, returning the data as a single RecordBatch can fail because of index overflows (#7973).

In pyarrow this is usually solved by representing the data as a pyarrow.Table object whose columns are ChunkedArrays, which are essentially just lists of Arrow Arrays; equivalently, a pyarrow.Table can be seen as a representation of a list of RecordBatches.

I'd like to build a function in PyO3 that returns a pyarrow.Table, very similar to pyarrow's read_row_group method. With that, we could reach feature parity with pyarrow in circumstances of potential index overflows, without resorting to type changes (such as reading the data as LargeString or StringView columns).
Currently, as far as I can see, there is no way in arrow-pyarrow to export a pyarrow.Table directly; in particular, convenience methods for Vec<RecordBatch> seem to be missing. This PR implements a convenience wrapper that allows exporting a pyarrow.Table directly.

What changes are included in this PR?

A new struct Table is added to the arrow-pyarrow crate. It can be constructed from Vec<RecordBatch> or from an ArrowArrayStreamReader, and it implements FromPyArrow and IntoPyArrow.

FromPyArrow supports anything that either implements the Arrow stream protocol / is a RecordBatchReader, or that has a to_reader() method returning such a reader; pyarrow.Table supports both. IntoPyArrow results in a pyarrow.Table on the Python side, constructed through pyarrow.Table.from_batches(...); see the sketch below.
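
To make the shape concrete, here is a minimal sketch of what such a wrapper could look like. The field names, module paths, and trait plumbing are illustrative reconstructions on my part, not the exact code in this PR:

use arrow_array::ffi_stream::ArrowArrayStreamReader;
use arrow_array::{RecordBatch, RecordBatchReader};
use arrow_pyarrow::{FromPyArrow, IntoPyArrow};
use arrow_schema::{ArrowError, SchemaRef};
use pyo3::exceptions::PyValueError;
use pyo3::prelude::*;

/// An ordered list of RecordBatches sharing one schema, mirroring
/// what a pyarrow.Table represents.
pub struct Table {
    record_batches: Vec<RecordBatch>,
    schema: SchemaRef,
}

impl FromPyArrow for Table {
    fn from_pyarrow_bound(value: &Bound<PyAny>) -> PyResult<Self> {
        // pyarrow.Table exposes to_reader(); otherwise treat the
        // object itself as an Arrow stream.
        let source = if value.hasattr("to_reader")? {
            value.call_method0("to_reader")?
        } else {
            value.clone()
        };
        let reader = ArrowArrayStreamReader::from_pyarrow_bound(&source)?;
        let schema = reader.schema();
        let record_batches = reader
            .collect::<Result<Vec<_>, ArrowError>>()
            .map_err(|e| PyValueError::new_err(e.to_string()))?;
        Ok(Self { record_batches, schema })
    }
}

impl IntoPyArrow for Table {
    fn into_pyarrow(self, py: Python) -> PyResult<PyObject> {
        // Export each batch individually, then assemble the table on
        // the Python side via pyarrow.Table.from_batches(...).
        let batches = self
            .record_batches
            .into_iter()
            .map(|rb| rb.into_pyarrow(py))
            .collect::<PyResult<Vec<_>>>()?;
        let table_cls = py.import("pyarrow")?.getattr("Table")?;
        Ok(table_cls.call_method1("from_batches", (batches,))?.unbind())
    }
}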

Are these changes tested?

No, not yet. Please let me know whether you are generally fine with this PR, then I'll work on tests. So far I have only tested it locally with very simple PyO3 dummy functions that basically do a round-trip, and with them everything worked:

use arrow_array::RecordBatch;
use arrow_pyarrow::{PyArrowType, Table}; // `Table` being the new wrapper from this PR
use pyo3::prelude::*;

/// Accepts a pyarrow.Table and hands it straight back.
#[pyfunction]
pub fn roundtrip_table(table: PyArrowType<Table>) -> PyArrowType<Table> {
    table
}

/// Builds a pyarrow.Table from a list of pyarrow.RecordBatches.
#[pyfunction]
pub fn build_table(record_batches: Vec<PyArrowType<RecordBatch>>) -> PyArrowType<Table> {
    PyArrowType(Table::try_new(record_batches.into_iter().map(|rb| rb.0).collect()).unwrap())
}

Running these from Python:

>>> import pyo3parquet
>>> import pyarrow
>>> table = pyarrow.Table.from_pylist([{"foo": 1}])
>>> pyo3parquet.roundtrip_table(table)
pyarrow.Table
foo: int64
----
foo: [[1]]
>>> pyo3parquet.build_table(table.to_batches())
pyarrow.Table
foo: int64
----
foo: [[1]]

The real tests of course would be much more sophisticated than just this.

Are there any user-facing changes?

A new Table convenience wrapper is added!

@kylebarron (Member) left a comment

Historically the attitude of this crate has been to avoid "Table" constructs to push users towards streaming approaches.

I don't know what the stance of maintainers is towards including a Table construct for python integration.

FWIW, if you wanted to look at external crates, PyTable exists and probably does what you want (disclosure: it's my project). That might alternatively give you ideas for how to handle the Table here if you still want to do that. (It's a separate crate for these reasons.)

}

pub fn try_new(record_batches: Vec<RecordBatch>) -> Result<Self, ArrowError> {
    let schema = record_batches[0].schema();
@kylebarron (Member) commented on this snippet:

An Arrow table can be empty with no batches. It would probably be more reliable to store both batches and a standalone schema.
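
A sketch of that suggestion, with the schema passed in explicitly so that a table with zero batches stays representable (names and error handling here are illustrative, not from the PR):

use arrow_array::RecordBatch;
use arrow_schema::{ArrowError, SchemaRef};

pub struct Table {
    record_batches: Vec<RecordBatch>,
    schema: SchemaRef,
}

impl Table {
    /// Takes the schema explicitly instead of reading it from the first
    /// batch, so `record_batches` may be empty; every batch is then
    /// validated against the given schema.
    pub fn try_new(
        schema: SchemaRef,
        record_batches: Vec<RecordBatch>,
    ) -> Result<Self, ArrowError> {
        for rb in &record_batches {
            if rb.schema() != schema {
                return Err(ArrowError::SchemaError(format!(
                    "batch schema {:?} does not match table schema {:?}",
                    rb.schema(),
                    schema
                )));
            }
        }
        Ok(Self { record_batches, schema })
    }
}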

@jonded94 (Contributor, Author) commented Nov 6, 2025

Thanks @kylebarron for your very quick review! ❤️

> Historically the attitude of this crate has been to avoid "Table" constructs to push users towards streaming approaches.
>
> I don't know what the stance of maintainers is towards including a Table construct for python integration.

Yes, I'm also not too sure about it; that's why I have only sketched out a rough implementation without tests so far. One reason I think this could be nice to have in arrow-pyarrow is that the documentation itself mentions that there is no equivalent concept to pyarrow.Table and that one has to use slight workarounds:

> PyArrow has the notion of chunked arrays and tables, but arrow-rs doesn't have these same concepts. A chunked table is instead represented with Vec<RecordBatch>. A pyarrow.Table can be imported to Rust by calling pyarrow.Table.to_reader() and then importing the reader as an ArrowArrayStreamReader.
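
On the Rust side, that documented workaround looks roughly like the following sketch (consume_table is a hypothetical function name, and the module paths are approximate):

use arrow::ffi_stream::ArrowArrayStreamReader;
use arrow::pyarrow::FromPyArrow;
use arrow_array::RecordBatch;
use arrow_schema::ArrowError;
use pyo3::exceptions::PyValueError;
use pyo3::prelude::*;

/// Called from Python as `consume_table(table.to_reader())`: the
/// "chunked table" arrives as a stream and becomes a Vec<RecordBatch>.
#[pyfunction]
fn consume_table(reader: &Bound<'_, PyAny>) -> PyResult<usize> {
    let reader = ArrowArrayStreamReader::from_pyarrow_bound(reader)?;
    let batches: Vec<RecordBatch> = reader
        .collect::<Result<_, ArrowError>>()
        .map_err(|e| PyValueError::new_err(e.to_string()))?;
    Ok(batches.iter().map(|rb| rb.num_rows()).sum())
}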

I personally think having such a wrapper could be nice, since it simplifies things when you already have a Vec<RecordBatch> on the Rust side anyway, or need to handle a pyarrow.Table on the Python side and want an easy way to produce one from Rust. The documentation could still state that streaming approaches are generally preferred and that the pyarrow.Table convenience wrapper should only be used in cases where users know what they're doing.

Slightly nicer Python workflow

In our very specific example, we have a Python class with a function such as this one:

class ParquetFile:
  def read_row_group(self, index: int) -> pyarrow.RecordBatch: ...

In the issue I linked, this unfortunately breaks down for a specific Parquet file, since a particular row group isn't expressible as a single RecordBatch without changing types somewhere. Either you change the underlying Arrow types from String to LargeString or StringView (a sketch of that route follows below), or you change the returned type from pyarrow.RecordBatch to, for example, Iterator[pyarrow.RecordBatch] (or RecordBatchReader or any other streaming-capable object).
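
For the type-changing route, the parquet crate can be given an Arrow schema hint at read time. A minimal sketch, assuming a single string column named "foo" and that ArrowReaderOptions::with_schema accepts the Utf8-to-LargeUtf8 override:

use std::fs::File;
use std::sync::Arc;
use arrow_schema::{DataType, Field, Schema};
use parquet::arrow::arrow_reader::{ArrowReaderOptions, ParquetRecordBatchReaderBuilder};

/// Forces Utf8 -> LargeUtf8 so that 64-bit offsets avoid the i32
/// overflow in huge row groups; returns the total row count.
fn read_as_large_utf8(file: File) -> parquet::errors::Result<u64> {
    let schema = Arc::new(Schema::new(vec![Field::new("foo", DataType::LargeUtf8, true)]));
    let options = ArrowReaderOptions::new().with_schema(schema);
    let reader = ParquetRecordBatchReaderBuilder::try_new_with_options(file, options)?.build()?;
    let mut rows = 0u64;
    for batch in reader {
        rows += batch?.num_rows() as u64;
    }
    Ok(rows)
}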

The streaming variant comes with some syntactic shortcomings in contexts where you want to apply .to_pylist() to whatever read_row_group(...) returns:

import itertools
from typing import Any, Iterator

import pyarrow

rg: pyarrow.RecordBatch | Iterator[pyarrow.RecordBatch] = ParquetFile(...).read_row_group(0)
python_objs: list[dict[str, Any]]
if isinstance(rg, pyarrow.RecordBatch):
  python_objs = rg.to_pylist()
else:
  python_objs = list(itertools.chain.from_iterable(batch.to_pylist() for batch in rg))

With pyarrow.Table, there already exists a type that simplifies this a lot on the Python side:

rg: pyarrow.RecordBatch | pyarrow.Table = ParquetFile(...).read_row_group(0)
python_objs: list[dict[str, Any]] = rg.to_pylist()

And just for clarity: we unfortunately need the entire row group deserialized as Python objects, because the data ingestion pipelines consuming this expect access to the whole row group in bulk, so streaming approaches are sadly not usable.

> FWIW, if you wanted to look at external crates, PyTable exists and probably does what you want (disclosure: it's my project). That might alternatively give you ideas for how to handle the Table here if you still want to do that. (It's a separate crate for these reasons.)

Yes, in general I much prefer arro3's approach of being totally pyarrow-agnostic. In our case, unfortunately, we're currently still pretty hardcoded against pyarrow specifics and just use arrow-rs as a means to reduce memory load compared to reading and writing Parquet datasets with pyarrow directly.
