
Let ArrowArrayStreamReader handle schema with attached metadata + do schema checking #8944

Merged
alamb merged 3 commits into apache:main from jonded94:make-arrow-array-stream-reader-handle-metadata on Dec 9, 2025

Conversation

jonded94 (Contributor) commented Dec 3, 2025

Which issue does this PR close? / Rationale for this change

Solves an issue discovered during #8790: ArrowArrayStreamReader does not correctly expose schema-level metadata and does not check whether the StructArray constructed from the FFI stream actually corresponds to the expected schema.

What changes are included in this PR?

  • Change how RecordBatch is constructed inside ArrowArrayStreamReader so that it retains schema-level metadata and performs schema validity checks (see the sketch after this list).
  • Augment FFI tests with schema- and column-level metadata.
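
For illustration, a minimal sketch of what this construction could look like, assuming the FFI stream yields a StructArray and the stream's own schema (which carries the metadata) is available; `batch_with_metadata` is a hypothetical helper, not the PR's actual code:

```rust
use std::sync::Arc;
use arrow_array::{RecordBatch, StructArray};
use arrow_schema::{ArrowError, Schema};

// Hypothetical helper: build a RecordBatch from the StructArray decoded from
// the FFI stream, using the stream's schema (which carries the metadata)
// instead of a schema re-derived from the StructArray's fields.
// RecordBatch::try_new also validates that the columns match the schema.
fn batch_with_metadata(
    array: StructArray,
    schema: Arc<Schema>,
) -> Result<RecordBatch, ArrowError> {
    let (_fields, columns, _nulls) = array.into_parts();
    RecordBatch::try_new(schema, columns)
}
```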

Are these changes tested?

Yes, both _test_round_trip_export and _test_round_trip_import now test for metadata at the schema and column level.

Are there any user-facing changes?

Yes, ArrowArrayStreamReader is now able to export RecordBatches with schema-level metadata, and the interface is safer since it actually checks schema validity.

jonded94 (Contributor, Author) commented Dec 9, 2025

@kylebarron @alamb I added another commit that checks more thoroughly that metadata stays intact in the arrow-pyarrow-integration-testing tests.

@kylebarron already approved this PR; could you check again whether you are happy with the small change?

alamb (Contributor) left a comment


Looks good to me -- thank you

alamb merged commit dff6402 into apache:main Dec 9, 2025
26 checks passed
kylebarron (Member) commented
@jonded94 @alamb FYI this was a breaking change for record batches with zero columns. In kylebarron/arro3#475 I bumped package versions only, with no code changes on my side, and I now have two failing tests. This test now errors with:

Exception: Invalid argument error: must either specify a row count or at least one column

This is because the old code used RecordBatch::from(StructArray), which preserved the row count from the StructArray.

The new code uses RecordBatch::try_new, which cannot infer the row count when there are 0 columns.
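
A minimal sketch of the difference, using current arrow-rs APIs; the 4-row, 0-column StructArray below stands in for the result of a select([]):

```rust
use std::sync::Arc;
use arrow_array::{RecordBatch, StructArray};
use arrow_schema::Schema;

fn main() {
    // 4 rows, 0 columns: what a select([]) on a 4-row batch produces.
    let empty = StructArray::new_empty_fields(4, None);

    // The old path: From<StructArray> keeps the row count.
    let batch = RecordBatch::from(empty);
    assert_eq!(batch.num_rows(), 4);

    // The new path: try_new has no column to infer the row count from,
    // so a 0-column input fails with the error quoted above.
    let result = RecordBatch::try_new(Arc::new(Schema::empty()), vec![]);
    assert!(result.is_err());
}
```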

alamb (Contributor) commented Feb 11, 2026

Thanks for the report

I'll flag #9394 as being needed for

jonded94 (Contributor, Author) commented
@kylebarron thanks for raising this! I haven't looked into it in detail yet, but isn't this then a bug in the RecordBatch::try_new function, in that it spits out 0-row record batches?

kylebarron (Member) commented
To clarify:

  • It's not a 0-row record batch.
  • It's a 0-column record batch, with 4 rows. This can happen if you perform a select([]) on a record batch, which removes all columns but keeps the row count.
  • RecordBatch::try_new does error on 0-column record batches. There's a separate RecordBatch::try_new_with_options constructor for passing in a row length manually.
  • The two options for fixing it are either:
    • Revert to RecordBatch::from(StructArray), which handled the 0-column case, and then assign the metadata afterwards
    • Add a case that uses RecordBatch::try_new_with_options when the input isn't valid for RecordBatch::try_new (see the sketch below)
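
For reference, a minimal sketch of the second option with arrow-rs's RecordBatchOptions; `zero_column_batch` is illustrative only:

```rust
use std::sync::Arc;
use arrow_array::{RecordBatch, RecordBatchOptions};
use arrow_schema::{ArrowError, Schema};

// With an explicit row_count, a 0-column batch constructs fine.
fn zero_column_batch(num_rows: usize) -> Result<RecordBatch, ArrowError> {
    let options = RecordBatchOptions::new().with_row_count(Some(num_rows));
    RecordBatch::try_new_with_options(Arc::new(Schema::empty()), vec![], &options)
}
```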

alamb pushed a commit that referenced this pull request Feb 13, 2026
# Which issue does this PR close?

- Closes #9394

# Rationale for this change

PR #8944 introduced a regression: 0-column record batch streams could no longer be decoded.

# What changes are included in this PR?

- Construct `RecordBatch` with `try_new_with_options` using the `len` of
the `ArrayData`, instead of letting it implicitly determine the `len` by
looking at the first column (which is what `try_new` does); see the
sketch after this list.
- Slight refactor and reduction of code duplication of the existing
`test_stream_round_trip_[import/export]` tests
- Introduction of a new `test_stream_round_trip_no_columns` test 
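
A sketch of how this could look in the decode path, assuming `data` is the ArrayData decoded from the FFI stream; `to_batch` is illustrative, not the PR's exact code:

```rust
use arrow_array::{RecordBatch, RecordBatchOptions, StructArray};
use arrow_data::ArrayData;
use arrow_schema::{ArrowError, SchemaRef};

// Take the row count from the ArrayData itself rather than inferring it
// from the (possibly empty) column list. Assumes `data` is a struct-typed
// ArrayData, as the FFI stream produces.
fn to_batch(data: ArrayData, schema: SchemaRef) -> Result<RecordBatch, ArrowError> {
    let options = RecordBatchOptions::new().with_row_count(Some(data.len()));
    let (_fields, columns, _nulls) = StructArray::from(data).into_parts();
    RecordBatch::try_new_with_options(schema, columns, &options)
}
```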

# Are these changes tested?

Yes, both export and import are tested in
`test_stream_round_trip_no_columns`.

# Are there any user-facing changes?

0-column record batch streams should be decodable now.