Add BinaryFormatSupport and Row Encoder to arrow-avro Writer#9171
alamb merged 9 commits into apache:main
- Introduced `RecordEncoder::encode_rows` to buffer encoded rows as contiguous slices with per-row offsets using `BytesMut`.
- Added `Encoder` for row-by-row Avro encoding, including zero-copy `Bytes` row access via `EncodedRows`.
- Integrated `bytes` crate for efficient encoding operations.
- Updated writer API to offer `build_encoder` for stream formats (e.g., SOE) alongside row-capacity configuration support.
- Adjusted docs to highlight new encoder capabilities.
- Comprehensive tests added to validate single/multi-column, nullable, prefix-based, and empty batch encoding scenarios.
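The contiguous-buffer-plus-offsets scheme described above can be sketched with std types only (the real implementation uses `bytes::BytesMut`/`Bytes` for zero-copy slicing; the names below are hypothetical stand-ins, not the PR's API):

```rust
// Hypothetical, simplified sketch of the offsets scheme: all rows are
// appended back to back into one buffer, and row i is recovered as
// data[offsets[i]..offsets[i + 1]].
struct EncodedRowsSketch {
    data: Vec<u8>,       // all encoded rows, concatenated
    offsets: Vec<usize>, // offsets.len() == number of rows + 1
}

impl EncodedRowsSketch {
    fn new() -> Self {
        Self { data: Vec::new(), offsets: vec![0] }
    }

    fn push_row(&mut self, row: &[u8]) {
        self.data.extend_from_slice(row);
        self.offsets.push(self.data.len()); // record the row-end offset
    }

    fn len(&self) -> usize {
        self.offsets.len().saturating_sub(1)
    }

    fn row(&self, i: usize) -> Option<&[u8]> {
        let (start, end) = (*self.offsets.get(i)?, *self.offsets.get(i + 1)?);
        Some(&self.data[start..end])
    }
}
```

With `Bytes` in place of `Vec<u8>`, `row` would return an owned zero-copy slice instead of a borrow, which is what makes `EncodedRows` cheap to hand to downstream sinks.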
@mbrobbel @alamb @scovich @nathaniel-d-ef Would any of you have bandwidth to review this PR? Much of the diff is comments and tests. I was hoping to get this out in the v58.0.0 release. This is also rather pivotal for the future direction of `arrow-avro`.
nathaniel-d-ef
left a comment
Conceptually this looks solid to me, though I'll defer to those with more advanced Rust knowledge to get into the weeds on performance. This should be quite valuable to systems that just need the bytes - good work 👍
alamb
left a comment
Looks good to me @jecsand838
I have some small additional test suggestions
And some API suggestions / questions, but nothing I think is necessary before merge
Let me know how you would like to proceed
arrow-avro/src/writer/mod.rs
Outdated
```rust
// self.len() is defined as self.offsets.len().saturating_sub(1).
// The check `i >= self.len()` above ensures that `i < self.offsets.len() - 1`.
// Therefore, both `i` and `i + 1` are strictly within the bounds of `self.offsets`.
let (start_u64, end_u64) = unsafe {
```
did you see this use of unsafe make a difference in benchmarks?
I did see a difference surprisingly.
In the screenshot below I ran the benchmarks first with the unsafe code, then changed the production code to be safe and re-ran. There seemed to be a significant performance impact.
NOTE: For the safe test I used `let (start_u64, end_u64) = (self.offsets[i], self.offsets[i + 1]);`.
I made sure to push up the benches I used for this in a new benches/encoder.rs file, which can be expanded on in future PRs.
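The two variants being compared can be sketched as free functions (std-only, `usize` offsets; `row_bounds_*` are hypothetical names for illustration, not the PR's API):

```rust
// Safe variant: each index is bounds-checked at runtime, so one row
// lookup pays two checks.
fn row_bounds_safe(offsets: &[usize], i: usize) -> (usize, usize) {
    (offsets[i], offsets[i + 1])
}

// Unchecked variant: the caller-side length check (here an assert, in the
// PR an early `i >= self.len()` return) guarantees both indices are valid,
// so `get_unchecked` skips the per-access bounds checks.
fn row_bounds_unchecked(offsets: &[usize], i: usize) -> (usize, usize) {
    assert!(i + 1 < offsets.len(), "caller must pre-check the index");
    // SAFETY: the assert above guarantees both i and i + 1 are in bounds.
    unsafe { (*offsets.get_unchecked(i), *offsets.get_unchecked(i + 1)) }
}
```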
arrow-avro/src/writer/mod.rs
Outdated
````rust
/// # }
/// ```
pub fn rows(&self) -> impl Iterator<Item = Result<Bytes, ArrowError>> + '_ {
    (0..self.len()).map(|i| self.row(i))
````
This is likely more efficient if you returned the sliced Bytes directly -- calling `row` will continually recheck `len`, for example
You could do something like this to get known good offsets
`self.offsets.iter().windows(2).map(...)`
This was a great call out. I went ahead and implemented those changes and renamed the method from rows to iter which seemed more idiomatic.
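The `windows(2)` approach can be sketched with plain slices (a std-only stand-in for the real `Bytes`-based `iter`; the free function below is hypothetical):

```rust
// Each window of two adjacent offsets yields exactly one row slice, so the
// iterator never re-checks the row count and inherits ExactSizeIterator
// from `windows`.
fn iter_rows<'a>(
    data: &'a [u8],
    offsets: &'a [usize],
) -> impl ExactSizeIterator<Item = &'a [u8]> + 'a {
    offsets.windows(2).map(move |w| &data[w[0]..w[1]])
}
```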
arrow-avro/src/writer/mod.rs
Outdated
```rust
pub fn to_vecs(&self) -> Result<Vec<Vec<u8>>, ArrowError> {
    let mut out = Vec::with_capacity(self.len());
    for i in 0..self.len() {
        out.push(self.row(i)?.to_vec());
    }
    Ok(out)
}
```
This seems like an unnecessary API to me -- you could do the same with
`let vecs: Vec<_> = rows.iter().map(|v| v.to_vec()).collect();`
100% great catch. I was overthinking this. Ended up removing to_vecs in my latest push and updated the documentation / examples to better showcase this.
arrow-avro/src/writer/mod.rs
Outdated
```rust
/// A row-by-row streaming encoder for Avro **Single Object Encoding** (SOE) streams.
```
I wonder why a user couldn't just use `Writer` with a mut `Vec` as the sink -- you would get the same effect
Is the difference that you get the output offsets as well?
Great question! At the byte level, `Writer<_, AvroSoeFormat>` writing into a `Vec<u8>` does produce the same concatenated output stream.
The reason for `Encoder`, however, is that neither SOE nor the Confluent/Apicurio wire formats include a length field (SOE is just `0xC3 0x01` + an 8-byte hashed fingerprint + the body, while Confluent is a magic byte + a 4-byte schema id + the body). So once multiple rows are written into a single `Vec`, there is no cheap or fully reliable way (especially for the wire formats) to split it back into per-row payloads without either decoding or getting hacky. Support for the binary format was essentially blocked, since those payloads aren't framed at all and therefore have no makeshift delimiter to scan for and split by.
Additionally, I hit performance bottlenecks when developing message-oriented sinks (Kafka/Pulsar/etc.) downstream of `arrow-avro`. These stemmed from having to use the `Writer` to encode 1-row batches while tracking `Vec` lengths, which is much less efficient due to repeated per-call setup and per-row allocations and copies.
The new `Encoder` solves this, and enables the binary format, by recording row-end offsets during encoding and returning zero-copy `Bytes` slices per row (via `EncodedRows`).
Add additional test coverage
@alamb Thank you so much for the review and for the tests! I ended up merging your PR in and pushing up some changes to address the comments you left. I think your recommendations were solid and worth getting in now. Also, I left some answers to your questions about the design. Let me know what you think when you get a chance.
alamb
left a comment
Looks good to me -- thanks @jecsand838
```diff
-let (start_u64, end_u64) = unsafe {
+// The check `n >= self.len()` above ensures that `n < self.offsets.len() - 1`.
+// Therefore, both `n` and `n + 1` are strictly within the bounds of `self.offsets`.
+let (start, end) = unsafe {
```
using `usize` rather than `u64` seems like a nice cleanup
100%, that became apparent to me rather quickly lol.
arrow-avro/src/writer/mod.rs
Outdated
```rust
pub fn iter(&self) -> impl ExactSizeIterator<Item = Bytes> + '_ {
    self.offsets.windows(2).map(|w| {
        debug_assert!(w[0] <= w[1] && w[1] <= self.data.len());
        self.data.slice(w[0]..w[1])
```
given you are using `slice` here I suspect the extra debug assert is not necessary, as the slice also does the same check
Ah yes, you are correct. I went ahead and removed the extra debug assert.
Sorry -- this now has a conflict (likely due to the new `AvroError`)
@alamb No worries! I just pushed up the changes to resolve the conflicts and use the new `AvroError`.
🚀
Which issue does this PR close?
Rationale for this change
`arrow-avro` already supports writing Avro Object Container Files (OCF) and framed streaming encodings (e.g. Single-Object Encoding / registry wire formats). However, many systems exchange raw Avro binary datum payloads (i.e. only the Avro record body bytes) while supplying the schema out-of-band (configuration, RPC contract, topic metadata, etc.). Without first-class support for unframed datum output, users must either:
This PR adds the missing unframed write path and exposes a row-by-row encoding API to make it easy to embed Avro datums into other transport protocols.
What changes are included in this PR?
- `AvroBinaryFormat` (unframed) as an `AvroFormat` implementation, to emit raw Avro record body bytes (no SOE prefix and no OCF header) and to explicitly reject container-level compression for this format.
- `RecordEncoder::encode_rows` to encode a `RecordBatch` into a single contiguous buffer while tracking per-row boundaries via appended offsets.
- `Encoder` + `EncodedRows` API for row-by-row streaming use cases, providing zero-copy access to individual row slices (via `Bytes`).
- `build_encoder` for stream formats (e.g. SOE), plus row-capacity configuration to better support incremental/streaming workflows.
- `bytes` crate as a dependency to support efficient buffering and slicing in the row encoder, with adjusted dev-dependencies to support the new tests/docs.
Yes.
This PR adds unit tests that cover:
Are there any user-facing changes?
Yes, these changes are additive (no breaking public API changes expected).
- A new unframed format (`AvroBinaryFormat`).
- New encoding APIs (`RecordEncoder::encode_rows`, `Encoder`, `EncodedRows`) to support zero-copy access to per-row encoded bytes.
- New `WriterBuilder` functionality (`build_encoder` + row-capacity configuration) to enable encoder construction without committing to a specific `Write` sink.