@jecsand838
Contributor

Which issue does this PR close?

Rationale for this change

arrow-avro already supports writing Avro Object Container Files (OCF) and framed streaming encodings (e.g. Single-Object Encoding / registry wire formats). However, many systems exchange raw Avro binary datum payloads (i.e. only the Avro record body bytes) while supplying the schema out-of-band (configuration, RPC contract, topic metadata, etc.).
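To make the distinction concrete, here is a minimal, std-only sketch of what "unframed" means at the byte level. It is not the arrow-avro implementation: the record shape (`id: long`, `name: string`), the helper names, and the fingerprint value passed by the caller are all assumptions made for illustration. The encoding rules themselves (zig-zag varints for `long`, length-prefixed UTF-8 for `string`, and the `0xC3 0x01` marker plus 8-byte little-endian schema fingerprint for Single-Object Encoding) follow the Avro specification.

```rust
/// Zig-zag encode a signed 64-bit value, then emit it as a variable-length
/// integer (7 bits per byte, least-significant group first), as Avro does
/// for `int` and `long`.
fn write_long(out: &mut Vec<u8>, value: i64) {
    let mut n = ((value << 1) ^ (value >> 63)) as u64;
    while n >= 0x80 {
        out.push((n as u8 & 0x7F) | 0x80); // continuation bit set
        n >>= 7;
    }
    out.push(n as u8);
}

/// Avro `string`: byte length as a `long`, followed by the UTF-8 bytes.
fn write_string(out: &mut Vec<u8>, s: &str) {
    write_long(out, s.len() as i64);
    out.extend_from_slice(s.as_bytes());
}

/// Raw (unframed) datum for a hypothetical record {id: long, name: string}.
/// A record body is just its fields, encoded in schema order.
fn encode_datum(id: i64, name: &str) -> Vec<u8> {
    let mut out = Vec::new();
    write_long(&mut out, id);
    write_string(&mut out, name);
    out
}

/// The same datum wrapped in Single-Object Encoding: a 2-byte marker,
/// the 8-byte little-endian schema fingerprint, then the datum.
/// The fingerprint is supplied by the caller (assumed precomputed).
fn frame_single_object(fingerprint: u64, datum: &[u8]) -> Vec<u8> {
    let mut out = Vec::with_capacity(10 + datum.len());
    out.extend_from_slice(&[0xC3, 0x01]);
    out.extend_from_slice(&fingerprint.to_le_bytes());
    out.extend_from_slice(datum);
    out
}
```

In this sketch, `encode_datum(1, "ab")` yields four bytes, while the framed form of the same datum is ten bytes longer. That extra prefix is the overhead a consumer that receives the schema out-of-band does not expect.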

Without first-class support for unframed datum output, users must either:

  • accept framing overhead that downstream systems don’t expect, or
  • re-implement datum encoding themselves.

This PR adds the missing unframed write path and exposes a row-by-row encoding API to make it easy to embed Avro datums into other transport protocols.

What changes are included in this PR?

  • Added AvroBinaryFormat (unframed) as an AvroFormat implementation that emits raw Avro record body bytes (no SOE prefix, no OCF header) and explicitly rejects container-level compression, which does not apply to this format.
  • Added RecordEncoder::encode_rows to encode a RecordBatch into a single contiguous buffer while tracking per-row boundaries via appended offsets.
  • Introduced a higher-level Encoder + EncodedRows API for row-by-row streaming use cases, providing zero-copy access to individual row slices (via Bytes).
  • Updated the writer API to provide build_encoder for stream formats (e.g. SOE) and added row-capacity configuration to better support incremental/streaming workflows.
  • Added the bytes crate as a dependency to support efficient buffering and slicing in the row encoder, and adjusted dev-dependencies to support the new tests/docs.
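The buffer-plus-offsets layout described in the list above can be sketched as follows. This is a std-only illustration, not the PR's code: `RowBuffer`, `push_row`, and `row` are hypothetical names, the offset convention (a leading `0` followed by one end offset per row) is an assumption, and borrowed slices of a `Vec<u8>` stand in for the `Bytes` slices the PR returns.

```rust
/// Hypothetical stand-in for a row encoder's output: every row lives in one
/// contiguous buffer, and `offsets` records where each row ends.
struct RowBuffer {
    /// Bytes written before every row (empty for unframed output, or e.g.
    /// an SOE marker plus fingerprint for a prefixed stream format).
    prefix: Vec<u8>,
    data: Vec<u8>,
    /// Assumed convention: offsets[0] == 0 and offsets[i + 1] is the end of row i.
    offsets: Vec<usize>,
}

impl RowBuffer {
    fn new(prefix: &[u8]) -> Self {
        Self { prefix: prefix.to_vec(), data: Vec::new(), offsets: vec![0] }
    }

    /// Append one already-encoded row body and record its end offset.
    fn push_row(&mut self, body: &[u8]) {
        self.data.extend_from_slice(&self.prefix);
        self.data.extend_from_slice(body);
        self.offsets.push(self.data.len());
    }

    /// Number of rows stored so far.
    fn len(&self) -> usize {
        self.offsets.len() - 1
    }

    /// Borrow row `i` without copying; panics if `i` is out of range.
    fn row(&self, i: usize) -> &[u8] {
        &self.data[self.offsets[i]..self.offsets[i + 1]]
    }
}
```

Keeping one allocation for the whole batch and handing out sub-slices is what makes per-row access cheap. With `Bytes`, the same idea gives reference-counted slices that can outlive the encoder, rather than borrows tied to it.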

Are these changes tested?

Yes.

This PR adds unit tests that cover:

  • single- and multi-column row encoding
  • nullable columns
  • prefix-based vs. unprefixed row encoding behavior
  • empty batch encoding
  • appending to existing output buffers and validating offset invariants
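As a rough illustration of what "offset invariants" could mean when rows are appended to a buffer that already holds data, the check below encodes one plausible set of rules. The function name and the convention (the first offset equals the buffer length before the append, the last equals the length after it, and offsets never decrease) are assumptions for this sketch, not a description of the PR's actual tests.

```rust
/// Validate a run of row offsets produced by appending rows to a buffer.
///
/// `base_len` is the buffer length before the append and `buf_len` the
/// length after it. Returns the number of rows described by `offsets`.
fn check_offsets(offsets: &[usize], base_len: usize, buf_len: usize) -> Result<usize, String> {
    let (first, last) = match (offsets.first(), offsets.last()) {
        (Some(f), Some(l)) => (*f, *l),
        _ => return Err("offsets must contain at least the start position".to_string()),
    };
    if first != base_len {
        return Err(format!("first offset {first} != pre-existing length {base_len}"));
    }
    if last != buf_len {
        return Err(format!("last offset {last} != buffer length {buf_len}"));
    }
    // Each row must end at or after the point where it starts.
    if let Some(w) = offsets.windows(2).find(|w| w[0] > w[1]) {
        return Err(format!("offsets decrease from {} to {}", w[0], w[1]));
    }
    Ok(offsets.len() - 1)
}
```

Under these rules an empty batch is simply a single offset equal to the current buffer length, and a zero-length row shows up as two equal consecutive offsets.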

Are there any user-facing changes?

Yes. The changes are additive (no breaking public API changes are expected).

  • New writer format support for unframed Avro binary datum output (AvroBinaryFormat).
  • New row-by-row encoding APIs (RecordEncoder::encode_rows, Encoder, EncodedRows) to support zero-copy access to per-row encoded bytes.
  • New WriterBuilder functionality (build_encoder + row-capacity configuration) to enable encoder construction without committing to a specific Write sink.
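One reason to build an encoder without binding it to a `Write` sink is that the caller can decide later how each encoded row is transported. The sketch below shows that idea with a hypothetical length-prefixed framing (4-byte big-endian length, then the row bytes). The framing and the function name are invented for illustration, and the rows are plain byte slices rather than the PR's `EncodedRows`.

```rust
use std::io::{self, Write};

/// Write each encoded row to `sink` as a 4-byte big-endian length followed
/// by the row bytes. Returns the number of rows written.
///
/// The framing is a made-up transport format; the rows are assumed to be
/// already-encoded Avro datums produced elsewhere.
fn write_length_prefixed<'a, W: Write>(
    rows: impl IntoIterator<Item = &'a [u8]>,
    sink: &mut W,
) -> io::Result<usize> {
    let mut count = 0;
    for row in rows {
        let len = u32::try_from(row.len())
            .map_err(|_| io::Error::new(io::ErrorKind::InvalidInput, "row too large"))?;
        sink.write_all(&len.to_be_bytes())?;
        sink.write_all(row)?;
        count += 1;
    }
    Ok(count)
}
```

Any `Write` implementation works as the sink here (a `Vec<u8>`, a socket, a file), which is the kind of flexibility a sink-independent encoder is meant to allow.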

@github-actions github-actions bot added arrow Changes to the arrow crate arrow-avro arrow-avro crate labels Jan 14, 2026
- Introduced `RecordEncoder::encode_rows` to buffer encoded rows as contiguous slices with per-row offsets using `BytesMut`.
- Added `Encoder` for row-by-row Avro encoding, including zero-copy `Bytes` row access via `EncodedRows`.
- Integrated `bytes` crate for efficient encoding operations.
- Updated writer API to offer `build_encoder` for stream formats (e.g., SOE) alongside row-capacity configuration support.
- Adjusted docs to highlight new encoder capabilities.
- Comprehensive tests added to validate single/multi-column, nullable, prefix-based, and empty batch encoding scenarios.
@jecsand838 jecsand838 changed the title Add BinaryFormatSupport to arrow-avro Writer Add BinaryFormatSupport and Row Encoder to arrow-avro Writer Jan 14, 2026
@jecsand838
Contributor Author

@mbrobbel @alamb @scovich @nathaniel-d-ef

Would any of you have bandwidth to review this PR? Much of the diff is comments and tests. I was hoping to get this out in the v58.0.0 release. This is also rather pivotal for the future direction of the arrow-avro Writer, so I'd absolutely love feedback regarding the row-wise Encoder architecture.

@jecsand838 jecsand838 force-pushed the avro-row-encoder branch 2 times, most recently from 14bc1ae to 5ded4c0 Compare January 16, 2026 00:29
Development

Successfully merging this pull request may close these issues.

[arrow-avro] Add Avro BinaryFormat (Unframed) to writer module
