Add BinaryFormatSupport and Row Encoder to arrow-avro Writer
#9171
+1,282
−62
Which issue does this PR close?
Rationale for this change
arrow-avro already supports writing Avro Object Container Files (OCF) and framed streaming encodings (e.g. Single-Object Encoding / registry wire formats). However, many systems exchange raw Avro binary datum payloads (i.e. only the Avro record body bytes) while supplying the schema out-of-band (configuration, RPC contract, topic metadata, etc.).
Without first-class support for unframed datum output, users must either:
This PR adds the missing unframed write path and exposes a row-by-row encoding API to make it easy to embed Avro datums into other transport protocols.
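To make the framed/unframed distinction concrete, here is a minimal standard-library sketch of what Single-Object Encoding adds around a record body. It does not use arrow-avro; `frame_single_object` and `strip_single_object` are hypothetical helper names, and the fingerprint is assumed to be the CRC-64-AVRO fingerprint of the writer schema, computed elsewhere.

```rust
/// Two-byte marker that starts every Avro single-object encoded message.
const SOE_MAGIC: [u8; 2] = [0xC3, 0x01];

/// Wrap an already-encoded record body in single-object framing:
/// marker, then the 8-byte little-endian schema fingerprint, then the body.
/// (Hypothetical helper; `fingerprint` is assumed to be computed elsewhere.)
fn frame_single_object(fingerprint: u64, body: &[u8]) -> Vec<u8> {
    let mut out = Vec::with_capacity(SOE_MAGIC.len() + 8 + body.len());
    out.extend_from_slice(&SOE_MAGIC);
    out.extend_from_slice(&fingerprint.to_le_bytes());
    out.extend_from_slice(body);
    out
}

/// Recover the unframed datum from a framed message, or `None` if the
/// marker is missing or the message is too short to hold the header.
fn strip_single_object(framed: &[u8]) -> Option<&[u8]> {
    if framed.len() < 10 || framed[..2] != SOE_MAGIC {
        return None;
    }
    Some(&framed[10..])
}
```

An unframed datum is just the `body` bytes on their own, which is why the reader must already know the schema: nothing in the payload identifies it.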
What changes are included in this PR?
- AvroBinaryFormat (unframed) as an AvroFormat implementation to emit raw Avro record body bytes (no SOE prefix and no OCF header) and to explicitly reject container-level compression for this format.
- RecordEncoder::encode_rows to encode a RecordBatch into a single contiguous buffer while tracking per-row boundaries via appended offsets.
- Encoder + EncodedRows API for row-by-row streaming use cases, providing zero-copy access to individual row slices (via Bytes).
- build_encoder for stream formats (e.g. SOE) and added row-capacity configuration to better support incremental/streaming workflows.
- bytes crate as a dependency to support efficient buffering and slicing in the row encoder, and adjusted dev-dependencies to support the new tests/docs.
Are these changes tested?
Yes.
This PR adds unit tests that cover:
Are there any user-facing changes?
Yes, these changes are additive (no breaking public API changes expected).
- AvroBinaryFormat.
- RecordEncoder::encode_rows, Encoder, and EncodedRows, to support zero-copy access to per-row encoded bytes.
- WriterBuilder functionality (build_encoder + row-capacity configuration) to enable encoder construction without committing to a specific Write sink.
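The "single contiguous buffer plus per-row offsets" layout described for RecordEncoder::encode_rows can be sketched with the standard library alone. This is an illustration of the technique, not the PR's implementation: `RowBuffer`, `encode_rows`, `write_long`, and `write_string` are hypothetical names, the record shape (a `long` followed by a `string`) is an assumption, and borrowed slices stand in for the `Bytes` handles the PR uses.

```rust
/// Append an Avro `long`: zigzag-encode, then write as a base-128 varint.
fn write_long(buf: &mut Vec<u8>, v: i64) {
    let mut z = ((v << 1) ^ (v >> 63)) as u64;
    while z >= 0x80 {
        buf.push((z as u8 & 0x7F) | 0x80);
        z >>= 7;
    }
    buf.push(z as u8);
}

/// Append an Avro `string`: byte length as a `long`, then the UTF-8 bytes.
fn write_string(buf: &mut Vec<u8>, s: &str) {
    write_long(buf, s.len() as i64);
    buf.extend_from_slice(s.as_bytes());
}

/// Hypothetical stand-in for an encoded-rows container: one shared buffer,
/// with `offsets[i]..offsets[i + 1]` delimiting row `i`.
struct RowBuffer {
    data: Vec<u8>,
    offsets: Vec<usize>,
}

impl RowBuffer {
    /// Number of encoded rows.
    fn len(&self) -> usize {
        self.offsets.len() - 1
    }

    /// Borrow row `i` without copying it out of the shared buffer.
    fn row(&self, i: usize) -> &[u8] {
        &self.data[self.offsets[i]..self.offsets[i + 1]]
    }
}

/// Encode `(id, name)` records as unframed Avro record bodies, appending
/// one offset after each row so boundaries can be recovered later.
fn encode_rows(rows: &[(i64, &str)]) -> RowBuffer {
    let mut data = Vec::new();
    let mut offsets = vec![0];
    for (id, name) in rows {
        write_long(&mut data, *id);
        write_string(&mut data, name);
        offsets.push(data.len());
    }
    RowBuffer { data, offsets }
}
```

Storing `n + 1` offsets for `n` rows keeps every row lookup a pair of index reads, and each row slice can be handed to a transport as-is because no framing bytes sit between rows.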