feat: Implement an AsyncReader for avro using ObjectStore #8930
Conversation
jecsand838 left a comment
Flushing a partial review with some high-level thoughts.
I'll wait for you to finish before resuming.
```rust
/// 5. If no range was originally provided, reads the full file.
/// 6. If the range is 0, file_size is 0, or `range.end` is less than the header length, finish immediately.
pub struct AsyncAvroReader {
    store: Arc<dyn object_store::ObjectStore>,
```
I think the biggest high-level concern I have is the object_store hardwiring. My gut tells me we'd be better off with a generic `AsyncFileReader<T: AsyncRead + AsyncSeek>` or similar trait as the primary abstraction, with object_store as one feature-flagged adapter, imo.
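To illustrate the shape I have in mind (names and bounds here are purely hypothetical, not a concrete proposal):

```rust
use tokio::io::{AsyncRead, AsyncSeek};

/// Hypothetical layering: the reader is generic over any async byte source,
/// rather than being hardwired to object_store.
pub struct AsyncAvroReader<T: AsyncRead + AsyncSeek + Unpin + Send> {
    input: T,
    batch_size: usize,
}
```

An object_store-backed type would then live behind the feature flag as one implementation among others, so DataFusion-style consumers can plug in their own I/O without pulling in object_store at all.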
Yeah, I was wondering how to implement this.
Perhaps just let the user (e.g. DataFusion) provide the impl and be completely agnostic.
Very fair. I think there's incredible value for an AsyncFileReader in arrow-avro, especially if implemented in a generic manner that is highly reusable across downstream projects. Also, object_store makes sense as a first-class adapter imo.
My original intention was providing the building blocks for projects such as DataFusion to use for more concrete, domain-specific implementations.
I'd recommend looking into the parquet crate for inspiration. It uses an abstraction and provides a `ParquetObjectReader`.
Honestly I think my main blocker is the schema thing here. I don't want to commit to the constructor before it is resolved, since it's a public API and I don't want it to be volatile.
100% I'm working on that right now and won't stop until I have a PR. That was a solid catch. The schema logic is an area of the code that I mean to give (or would welcome) a full refactor. I knew it would eventually come back.
```rust
pub async fn try_new(
    store: Arc<dyn object_store::ObjectStore>,
    location: Path,
    range: Option<Range<u64>>,
    file_size: u64,
    reader_schema: Option<AvroSchema>,
    batch_size: usize,
) -> Result<Self, ArrowError> {
    let file_size = if file_size == 0 {
        store
            .head(&location)
            .await
            .map_err(|err| {
                ArrowError::AvroError(format!("HEAD request failed for file, {err}"))
            })?
            .size
    } else {
        file_size
    };
```
Also, I'd probably consider either using a builder pattern or defining an `AsyncAvroReaderOptions` struct for these params.
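e.g. roughly this (field set copied from the current constructor; the struct itself is hypothetical):

```rust
use std::ops::Range;

use crate::schema::AvroSchema; // path assumed from this crate's layout

/// Hypothetical options struct bundling the optional params, so `try_new`
/// keeps a small, stable signature.
#[derive(Default)]
pub struct AsyncAvroReaderOptions {
    /// Byte range to read; `None` reads the whole file.
    pub range: Option<Range<u64>>,
    /// Known file size; `None` means resolve it with a HEAD request.
    pub file_size: Option<u64>,
    /// Optional reader schema for resolution/projection.
    pub reader_schema: Option<AvroSchema>,
    /// Rows per output batch; `None` uses the crate default.
    pub batch_size: Option<usize>,
}
```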
Sorry, I haven't dropped it, just found myself in a really busy week! The generic reader support does not seem too hard to implement from the dabbling I've done, and I still need to get to the builder pattern change.
…, separate object store file reader into a feature-gated struct and use a generic async file reader trait
@jecsand838 I believe this is now ready for a proper review^
@EmilyMatt Thank you so much for getting these changes up!
I left a few comments. Let me know what you think.
EDIT: Should have mentioned that this is looking really good overall and I'm very excited for the AsyncReader!
```toml
# Enable async APIs
async = ["futures"]
# Enable object_store integration
object_store = ["dep:object_store", "async"]
```
I'd recommend updating the README.md and docs with details on these new features.
I'm hesitant as I am notoriously bad at writing docs 😅
Will use Claude to try and make something
```rust
/// A broad generic trait definition allowing fetching bytes from any source asynchronously.
/// This trait has very few limitations, mostly in regard to ownership and lifetime,
/// but it must return a boxed Future containing [`bytes::Bytes`] or an error.
```
You may want to provide examples on how to use this for the docs.
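Even a minimal in-memory implementation would work well as a doc example, e.g. (a sketch; I'm assuming `DataFetchFutureBoxed` resolves to a pinned, boxed future of `Result<Bytes, ArrowError>` per the doc comment above):

```rust
use std::ops::Range;

use bytes::Bytes;

/// Hypothetical doc example: serve byte ranges from an in-memory buffer.
struct InMemoryReader {
    data: Bytes,
}

impl AsyncFileReader for InMemoryReader {
    fn fetch_range(&mut self, range: Range<u64>) -> DataFetchFutureBoxed {
        let slice = self.data.slice(range.start as usize..range.end as usize);
        Box::pin(async move { Ok(slice) })
    }
}
```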
```rust
    pub(super) reader_schema: Option<AvroSchema>,
}

impl<R: AsyncFileReader> AsyncAvroReaderBuilder<R> {
```
Nit, but I'd consider naming this either `AsyncFileReaderBuilder` or `AsyncOcfReaderBuilder`.
Will rename to the first, tbh I'm not sure if and how we intend to handle the other types yet.
```rust
None => {
    let devised_avro_schema = AvroSchema::try_from(self.schema.as_ref())?;
    let devised_reader_schema = devised_avro_schema.schema()?;
    field_builder
        .with_reader_schema(&devised_reader_schema)
        .build()
}
```
Shouldn't we just execute `field_builder.build()` without a reader_schema in this case?
The `Reader` treats this scenario as one where the caller simply wants to decode an OCF file without schema resolution, purely using the writer_schema.
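In other words, roughly (`maybe_reader_schema` standing in for the already-parsed schema):

```rust
// Sketch of the suggested shape: only attach a reader schema when one was
// actually provided; otherwise build against the writer schema alone.
match maybe_reader_schema {
    Some(schema) => field_builder.with_reader_schema(&schema).build(),
    None => field_builder.build(),
}
```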
Yeah, I'm gonna remove this completely right now, since I don't really trust the conversion either.
```rust
pub fn builder(
    reader: R,
    file_size: u64,
    schema: SchemaRef,
```
I really don't think the schema field should be required here, as it subtracts from Avro's self-describing characteristic while effectively making the optional reader_schema required.
I'd recommend setting this up in a manner that encourages callers to use `with_reader_schema`. That way, callers which simply want to read an OCF file without schema resolution are optimally supported.
If we absolutely need to support passing in an Arrow reader_schema, then I'd recommend adding an optional (and well documented) `with_arrow_reader_schema` method (to complement `with_reader_schema`) that takes an Arrow `SchemaRef` and runs `AvroSchema::try_from` on it.
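Roughly this shape (a sketch; the builder internals are assumed):

```rust
impl<R: AsyncFileReader> AsyncAvroReaderBuilder<R> {
    /// Hypothetical complement to `with_reader_schema`: accept an Arrow schema
    /// and convert it up front, keeping the conversion explicit and fallible.
    pub fn with_arrow_reader_schema(mut self, schema: SchemaRef) -> Result<Self, ArrowError> {
        self.reader_schema = Some(AvroSchema::try_from(schema.as_ref())?);
        Ok(self)
    }
}
```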
Yeah, I'm removing this completely; users should use a reader schema directly if they so choose.
```rust
/// 4. If a block is incomplete (due to range ending mid-block), fetching the remaining bytes from the [`AsyncFileReader`].
/// 5. If no range was originally provided, reads the full file.
/// 6. If the range is 0, file_size is 0, or `range.end` is less than the header length, finish immediately.
pub struct AsyncAvroReader<R: AsyncFileReader> {
```
Nit, but I'd also consider calling this `AsyncFileReader` or `AsyncOcfReader`.
```rust
pub trait AsyncFileReader: Send + Unpin {
    /// Fetch a range of bytes asynchronously using a custom reading method
    fn fetch_range(&mut self, range: Range<u64>) -> DataFetchFutureBoxed;

    /// Fetch a range that is beyond the originally provided file range,
    /// such as reading the header before reading the file,
    /// or fetching the remainder of the block in case the range ended before the block's end.
    /// By default, this will simply point to the fetch_range function.
    fn fetch_extra_range(&mut self, range: Range<u64>) -> DataFetchFutureBoxed {
        self.fetch_range(range)
    }
}
```
What's your take on aligning this a bit more with the trait used in parquet and arrow/async_reader?
```rust
// Current:
pub trait AsyncFileReader: Send + Unpin {
    /// Fetch a range of bytes asynchronously using a custom reading method
    fn fetch_range(&mut self, range: Range<u64>) -> DataFetchFutureBoxed;

    /// Fetch a range that is beyond the originally provided file range,
    /// such as reading the header before reading the file,
    /// or fetching the remainder of the block in case the range ended before the block's end.
    /// By default, this will simply point to the fetch_range function.
    fn fetch_extra_range(&mut self, range: Range<u64>) -> DataFetchFutureBoxed {
        self.fetch_range(range)
    }
}

// Suggested:
pub trait AsyncFileReader: Send {
    /// Retrieve the bytes in `range`
    fn get_bytes(&mut self, range: Range<u64>) -> BoxFuture<'_, Result<Bytes>>;
}
```
My thinking is this:
- The `get_bytes` trait method is just "fetch these bytes". It doesn't know or care whether the range is within some "expected" range. The out-of-band reads (header, partial block completion) could be a concern of the reader logic, not the I/O trait.
- Users already understand `get_bytes`/`get_byte_ranges`. Reusing that mental model reduces friction. Plus, consistency across crates is generally a best practice.
- This would unlock a clean default impl for `AsyncRead + AsyncSeek` (like `tokio::fs::File`) the same way parquet does; see the sketch after this list. The current `'static` requirement forces all implementations to be fully owned or `Arc`-wrapped, which seems unnecessarily rigid for simple file readers.
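For reference, a sketch of what that blanket impl could look like (the error alias and exact bounds here are assumptions, not the final API):

```rust
use std::io::SeekFrom;
use std::ops::Range;

use bytes::Bytes;
use futures::future::BoxFuture;
use futures::FutureExt;
use tokio::io::{AsyncRead, AsyncReadExt, AsyncSeek, AsyncSeekExt};

type Result<T> = std::result::Result<T, std::io::Error>; // stand-in error alias

pub trait AsyncFileReader: Send {
    /// Retrieve the bytes in `range`
    fn get_bytes(&mut self, range: Range<u64>) -> BoxFuture<'_, Result<Bytes>>;
}

/// Blanket impl mirroring parquet's, so e.g. `tokio::fs::File` works out of the box.
impl<T: AsyncRead + AsyncSeek + Unpin + Send> AsyncFileReader for T {
    fn get_bytes(&mut self, range: Range<u64>) -> BoxFuture<'_, Result<Bytes>> {
        async move {
            self.seek(SeekFrom::Start(range.start)).await?;
            let len = (range.end - range.start) as usize;
            let mut buf = vec![0u8; len];
            self.read_exact(&mut buf).await?;
            Ok(buf.into())
        }
        .boxed()
    }
}
```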
1. Indeed, but I'd like something like fetching the metadata/header in this case to be separate; parquet does this as well.
2. I agree with that, I will consider how to best approach this^^
3. While it's true that the trait does not need this restriction, it is necessary in order to write the actual code; the parquet reader also has `<T: AsyncFileReader + Send + 'static>`. Otherwise you simply could not use the underlying async readers.
Not only that, looking deeper it seems it also uses:
```rust
impl<T> ParquetRecordBatchStream<T>
where
    T: AsyncFileReader + Unpin + Send + 'static,
```
```rust
let mut decoder = HeaderDecoder::default();
let mut position = 0;
loop {
    let range_to_fetch = position..(position + 64 * 1024).min(self.file_size);
```
Is there a reason for hardcoding `position + 64 * 1024`?
Not really, my usual files have a smaller header but I figured this is a small enough value to be inconsequential for fetches and will almost certainly mean we don't have to run the loop more than once.
Added an optional hint
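i.e. roughly this shape (the name and backing field may differ in the final version):

```rust
impl<R: AsyncFileReader> AsyncAvroReaderBuilder<R> {
    /// Hypothetical builder method: lets callers size the initial header fetch
    /// instead of always pulling the hardcoded 64 KiB chunk.
    pub fn with_header_size_hint(mut self, hint: u64) -> Self {
        self.header_size_hint = Some(hint);
        self
    }
}
```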
```rust
// Should clamp to file size
assert_eq!(batch.num_rows(), 8);
```
It may also be worth adding round-trip tests using the `Writer`.
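Even something minimal along these lines would help (all helper names here are hypothetical placeholders, not crate APIs):

```rust
/// Hypothetical round-trip test: write a batch with the crate's Writer,
/// then read it back through the async reader and compare.
#[tokio::test]
async fn round_trips_through_writer() {
    let batch = make_test_batch(); // hypothetical: builds a small RecordBatch
    let ocf = write_to_ocf(&batch); // hypothetical: Writer -> Vec<u8>
    let batches = read_async(ocf).await; // hypothetical: full async reader path
    assert_eq!(batches, vec![batch]);
}
```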
Added something quite minimal, but I don't know how much value it adds
```rust
/// An implementation of an AsyncFileReader using the [`object_store::ObjectStore`] API.
pub struct ObjectStoreFileReader {
    store: Arc<dyn object_store::ObjectStore>,
    location: object_store::path::Path,
}

impl ObjectStoreFileReader {
    /// Creates a new [`Self`] from a store implementation and file location.
    pub fn new(
        store: Arc<dyn object_store::ObjectStore>,
        location: object_store::path::Path,
    ) -> Self {
        Self { store, location }
    }
}
```
Another thing that occurred to me is that we could support different runtimes for reading from `ObjectStore` and Avro decoding by following the pattern below from the `ParquetObjectReader`.
```rust
/// Perform IO on the provided tokio runtime
///
/// Tokio is a cooperative scheduler, and relies on tasks yielding in a timely manner
/// to service IO. Therefore, running IO and CPU-bound tasks, such as parquet decoding,
/// on the same tokio runtime can lead to degraded throughput, dropped connections and
/// other issues. For more information see [here].
///
/// [here]: https://www.influxdata.com/blog/using-rustlangs-async-tokio-runtime-for-cpu-bound-tasks/
pub fn with_runtime(self, handle: Handle) -> Self {
    Self {
        runtime: Some(handle),
        ..self
    }
}

fn spawn<F, O, E>(&self, f: F) -> BoxFuture<'_, Result<O>>
where
    F: for<'a> FnOnce(&'a Arc<dyn ObjectStore>, &'a Path) -> BoxFuture<'a, Result<O, E>>
        + Send
        + 'static,
    O: Send + 'static,
    E: Into<ParquetError> + Send + 'static,
{
    match &self.runtime {
        Some(handle) => {
            let path = self.path.clone();
            let store = Arc::clone(&self.store);
            handle
                .spawn(async move { f(&store, &path).await })
                .map_ok_or_else(
                    |e| match e.try_into_panic() {
                        Err(e) => Err(ParquetError::External(Box::new(e))),
                        Ok(p) => std::panic::resume_unwind(p),
                    },
                    |res| res.map_err(|e| e.into()),
                )
                .boxed()
        }
        None => f(&self.store, &self.path).map_err(|e| e.into()).boxed(),
    }
}
```
Nice, yeah looks good, I will take another look at the parquet impl
@jecsand838 and @EmilyMatt -- how is this PR looking?

I had actually just returned to work on it 2 days ago. I'm still having some issues with the schema now being provided; due to the problems I've described, @jecsand838 suggested removing the Arrow schema, and I'm starting to think that is the only viable way for now.

Hope to push another version today and address some of the things above.

@jecsand838 I've shamelessly plagiarized the API for the object reader from the parquet crate, but that's OK IMO; it lays the foundations for a common API in a few versions.
Which issue does this PR close?
Rationale for this change
Allows for proper file splitting within an asynchronous context.
What changes are included in this PR?
The raw implementation, allowing for file splitting, starting mid-block (reading until a sync marker is found), and reading further until the end of the final block is found.
This reader currently requires that a reader_schema be provided if type promotion, schema evolution, or projection are desired.
This is because #8928 is currently blocking proper parsing from an ArrowSchema.
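A rough usage sketch of the intended call pattern (builder method names here are assumptions and may differ from the final API):

```rust
use std::sync::Arc;

use object_store::{memory::InMemory, path::Path, ObjectStore};

// Sketch only: with the `object_store` feature, wrap a store in the
// feature-gated file reader and hand it to the async Avro reader.
async fn read_split(file_size: u64) -> Result<(), ArrowError> {
    let store: Arc<dyn ObjectStore> = Arc::new(InMemory::new());
    let reader = ObjectStoreFileReader::new(store, Path::from("data.avro"));
    // A mid-file range: the reader scans forward to the next sync marker
    // before decoding, and completes the final block even if it crosses
    // the end of the range.
    let _stream = AsyncAvroReader::builder(reader, file_size) // assumed post-refactor signature
        .with_range(512..1024) // assumed builder method
        .build()
        .await?;
    Ok(())
}
```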
Are these changes tested?
Yes
Are there any user-facing changes?
Only the additions. Other changes are internal to the crate (namely the way the Decoder is created from its parts).