Add a 'prefetch' option to ParquetRecordBatchStream to load the next row group while decoding #6676

Open
masonh22 wants to merge 3 commits into master

Conversation

@masonh22 commented Nov 3, 2024

Which issue does this PR close?

Closes #6559.

Rationale for this change

This improves performance when reading from filesystems with high latency and/or low bandwidth.

What changes are included in this PR?

This PR adds an option to ParquetRecordBatchStream that loads the next row group while the current one is being decoded. It introduces a new Prefetch stream state; in this state, a future for the next row group is polled before data is returned from the current ParquetRecordBatchReader.
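
For readers new to the pattern, here is a minimal, self-contained sketch of the overlap this enables. It deliberately uses stand-in types (a "row group" is just a queue of batches produced by a boxed future) rather than the real StreamState machinery, so every name in it is illustrative and not the PR's actual code.

```rust
use std::collections::VecDeque;
use std::future::Future;
use std::pin::Pin;
use std::task::{Context, Poll};

use futures::Stream;

// Stand-in types: a "row group" is just a queue of already-decoded batches,
// and loading one is an opaque boxed future (e.g. an object-store read).
type Batch = Vec<i64>;
type RowGroupFuture = Pin<Box<dyn Future<Output = VecDeque<Batch>> + Send>>;

struct PrefetchingStream {
    /// Batches of the row group currently being served.
    current: VecDeque<Batch>,
    /// In-flight load of the next row group, if any.
    prefetch: Option<RowGroupFuture>,
    /// A row group that finished loading before `current` ran dry.
    ready_next: Option<VecDeque<Batch>>,
    /// Loads not yet started; each becomes `prefetch` in turn.
    remaining: VecDeque<RowGroupFuture>,
}

impl Stream for PrefetchingStream {
    type Item = Batch;

    fn poll_next(self: Pin<&mut Self>, cx: &mut Context<'_>) -> Poll<Option<Batch>> {
        let this = self.get_mut();
        loop {
            // Kick off the next load as soon as nothing is in flight or buffered.
            if this.prefetch.is_none() && this.ready_next.is_none() {
                this.prefetch = this.remaining.pop_front();
            }
            // Drive the in-flight load so its I/O overlaps with decoding.
            if let Some(mut fut) = this.prefetch.take() {
                match fut.as_mut().poll(cx) {
                    Poll::Ready(group) => this.ready_next = Some(group),
                    Poll::Pending => this.prefetch = Some(fut),
                }
            }
            // Serve batches from the current row group first.
            if let Some(batch) = this.current.pop_front() {
                return Poll::Ready(Some(batch));
            }
            // Current row group exhausted: switch to the prefetched one if ready.
            match this.ready_next.take() {
                Some(next) => this.current = next,
                // Still loading: its waker was registered by the poll above.
                None if this.prefetch.is_some() => return Poll::Pending,
                // Nothing in flight and nothing left to start: stream is done.
                None => return Poll::Ready(None),
            }
        }
    }
}
```

The point is the ordering inside poll_next: the in-flight load is driven on every call, so fetch latency is hidden behind the batches still being served from the current row group.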

Are there any user-facing changes?

Yes: ParquetRecordBatchStreamBuilder gains a new prefetch option, set via a with_prefetch method.
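
For illustration, a minimal sketch of what enabling the option might look like from user code. The builder call is from the existing parquet async reader API; the argument to with_prefetch (a bool below) and the file path are assumptions, since the exact signature isn't shown here.

```rust
use futures::TryStreamExt;
use parquet::arrow::ParquetRecordBatchStreamBuilder;

#[tokio::main]
async fn main() -> Result<(), Box<dyn std::error::Error>> {
    // Hypothetical input file.
    let file = tokio::fs::File::open("data.parquet").await?;

    // `with_prefetch` is the new builder option added by this PR; the boolean
    // argument is assumed here and may differ from the actual signature.
    let mut stream = ParquetRecordBatchStreamBuilder::new(file)
        .await?
        .with_prefetch(true)
        .build()?;

    // While batches from the current row group are being decoded and returned,
    // the stream drives the fetch of the next row group in the background.
    while let Some(batch) = stream.try_next().await? {
        println!("read {} rows", batch.num_rows());
    }
    Ok(())
}
```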

commit 28b2bf1
Author: Mason Hall <[email protected]>
Date:   Fri Oct 18 15:49:31 2024 -0400

    Cleaned up prefetch and added a test

commit 3d7e018
Author: Mason Hall <[email protected]>
Date:   Fri Oct 18 13:32:22 2024 -0400

    prefetch working
@github-actions bot added the parquet (Changes to the parquet crate) label Nov 3, 2024
@tustvold (Contributor) left a comment

I'll try to find some time over the next week to look into this, but I'm afraid it may be a while before I can sit down with it; the logic here is rather subtle, and breakages can be very hard to spot or detect.

},
}
StreamState::Prefetch(batch_reader, f) => {
    let mut noop_cx = Context::from_waker(futures::task::noop_waker_ref());
@tustvold (Contributor) commented on the code above:

What is the rationale for doing this?

@masonh22 (Author) replied:

I wanted to avoid any potential overhead from using the real context when polling the future here. Since we're always returning Poll::Ready out of this state (or transitioning to another state), we don't need to rely on the real context to wake the main stream future.

I'm not an expert in async Rust, though, so if it would make more sense to do something else here, I'm happy to make that change.
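
For readers following the thread, here is a standalone sketch of the technique being discussed: polling a future with a no-op waker. It is illustrative only, not the PR's code, and the helper name is made up. The caveat is that a future polled only with a no-op waker never wakes the task, so this is only sound when the surrounding state machine is guaranteed to be polled again for another reason, e.g. because it keeps returning Poll::Ready.

```rust
use std::future::Future;
use std::pin::Pin;
use std::task::{Context, Poll};

/// Illustrative helper (not from the PR): poll `fut` once without registering
/// a waker. If this returns `None`, nothing will ever wake the task on behalf
/// of `fut`, so the caller must be polled again for some other reason
/// (here, because the stream keeps returning `Poll::Ready` batches).
fn poll_opportunistically<F: Future + Unpin>(fut: &mut F) -> Option<F::Output> {
    let mut noop_cx = Context::from_waker(futures::task::noop_waker_ref());
    match Pin::new(fut).poll(&mut noop_cx) {
        Poll::Ready(out) => Some(out),
        Poll::Pending => None,
    }
}
```

Polling with the caller's real cx instead would register the stream's waker with the underlying I/O, which costs a little bookkeeping but removes the dependence on the "always polled again" invariant.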

Labels: parquet (Changes to the parquet crate)
Projects: None yet
Development: Successfully merging this pull request may close these issues:
    ParquetRecordBatchStream API to fetch the next row group while decoding
2 participants