Skip to content

Commit f92ff18

Browse files
authored
Add coalesce kernel andBatchCoalescer for statefully combining selected b…atches: (#7597)
# Which issue does this PR close? - Part of #6692 - Part of #7589 # Rationale for this change The pattern of combining multiple small RecordBatches to form one larger one for subsequent processing is common in query engines like DataFusion which filter or partition incoming Arrays. Current best practice is to use the `filter` or `take` kernels and then `concat` kernels as explained in - #6692 This pattern also appears in my attempt to improve parquet filter performance (to cache the result of applying a filter rather than re-decoding the results). See - apache/datafusion#3463 - #7513 The current pattern is non optimal as it requires: 1. At least 2x peak memory (holding the input and output of `concat`) 2. 2 copies of the data (to create the output of `filter` and then create the output of `concat`) The theory is that with sufficient optimization we can reduce the peak memory requirements and (possibly) make it faster as well. However, to add a benchmark for this filter+concat, I basically had nothing to benchmark. Specifically, there needed to be an API to call. - Note I also made a PR to DataFusion showing this API can be used and it is not slower: apache/datafusion#16249 # What changes are included in this PR? I ported the code from DataFusion downstream upstream into arrow-rs so 1. We can use it in the parquet reader 2. We can benchmark and optimize it appropriately 1. Add `BatchCoalescer` to `arrow-select`, and tests 2. Update documentation 2. Add examples 3. Add a `pub` export in `arrow` 4. Add Benchmark # Are there any user-facing changes? This is a new API. I next plan to make an benchmark for this particular
1 parent 6deefb7 commit f92ff18

File tree

9 files changed

+1058
-4
lines changed

9 files changed

+1058
-4
lines changed

arrow-select/src/coalesce.rs

Lines changed: 629 additions & 0 deletions
Large diffs are not rendered by default.

arrow-select/src/filter.rs

Lines changed: 8 additions & 2 deletions
Original file line numberDiff line numberDiff line change
@@ -156,10 +156,16 @@ pub fn prep_null_mask_filter(filter: &BooleanArray) -> BooleanArray {
156156
BooleanArray::new(mask, None)
157157
}
158158

159-
/// Returns a filtered `values` [Array] where the corresponding elements of
159+
/// Returns a filtered `values` [`Array`] where the corresponding elements of
160160
/// `predicate` are `true`.
161161
///
162-
/// See also [`FilterBuilder`] for more control over the filtering process.
162+
/// # See also
163+
/// * [`FilterBuilder`] for more control over the filtering process.
164+
/// * [`filter_record_batch`] to filter a [`RecordBatch`]
165+
/// * [`BatchCoalescer`]: to filter multiple [`RecordBatch`] and coalesce
166+
/// the results into a single array.
167+
///
168+
/// [`BatchCoalescer`]: crate::coalesce::BatchCoalescer
163169
///
164170
/// # Example
165171
/// ```rust

arrow-select/src/lib.rs

Lines changed: 1 addition & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -24,6 +24,7 @@
2424
#![cfg_attr(docsrs, feature(doc_auto_cfg))]
2525
#![warn(missing_docs)]
2626

27+
pub mod coalesce;
2728
pub mod concat;
2829
mod dictionary;
2930
pub mod filter;

arrow-select/src/take.rs

Lines changed: 6 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -64,6 +64,12 @@ use num::{One, Zero};
6464
///
6565
/// When `options` is not set to check bounds, taking indexes after `len` will panic.
6666
///
67+
/// # See also
68+
/// * [`BatchCoalescer`]: to filter multiple [`RecordBatch`] and coalesce
69+
/// the results into a single array.
70+
///
71+
/// [`BatchCoalescer`]: crate::coalesce::BatchCoalescer
72+
///
6773
/// # Examples
6874
/// ```
6975
/// # use arrow_array::{StringArray, UInt32Array, cast::AsArray};

arrow/Cargo.toml

Lines changed: 6 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -161,6 +161,12 @@ name = "filter_kernels"
161161
harness = false
162162
required-features = ["test_utils"]
163163

164+
[[bench]]
165+
name = "coalesce_kernels"
166+
harness = false
167+
required-features = ["test_utils"]
168+
169+
164170
[[bench]]
165171
name = "take_kernels"
166172
harness = false

0 commit comments

Comments
 (0)