Conversation
|
The optimisation work that was done in #8653 would make sense here as well. That has not been done yet. |
Good idea. I’m happy to continue working on this one. I created the PR already to get the ball rolling and solicit input from other devs. |
While looking into this I realised that |
@alamb I duplicated the microbenchmark for zip as a quick fix. Is it worth trying to actually share the sets of input data and masks? If so, where should I move that code? |
| /// | ||
| /// ``` | ||
| pub fn merge_n(values: &[&dyn Array], indices: &[impl MergeIndex]) -> Result<ArrayRef, ArrowError> { | ||
| let data_type = values[0].data_type(); |
There was a problem hiding this comment.
There is no check for empty values array.
There was a problem hiding this comment.
Check added along with unit tests
arrow-select/src/merge.rs
Outdated
There was a problem hiding this comment.
| let mut mutable = MutableArrayData::new(vec![&truthy, &falsy], false, truthy.len()); | |
| let mut mutable = MutableArrayData::new(vec![&truthy, &falsy], false, mask.len()); |
arrow-select/src/merge.rs
Outdated
There was a problem hiding this comment.
| /// # Safety | |
| /// # Panics |
arrow-select/src/merge.rs
Outdated
There was a problem hiding this comment.
| /// An index for the [merge] function. | |
| /// An index for the [merge_n] function. |
| &mut group, | ||
| &masks, | ||
| &array_1_10pct_nulls, | ||
| &non_null_scalar_1, |
There was a problem hiding this comment.
The arguments here look exactly the same as for array_vs_non_null_scalar above. I think the last two arguments should be swapped.
There was a problem hiding this comment.
Indeed. I had copied these from zip_kernel.rs which has the same mistake. Fixing here and in zip_kernel.rs.
|
There's a failing test case, but I can't say I see the relationship with this change. |
I believe it hit this issue (which has subsequently been fixed) |
|
I merged up to rerun |
|
@martin-g I think all comments have been addressed. Thanks to the test fixes on |
There was a problem hiding this comment.
|
|
||
| /// Merges two arrays in the order specified by a boolean mask. | ||
| /// | ||
| /// This algorithm is a variant of [zip] that does not require the truthy and |
There was a problem hiding this comment.
👍 thank you for clarifying the difference
|
I'll take care of adding some extra tests. |
|
Holes in the coverage have been filled. I don't think I've covered all possible permutations, but we're at full line coverage at least. |

Which issue does this PR close?
Rationale for this change
The algorithms suggested in this PR originate from the
caselogic in DataFusion (see datafusion#18152 and datafusion#18444). I think it might be useful to move them toarrow-rsinstead of being tucked away in a corner of the DataFusion codebase.What changes are included in this PR?
Adds a two-way and n-way merge algorithm that's halfway between
zipandinterleave. In contrast tozipthe truthy and falsy arrays do not need to be prealigned. In contrast tointerleavethe relative order of elements in each input array is retained in the final result.Are these changes tested?
I've already added two minimal unit tests, more should probably be added.
Are there any user-facing changes?
No breaking API changes