@pepijnve
Contributor

@pepijnve pepijnve commented Oct 31, 2025

Which issue does this PR close?

Rationale for this change

The algorithms proposed in this PR originate from the CASE logic in DataFusion (see datafusion#18152 and datafusion#18444). I think it would be useful to move them to arrow-rs rather than leaving them tucked away in a corner of the DataFusion codebase.

What changes are included in this PR?

Adds two-way and n-way merge algorithms that sit halfway between zip and interleave. In contrast to zip, the truthy and falsy arrays do not need to be prealigned. In contrast to interleave, the relative order of elements in each input array is retained in the final result.
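
To make the semantics above concrete, here is a minimal sketch over plain slices rather than Arrow arrays. It reflects one reading of the description: every mask slot consumes the next unused element of the selected input. The names `merge_two` and `merge_n`, their signatures, and the `Option` return for exhausted inputs are hypothetical and are not the API added by this PR.

```rust
/// Two-way merge sketch (hypothetical helper, not the arrow-rs API).
/// For each mask slot, take the next unused element from `truthy` when the
/// mask is true, otherwise the next unused element from `falsy`.
/// Returns `None` if an input runs out before the mask is exhausted.
fn merge_two<T: Clone>(mask: &[bool], truthy: &[T], falsy: &[T]) -> Option<Vec<T>> {
    let mut t = truthy.iter();
    let mut f = falsy.iter();
    mask.iter()
        .map(|&take_truthy| {
            let next = if take_truthy { t.next() } else { f.next() };
            next.cloned()
        })
        .collect()
}

/// N-way merge sketch (hypothetical helper, not the arrow-rs API).
/// `indices[k]` names the input that supplies output slot `k`; each input
/// is consumed front to back, so its relative order is preserved.
/// Returns `None` on an out-of-range index or an exhausted input.
fn merge_n<T: Clone>(inputs: &[&[T]], indices: &[usize]) -> Option<Vec<T>> {
    // One read cursor per input.
    let mut cursors = vec![0usize; inputs.len()];
    let mut out = Vec::with_capacity(indices.len());
    for &i in indices {
        let input = inputs.get(i)?;
        let value = input.get(cursors[i])?;
        out.push(value.clone());
        cursors[i] += 1;
    }
    Some(out)
}
```

With a mask of `[true, false, false, true]`, `truthy = ["a", "b"]` and `falsy = ["x", "y"]`, this sketch yields `["a", "x", "y", "b"]`. Note that `truthy` only needs as many elements as the mask has true slots, whereas zip would require both inputs to match the mask length.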

Are these changes tested?

I've added two minimal unit tests so far; more should probably be added.

Are there any user-facing changes?

No breaking API changes

@github-actions github-actions bot added the arrow Changes to the arrow crate label Oct 31, 2025
@pepijnve
Contributor Author

pepijnve commented Oct 31, 2025

The optimisation work that was done in #8653 would make sense here as well. That has not been done yet.

Contributor

@alamb alamb left a comment

Thanks @pepijnve -- what do you think about also adding benchmarks for this kernel (so that future optimizations work better)?

@pepijnve
Contributor Author

> what do you think about also adding benchmarks to this kernel

Good idea. I’m happy to continue working on this one. I opened the PR early to get the ball rolling and solicit input from other devs.

@pepijnve
Contributor Author

pepijnve commented Nov 2, 2025

> The optimisation work that was done in #8653 would make sense here as well. That has not been done yet.

While looking into this, I realised that merge on scalars is effectively identical to zip, so I resolved this by delegating to zip when the input is scalar.
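
The equivalence can be illustrated with a small sketch on plain values, assuming both inputs are scalars. A scalar behaves like an endless run of the same value, so "take the next unused element" always yields that value, which is exactly what zip produces. The function names here are hypothetical and do not correspond to the arrow-rs implementation.

```rust
/// Zip with two scalars (hypothetical helper): pick `truthy` or `falsy`
/// for each mask slot.
fn zip_scalars<T: Clone>(mask: &[bool], truthy: &T, falsy: &T) -> Vec<T> {
    mask.iter()
        .map(|&m| if m { truthy.clone() } else { falsy.clone() })
        .collect()
}

/// Merge with two scalars (hypothetical helper): model each scalar as an
/// endless stream and consume the next element of the selected stream.
fn merge_scalars<T: Clone>(mask: &[bool], truthy: &T, falsy: &T) -> Vec<T> {
    let mut t = std::iter::repeat(truthy);
    let mut f = std::iter::repeat(falsy);
    mask.iter()
        .map(|&m| {
            let next = if m { t.next() } else { f.next() };
            // `repeat` never ends, so `next` is always `Some`.
            next.expect("repeat yields forever").clone()
        })
        .collect()
}
```

Because no cursor position can ever change which value a scalar yields, both functions return the same output for any mask, which is why delegating to zip is safe in this case.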

@pepijnve
Contributor Author

pepijnve commented Nov 2, 2025

> what do you think about also adding benchmarks to this kernel

@alamb I duplicated the microbenchmark for zip as a quick fix. Is it worth trying to actually share the sets of input data and masks? If so, where should I move that code?

Development

Successfully merging this pull request may close these issues.

Provide algorithm that allows zipping arrays whose values are not prealigned

2 participants