Add merge and merge_n algorithms
#8753
The optimisation work that was done in #8653 would make sense here as well, but it has not been done yet.
Thanks @pepijnve. What do you think about also adding benchmarks for this kernel (so that future optimizations work better)?
Good idea. I’m happy to continue working on this one. I created the PR already to get the ball rolling and solicit input from other devs.
While looking into this I realised that |
@alamb I duplicated the microbenchmark for zip as a quick fix. Is it worth trying to actually share the sets of input data and masks? If so, where should I move that code?
Which issue does this PR close?
Rationale for this change
The algorithms suggested in this PR originate from the `case` logic in DataFusion (see datafusion#18152 and datafusion#18444). I think it might be useful to move them to `arrow-rs` instead of being tucked away in a corner of the DataFusion codebase.
What changes are included in this PR?
Adds a two-way and an n-way merge algorithm that sit halfway between `zip` and `interleave`. In contrast to `zip`, the truthy and falsy arrays do not need to be prealigned. In contrast to `interleave`, the relative order of elements in each input array is retained in the final result.
Are these changes tested?
I've already added two minimal unit tests; more should probably be added.
Are there any user-facing changes?
No breaking API changes
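To illustrate the merge semantics described under "What changes are included in this PR?", here is a minimal sketch on plain slices rather than Arrow arrays. The names `merge_sketch` and `merge_n_sketch` are hypothetical and are not the signatures added by this PR. The sketch ignores nulls. The idea that the n-way variant is driven by a per-row input index is an assumption made for illustration.

```rust
/// Hypothetical two-way merge on plain slices (not the arrow-rs API).
/// For each mask entry, take the *next unused* element from `truthy`
/// (mask is true) or from `falsy` (mask is false). Unlike a zip, the
/// inputs are not indexed by output position, so they need not be
/// prealigned with the mask. Returns None if an input runs out.
fn merge_sketch<T: Clone>(mask: &[bool], truthy: &[T], falsy: &[T]) -> Option<Vec<T>> {
    let mut t = truthy.iter();
    let mut f = falsy.iter();
    let mut out = Vec::with_capacity(mask.len());
    for &m in mask {
        let next = if m { t.next() } else { f.next() };
        out.push(next?.clone());
    }
    Some(out)
}

/// Hypothetical n-way variant. ASSUMPTION: each output row is tagged
/// with the index of the input it draws from; elements of each input
/// are consumed in their original order.
fn merge_n_sketch<T: Clone>(indices: &[usize], inputs: &[&[T]]) -> Option<Vec<T>> {
    let mut cursors = vec![0usize; inputs.len()];
    let mut out = Vec::with_capacity(indices.len());
    for &i in indices {
        let input = inputs.get(i)?;
        let value = input.get(cursors[i])?;
        cursors[i] += 1;
        out.push(value.clone());
    }
    Some(out)
}

fn main() {
    let mask = [true, false, false, true];
    // `truthy` holds only the two rows selected by the mask; `falsy` the other two.
    let merged = merge_sketch(&mask, &["a", "b"], &["x", "y"]);
    println!("{:?}", merged);

    let inputs: [&[i32]; 3] = [&[1, 2], &[10], &[100]];
    let merged_n = merge_n_sketch(&[0, 2, 1, 0], &inputs);
    println!("{:?}", merged_n);
}
```

In this sketch each input only needs as many elements as the mask (or index list) selects from it, and the elements of each input appear in the output in their original relative order.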