
GH-38558: [C++] Add support for null sort option per sort key #46926


Open
wants to merge 37 commits into
base: main

Conversation


@Taepper Taepper commented Jun 27, 2025

See #38584 for the original PR; its description is quoted below.

Rationale for this change

Support null placement per sort key, for example:

order by i nulls first, j, k nulls first;

Currently, null placement can only be set for all sort keys at once, not for an individual sort key, so NullPlacement is added as a field of SortKey. Because the underlying framework is well structured, the change mostly amounts to passing each SortKey's null_placement through to the sorting logic.
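As a rough illustration of the intended semantics (not code from this PR), the sketch below sorts row indices over several keys, each with its own null placement. All types here (`Column`, `SortKey`, `CompareRows`, `SortIndices`) are hypothetical stand-ins built on `std::optional<int>`, not Arrow's actual classes.

```cpp
#include <algorithm>
#include <cstddef>
#include <numeric>
#include <optional>
#include <vector>

// Hypothetical stand-ins for illustration only; not Arrow's real types.
enum class NullPlacement { AtStart, AtEnd };
enum class SortOrder { Ascending, Descending };

// A nullable integer column: std::nullopt models a null value.
using Column = std::vector<std::optional<int>>;

struct SortKey {
  const Column* column;
  SortOrder order = SortOrder::Ascending;
  NullPlacement null_placement = NullPlacement::AtEnd;  // set per key
};

// Three-way comparison of rows a and b under a single key.
// Returns <0 if a sorts first, >0 if b sorts first, 0 if tied.
int CompareRows(const SortKey& key, std::size_t a, std::size_t b) {
  const auto& lhs = (*key.column)[a];
  const auto& rhs = (*key.column)[b];
  if (!lhs.has_value() || !rhs.has_value()) {
    if (!lhs.has_value() && !rhs.has_value()) return 0;
    const bool nulls_first = key.null_placement == NullPlacement::AtStart;
    // The null side sorts first exactly when this key places nulls at start.
    return (!lhs.has_value() == nulls_first) ? -1 : 1;
  }
  if (*lhs == *rhs) return 0;
  const bool less = *lhs < *rhs;
  return (less == (key.order == SortOrder::Ascending)) ? -1 : 1;
}

// Stable multi-key sort: later keys only break ties left by earlier keys.
std::vector<std::size_t> SortIndices(const std::vector<SortKey>& keys,
                                     std::size_t num_rows) {
  std::vector<std::size_t> indices(num_rows);
  std::iota(indices.begin(), indices.end(), std::size_t{0});
  std::stable_sort(indices.begin(), indices.end(),
                   [&](std::size_t a, std::size_t b) {
                     for (const auto& key : keys) {
                       const int cmp = CompareRows(key, a, b);
                       if (cmp != 0) return cmp < 0;
                     }
                     return false;
                   });
  return indices;
}
```

With i = {1, null, 1, null} and j = {5, 3, null, 1}, sorting by i with nulls at the start and then j with nulls at the end yields the row order 3, 1, 0, 2. Flipping only j's placement to the start changes the last two rows to 2, 0, which a single global null placement cannot express.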

What changes are included in this PR?

1. SortKey structure, NullPlacement propagation logic, sorting logic, Ordering-related changes, and related tests.
2. Substrait-related changes.
3. c_glib-related changes.
4. SelectK-related changes.
5. RankOptions-related changes.

Are these changes tested?

Yes. I updated the tests in vector_sort_test.cc and ran additional tests.

Are there any user-facing changes?

Yes. Null placement can now be specified per sort key, as PostgreSQL allows in ORDER BY.

This PR includes breaking changes to public APIs.

I amended the original PR to be less breaking in public APIs.

Still, Ordering, SortOptions, RankOptions, and RankQuantileOptions now accept a std::optional<NullPlacement> instead of a NullPlacement, which led to some changes in downstream APIs and bindings. I also need some help fixing the c_glib bindings.

Light-City and others added 30 commits November 9, 2023 09:57
1. Reconstruct the SortKey structure and add NullPlacement.

2. Remove NullPlacement from SortOptions.

3. Fix SelectK not displaying non-empty results in the null AtEnd scenario.

When the limit k is greater than the number of rows in the table and the table contains Null/NaN, the data cannot be obtained and only non-empty results are available.
Therefore, we support returning non-null results and setting the Null order for each SortKey.

4. Add relevant unit tests and change the interface implemented by multiple versions.
…8558

# Conflicts:
#	c_glib/arrow-glib/compute.cpp
#	c_glib/arrow-glib/compute.h
#	cpp/src/arrow/compute/kernels/vector_rank.cc
#	cpp/src/arrow/compute/kernels/vector_select_k.cc
#	cpp/src/arrow/compute/kernels/vector_sort.cc
#	cpp/src/arrow/compute/kernels/vector_sort_internal.h
#	python/pyarrow/_acero.pyx
#	python/pyarrow/_compute.pyx
#	python/pyarrow/array.pxi
#	python/pyarrow/tests/test_compute.py
#	python/pyarrow/tests/test_table.py
# Conflicts:
#	cpp/src/arrow/compute/api_vector.cc
#	cpp/src/arrow/compute/api_vector.h
#	cpp/src/arrow/compute/kernels/vector_rank.cc
#	cpp/src/arrow/compute/kernels/vector_select_k.cc
#	cpp/src/arrow/compute/kernels/vector_sort.cc
#	cpp/src/arrow/compute/kernels/vector_sort_internal.h
#	cpp/src/arrow/compute/kernels/vector_sort_test.cc
#	cpp/src/arrow/compute/ordering.cc
#	cpp/src/arrow/compute/ordering.h
@Taepper Taepper changed the title from "GH-38558" to "GH-38558: [C++] Add support for null sort option per sort key" Jun 27, 2025
@AlenkaF AlenkaF removed their request for review June 30, 2025 04:07
@Taepper
Author

Taepper commented Jun 30, 2025

Note that I fixed the failing CI runs on my fork.

@pitrou
Member

pitrou commented Jul 1, 2025

Hi @Taepper, thanks for submitting this.

Design-wise, I think there are two possible APIs here:

  1. (As done in this PR) Add a required NullPlacement in SortKey, and make the NullPlacement in SortOptions optional and deprecated
  2. Add an optional NullPlacement in SortKey, and keep the required NullPlacement in SortOptions as fallback

Option 1 has the advantage that it's conceptually simpler once the deprecation period is over, but it comes with a minor API change and a slightly complicated deprecation period.

Option 2 is conceptually a bit more complicated (per-key + global fallback) but avoids breaking the current API and doesn't introduce any deprecation.
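To make the two shapes concrete, here is a minimal sketch of how the effective null placement for a key could be resolved under each option. The struct and function names (`SortKeyV1`, `SortOptionsV1`, `EffectivePlacementV1`, and their `V2` counterparts) are hypothetical, and the rule that a set global value overrides the per-key value in option 1 is an assumption about the deprecation period, not something confirmed in this thread.

```cpp
#include <optional>

// Hypothetical stand-ins for illustration only; not Arrow's real types.
enum class NullPlacement { AtStart, AtEnd };

// Option 1: required per-key placement, optional (deprecated) global value.
struct SortKeyV1 {
  NullPlacement null_placement = NullPlacement::AtEnd;
};
struct SortOptionsV1 {
  std::optional<NullPlacement> null_placement;  // deprecated; unset by default
};

// Assumption: while the deprecated global field still exists, a value set
// there takes precedence so that existing callers keep their behavior.
NullPlacement EffectivePlacementV1(const SortKeyV1& key,
                                   const SortOptionsV1& options) {
  return options.null_placement.value_or(key.null_placement);
}

// Option 2: optional per-key placement, required global fallback.
struct SortKeyV2 {
  std::optional<NullPlacement> null_placement;  // unset means "use fallback"
};
struct SortOptionsV2 {
  NullPlacement null_placement = NullPlacement::AtEnd;
};

// The per-key value wins when present; otherwise the global fallback applies.
NullPlacement EffectivePlacementV2(const SortKeyV2& key,
                                   const SortOptionsV2& options) {
  return key.null_placement.value_or(options.null_placement);
}
```

In this sketch the two options differ only in which side is optional. Once the deprecated field is removed, option 1 collapses to reading `key.null_placement` directly, while option 2 keeps the two-level lookup permanently.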

I'm not sure which one is better. @zanmato1984 @felipecrv Thoughts?

@Taepper
Author

Taepper commented Jul 1, 2025

Thank you for your comments!

I had the same considerations. I opted for option 1 because it provides a clear path forward for the post-deprecation API.

Maybe option 2 deserves more consideration because it avoids API breakage.

Should SelectKOptions (which did not have a NullPlacement before) receive an additional global fallback, or should it implement its fallback internally, without making it configurable?

@pitrou
Member

pitrou commented Jul 1, 2025

Well, the API breakage isn't critical either IMHO. I'd just like to have opinions from other core developers. @lidavidm Perhaps?

@lidavidm
Member

lidavidm commented Jul 1, 2025

As long as we're still doing major releases, then I think the slight breakage is acceptable in return for a nicer API in the end.

@Taepper
Author

Taepper commented Jul 1, 2025

Alright, also note that this tries to be analogous to the deprecation of skip_nulls in SetLookupOptions:

https://github.com/apache/arrow/pull/36739/files#diff-6bc7ecec6a4f7bcefc2511cde3bd809340ad0d94bb8f7cc5f4994063c798f2faR313

@pitrou
Member

pitrou commented Jul 1, 2025

> Alright, also note that this tries to be analogous to the deprecation of skip_nulls in SetLookupOptions:
>
> https://github.com/apache/arrow/pull/36739/files#diff-6bc7ecec6a4f7bcefc2511cde3bd809340ad0d94bb8f7cc5f4994063c798f2faR313

That's a good comparison point, thank you.

@zanmato1984
Contributor

I prefer the conceptually simpler model, even at the cost of API compatibility:

  1. (As done in this PR) Add a required NullPlacement in SortKey, and make the NullPlacement in SortOptions optional and deprecated

@pitrou
Member

pitrou commented Jul 2, 2025

Ok, it seems everyone agrees with this approach, so let's go for it.

5 participants