Refactor distinct hash join to handle multiple probes with the same build table #17609

PointKernel · 2024-12-17T02:41:48Z

Description

This PR updates the distinct join implementation to allow the same build table to be reused for multiple probe operations. It also introduces several breaking changes, including removing the need for users to specify whether the input data contains nested columns. Additionally, the output order has been updated to align with the hash join behavior, with probe indices now appearing on the left and build indices on the right.
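The build-once, probe-many pattern and the new output order can be pictured with a small host-side analogue. This is only a sketch: `toy_distinct_join` and its `inner_join` member are hypothetical names built on `std::unordered_map`, not the cudf API or its GPU implementation.

```cpp
#include <cstddef>
#include <unordered_map>
#include <utility>
#include <vector>

// Hypothetical host-side analogue; NOT the cudf::distinct_hash_join API.
class toy_distinct_join {
 public:
  // Build step: index the (assumed distinct) build keys exactly once.
  explicit toy_distinct_join(std::vector<int> const& build)
  {
    for (std::size_t i = 0; i < build.size(); ++i) { _index.emplace(build[i], i); }
  }

  // Probe step: can be called repeatedly against the same build index.
  // Each result pair is (probe index, build index), i.e. probe on the left.
  std::vector<std::pair<std::size_t, std::size_t>> inner_join(std::vector<int> const& probe) const
  {
    std::vector<std::pair<std::size_t, std::size_t>> out;
    for (std::size_t i = 0; i < probe.size(); ++i) {
      auto const it = _index.find(probe[i]);
      if (it != _index.end()) { out.emplace_back(i, it->second); }
    }
    return out;
  }

 private:
  std::unordered_map<int, std::size_t> _index;  // build key -> build row index
};
```

Constructing one `toy_distinct_join` and calling `inner_join` with several different probe vectors reuses the same index each time, which is the usage this PR is meant to enable for the real join.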

The PR leverages the new conditional query API in the cuco hash set, enabling more efficient handling of nullable data. While this optimization improves performance, it is not currently reflected in benchmarks due to the absence of a dedicated test case for this scenario.

Checklist

I am familiar with the Contributing Guidelines.
New or existing tests cover these changes.
The documentation is up to date with these changes.

copy-pr-bot · 2024-12-17T02:41:51Z

Auto-sync is disabled for draft pull requests in this repository. Workflows must be run manually.

Contributors can view more details about this message here.

PointKernel · 2024-12-17T21:21:54Z

/ok to test

PointKernel · 2024-12-17T21:24:29Z

cpp/include/cudf/join.hpp

@@ -469,20 +461,19 @@ class hash_join {
    rmm::device_async_resource_ref mr = cudf::get_current_device_resource_ref()) const;

 private:
-  const std::unique_ptr<impl_type const> _impl;
+  std::unique_ptr<impl_type const> _impl;


Avoid const data member per https://quuxplusone.github.io/blog/2022/01/23/dont-const-all-the-things/#data-members-never-const
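A minimal sketch of why the const data member matters, using hypothetical stand-in types rather than the real `hash_join` class: a `const std::unique_ptr` member makes the enclosing type neither move-constructible nor move-assignable, while dropping the top-level `const` keeps it movable and still leaves the pointee read-only.

```cpp
#include <memory>
#include <type_traits>

struct impl {};  // hypothetical stand-in for an implementation type

// Top-level const on the member: implicit move operations are deleted.
struct with_const_member {
  const std::unique_ptr<impl const> _impl;
};

// No top-level const: the class is movable, the pointee is still const.
struct without_const_member {
  std::unique_ptr<impl const> _impl;
};

static_assert(!std::is_move_constructible<with_const_member>::value,
              "a const unique_ptr member cannot be moved from");
static_assert(!std::is_move_assignable<with_const_member>::value,
              "a const member cannot be assigned to");
static_assert(std::is_move_constructible<without_const_member>::value,
              "non-const member keeps move construction");
static_assert(std::is_move_assignable<without_const_member>::value,
              "non-const member keeps move assignment");
```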

jlowe

Java approval

bdice

This looks great to me, I have no comments. Possible future work: I think we might want to do this same kind of thing for anti/semi-joins. Those APIs are also one-step right now, but I would like a two-step API for use in Velox.

PointKernel · 2024-12-30T18:56:04Z

I think we might want to do this same kind of thing for anti/semi-joins. Those APIs are also one-step right now, but I would like a two-step API for use in Velox.

That makes sense. I have it on my radar: #13700

vyasr · 2024-12-30T20:43:10Z

Note that this will require some meaningful refactoring of the implementation.

ttnghia · 2025-01-01T06:21:00Z

cpp/include/cudf/detail/distinct_hash_join.cuh

+                                           cudf::detail::cuco_allocator<char>,
+                                           cuco_storage_type>;
+
+  bool _has_nulls;           ///< True if nulls are present in either build table or probe table


Since we do not specify the probe table, and we may not know anything about the probe tables in the future, should we always set this to true? Otherwise, if false is specified but the probe table has nulls, the result may be incorrect.

I remember that we've dealt with this same issue before, with some previous version of this join code.

I'm following the same null check pattern used in hash_join, so I assume downstream users like Spark are handling this correctly. If needed, I'm happy to open a separate PR to update both distinct_hash_join and hash_join to improve this.

Oh actually you reviewed that PR. Here it is: #13120.

The bug was fixed by checking has_nulls on both tables, as in https://github.com/rapidsai/cudf/pull/13120/files#diff-734dd746efe774e94501b6aa986e7824656182f85bf9af8bf30036c002ca1b82R82.

I'm not sure whether Spark handles this correctly, but given that we stumbled upon the same issue before, I strongly suspect that it does not.

I think Spark handles this properly, e.g.:

cudf/java/src/main/native/src/TableJni.cpp, lines 2904 to 2906 in e8935b9:

    auto has_nulls = cudf::has_nested_nulls(left) || cudf::has_nested_nulls(right)
                       ? cudf::nullable_join::YES
                       : cudf::nullable_join::NO;

Are you suggesting that we should remove _has_nulls as a data member, since it is always true?

Yes, I think so. Previously we had both tables, so we could compute has_nulls. Now we have only one table, so this variable should be removed and the corresponding nullate value should always be true.

ttnghia · 2025-01-02T22:48:59Z

cpp/src/join/distinct_hash_join.cu

+  auto const has_nulls        = _has_nulls or cudf::has_nested_nulls(probe);
+  auto const d_probe_hasher   = probe_row_hasher.device_hasher(nullate::DYNAMIC{has_nulls});


I still see a problem here. When has_nulls is true because of nulls in the probe table but _has_nulls is false, the build table was hashed differently from the probe table. As a result, two identical rows in the two tables may hash to different values, which leads to incorrect join output.
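The concern can be pictured with a toy hasher. This is an illustration only: `toy_row_hasher` is hypothetical and is not cudf's row hasher, and the assumption that the null-aware mode folds a validity tag into the hash is made purely for the example. The point is that the build and probe sides must agree on a single null-handling mode.

```cpp
#include <cstddef>
#include <cstdint>
#include <functional>

// Single-column row; `valid == false` models a null.
struct row {
  std::int32_t value;
  bool valid;
};

// Toy hasher, NOT cudf's. Assumption for illustration only: the null-aware
// mode mixes a validity tag into the hash, the other mode ignores validity.
struct toy_row_hasher {
  bool check_nulls;

  std::size_t operator()(row const& r) const
  {
    std::size_t const value_hash = std::hash<std::int32_t>{}(r.value);
    if (!check_nulls) { return value_hash; }   // validity is ignored
    std::size_t const tag = r.valid ? 1 : 2;   // validity participates
    return value_hash * 31 + tag;
  }
};

// True when build-side and probe-side hashers agree on the hash of `r`.
inline bool hashes_agree(bool build_check_nulls, bool probe_check_nulls, row const& r)
{
  return toy_row_hasher{build_check_nulls}(r) == toy_row_hasher{probe_check_nulls}(r);
}
```

In this sketch, `hashes_agree(false, true, row{42, true})` is false even though the row is identical and valid on both sides, whereas using the same mode on both sides always agrees. Fixing the mode to null-aware for both build and probe removes the dependence on what the probe table contains.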

I finally get what you mean. Should we do the same cleanup for hash_join as well?

Oh, so the same issue was already merged before? I didn't know that (or maybe I don't remember). Yes, I think this is a real bug and should be fully addressed. If it was merged earlier and is unrelated to this PR, we can fix it separately.

distinct hash join is fixed in this PR. I will address hash_join separately.

PointKernel · 2025-01-04T05:05:46Z

/merge

PointKernel added 2 commits December 16, 2024 18:12

Improve distinct hash join to handle multiple probes with the same build table

bb1679c

Merge remote-tracking branch 'upstream/branch-25.02' into refactor-distinct-join

7c8b136

PointKernel added the feature request (New feature or request), 2 - In Progress (Currently a work in progress), libcudf (Affects libcudf (C++/CUDA) code), and breaking (Breaking change) labels Dec 17, 2024

PointKernel self-assigned this Dec 17, 2024

PointKernel added 10 commits December 16, 2024 19:32

Update docs + minor cleanups

2eb6b73

Merge remote-tracking branch 'upstream/branch-25.02' into refactor-distinct-join

40b6587

East const

8c4d27c

No const data member

1814a6a

Minor cleanup: get rid of template hash

b247711

Fix leftovers

fdb6646

Use conditional find for nullable probe table

33cd099

Merge remote-tracking branch 'upstream/branch-25.02' into refactor-distinct-join

0f08114

Update java bindings

a7ee992

Fix distinct join output order

3aaaed5

github-actions bot added the Java (Affects Java cuDF API) label Dec 17, 2024

PointKernel commented Dec 17, 2024

PointKernel added the 3 - Ready for Review (Ready for review by team) label and removed the 2 - In Progress (Currently a work in progress) label Dec 18, 2024

PointKernel marked this pull request as ready for review December 18, 2024 00:02

PointKernel requested review from a team as code owners December 18, 2024 00:02

PointKernel requested review from bdice and pmattione-nvidia December 18, 2024 00:02

PointKernel added the cuco (cuCollections related issue) label Dec 18, 2024

jlowe approved these changes Dec 19, 2024

bdice approved these changes Dec 23, 2024

ttnghia self-requested a review December 28, 2024 17:21

ttnghia reviewed Jan 1, 2025

cpp/include/cudf/detail/distinct_hash_join.cuh (outdated, resolved)

ttnghia reviewed Jan 1, 2025

PointKernel added 3 commits January 2, 2025 11:28

Merge remote-tracking branch 'upstream/branch-25.02' into refactor-distinct-join

fe06dbf

Update docs

302c692

PointKernel requested a review from ttnghia January 2, 2025 20:24

PointKernel added 2 commits January 2, 2025 14:37

Remove has_nulls data member

689b02b

Remove a leftover

74bc708

ttnghia reviewed Jan 2, 2025

Fix nullate determination

7a263ec

ttnghia approved these changes Jan 2, 2025

Merge branch 'branch-25.02' into refactor-distinct-join

8d0a61c

rapids-bot bot merged commit 62d72df into rapidsai:branch-25.02 Jan 4, 2025
105 of 106 checks passed

PointKernel deleted the refactor-distinct-join branch January 4, 2025 05:05
