Refactor distinct hash join to handle multiple probes with the same build table #17609
Merged: rapids-bot merged 19 commits into rapidsai:branch-25.02 from PointKernel:refactor-distinct-join on Jan 4, 2025 (+236 −260)
Changes from 12 commits
Commits
bb1679c Improve distinct hash join to handle multiple probes with the same bu… (PointKernel)
7c8b136 Merge remote-tracking branch 'upstream/branch-25.02' into refactor-di… (PointKernel)
2eb6b73 Update docs + minor cleanups (PointKernel)
40b6587 Merge remote-tracking branch 'upstream/branch-25.02' into refactor-di… (PointKernel)
8c4d27c East const (PointKernel)
1814a6a No const data member (PointKernel)
b247711 Minor cleanup: get rid of template hash (PointKernel)
fdb6646 Fix leftovers (PointKernel)
33cd099 Use conditional find for nullable probe table (PointKernel)
0f08114 Merge remote-tracking branch 'upstream/branch-25.02' into refactor-di… (PointKernel)
a7ee992 Update java bindings (PointKernel)
3aaaed5 Fix distinct join output order (PointKernel)
fe06dbf Merge remote-tracking branch 'upstream/branch-25.02' into refactor-di… (PointKernel)
302c692 Update docs (PointKernel)
8792f08 Update copyright year: happy 2025 (PointKernel)
689b02b Remove has_nulls data member (PointKernel)
74bc708 Remove a leftover (PointKernel)
7a263ec Fix nullate determination (PointKernel)
8d0a61c Merge branch 'branch-25.02' into refactor-distinct-join (PointKernel)
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -34,13 +34,6 @@ | |
|
||
namespace CUDF_EXPORT cudf { | ||
|
||
/** | ||
* @brief Enum to indicate whether the distinct join table has nested columns or not | ||
* | ||
* @ingroup column_join | ||
*/ | ||
enum class has_nested : bool { YES, NO }; | ||
|
||
// forward declaration | ||
namespace hashing::detail { | ||
|
||
|
@@ -61,7 +54,6 @@ class hash_join; | |
/** | ||
* @brief Forward declaration for our distinct hash join | ||
*/ | ||
template <cudf::has_nested HasNested> | ||
class distinct_hash_join; | ||
} // namespace detail | ||
|
||
|
@@ -469,20 +461,19 @@ class hash_join { | |
rmm::device_async_resource_ref mr = cudf::get_current_device_resource_ref()) const; | ||
|
||
private: | ||
const std::unique_ptr<impl_type const> _impl; | ||
std::unique_ptr<impl_type const> _impl; | ||
Review comment: avoid const data members, per https://quuxplusone.github.io/blog/2022/01/23/dont-const-all-the-things/#data-members-never-const |
||
}; | ||
|
||
/** | ||
* @brief Distinct hash join that builds hash table in creation and probes results in subsequent | ||
* `*_join` member functions | ||
* | ||
* This class enables the distinct hash join scheme that builds hash table once, and probes as many | ||
* times as needed (possibly in parallel). | ||
* | ||
* @note Behavior is undefined if the build table contains duplicates. | ||
* @note All NaNs are considered as equal | ||
* | ||
* @tparam HasNested Flag indicating whether there are nested columns in build/probe table | ||
*/ | ||
// TODO: `HasNested` to be removed via dispatching | ||
template <cudf::has_nested HasNested> | ||
class distinct_hash_join { | ||
public: | ||
distinct_hash_join() = delete; | ||
|
@@ -496,14 +487,12 @@ class distinct_hash_join { | |
* @brief Constructs a distinct hash join object for subsequent probe calls | ||
* | ||
* @param build The build table that contains distinct elements | ||
* @param probe The probe table, from which the keys are probed | ||
* @param has_nulls Flag to indicate if there exists any nulls in the `build` table or | ||
* any `probe` table that will be used later for join | ||
* @param compare_nulls Controls whether null join-key values should match or not | ||
* @param stream CUDA stream used for device memory operations and kernel launches | ||
*/ | ||
distinct_hash_join(cudf::table_view const& build, | ||
cudf::table_view const& probe, | ||
nullable_join has_nulls = nullable_join::YES, | ||
null_equality compare_nulls = null_equality::EQUAL, | ||
rmm::cuda_stream_view stream = cudf::get_default_stream()); | ||
|
@@ -512,16 +501,18 @@ class distinct_hash_join { | |
* @brief Returns the row indices that can be used to construct the result of performing | ||
* an inner join between two tables. @see cudf::inner_join(). | ||
* | ||
* @param probe The probe table, from which the keys are probed | ||
* @param stream CUDA stream used for device memory operations and kernel launches | ||
* @param mr Device memory resource used to allocate the returned indices' device memory. | ||
* | ||
* @return A pair of columns [`build_indices`, `probe_indices`] that can be used to | ||
* @return A pair of columns [`probe_indices`, `build_indices`] that can be used to | ||
* construct the result of performing an inner join between two tables | ||
* with `build` and `probe` as the join keys. | ||
*/ | ||
[[nodiscard]] std::pair<std::unique_ptr<rmm::device_uvector<size_type>>, | ||
std::unique_ptr<rmm::device_uvector<size_type>>> | ||
inner_join(rmm::cuda_stream_view stream = cudf::get_default_stream(), | ||
inner_join(cudf::table_view const& probe, | ||
rmm::cuda_stream_view stream = cudf::get_default_stream(), | ||
rmm::device_async_resource_ref mr = cudf::get_current_device_resource_ref()) const; | ||
|
||
/** | ||
|
@@ -532,19 +523,22 @@ class distinct_hash_join { | |
* the row index of the matched row from the build table if there is a match. Otherwise, contains | ||
* `JoinNoneValue`. | ||
* | ||
* @param probe The probe table, from which the keys are probed | ||
* @param stream CUDA stream used for device memory operations and kernel launches | ||
* @param mr Device memory resource used to allocate the returned table and columns' device | ||
* memory. | ||
* | ||
* @return A `build_indices` column that can be used to construct the result of | ||
* performing a left join between two tables with `build` and `probe` as the join | ||
* keys. | ||
*/ | ||
[[nodiscard]] std::unique_ptr<rmm::device_uvector<size_type>> left_join( | ||
cudf::table_view const& probe, | ||
rmm::cuda_stream_view stream = cudf::get_default_stream(), | ||
rmm::device_async_resource_ref mr = cudf::get_current_device_resource_ref()) const; | ||
|
||
private: | ||
using impl_type = typename cudf::detail::distinct_hash_join<HasNested>; ///< Implementation type | ||
using impl_type = cudf::detail::distinct_hash_join; ///< Implementation type | ||
|
||
std::unique_ptr<impl_type> _impl; ///< Distinct hash join implementation | ||
}; | ||
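The reworked interface in the diff above takes only the build table at construction and receives the probe table on each join call. A minimal CPU-only sketch of that shape follows, using `std::unordered_map` in place of the GPU hash table. The names `distinct_join_sketch` and `no_match`, and the plain `int` keys, are hypothetical stand-ins for illustration and are not part of cudf.

```cpp
#include <cassert>
#include <cstddef>
#include <limits>
#include <unordered_map>
#include <utility>
#include <vector>

// Hypothetical stand-in for the "no match" sentinel (`JoinNoneValue` in the docs above).
constexpr int no_match = std::numeric_limits<int>::min();

// CPU analogue only: the hash table is built once from distinct keys,
// and every join call receives its own probe keys.
class distinct_join_sketch {
 public:
  explicit distinct_join_sketch(std::vector<int> const& build)
  {
    for (std::size_t i = 0; i < build.size(); ++i) {
      _map.emplace(build[i], static_cast<int>(i));  // key -> build row index
    }
  }

  // Returns [probe_indices, build_indices] for the matching rows only.
  std::pair<std::vector<int>, std::vector<int>> inner_join(std::vector<int> const& probe) const
  {
    std::vector<int> probe_indices;
    std::vector<int> build_indices;
    for (std::size_t i = 0; i < probe.size(); ++i) {
      auto const it = _map.find(probe[i]);
      if (it != _map.end()) {
        probe_indices.push_back(static_cast<int>(i));
        build_indices.push_back(it->second);
      }
    }
    return {std::move(probe_indices), std::move(build_indices)};
  }

  // Returns one entry per probe row: the matched build index, or no_match.
  std::vector<int> left_join(std::vector<int> const& probe) const
  {
    std::vector<int> build_indices(probe.size(), no_match);
    for (std::size_t i = 0; i < probe.size(); ++i) {
      auto const it = _map.find(probe[i]);
      if (it != _map.end()) { build_indices[i] = it->second; }
    }
    return build_indices;
  }

 private:
  std::unordered_map<int, int> _map;  // non-const member, read-only after construction
};
```

Because both join functions are `const` and take the probe keys as an argument, one object built from a single table can serve any number of probes.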
|
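The review comment on `hash_join::_impl` concerns top-level `const` on a data member. The sketch below shows the effect with hypothetical types (`impl`, `with_const_member`, `with_plain_member`) standing in for the real classes: a `const` `std::unique_ptr` member makes the enclosing class neither move-assignable nor move-constructible, while dropping the top-level `const` keeps the pointee read-only and restores both.

```cpp
#include <cassert>
#include <memory>
#include <type_traits>
#include <utility>

// Hypothetical implementation type, for illustration only.
struct impl {
  int value = 0;
};

// Top-level const on the member: the pointer itself can never be reseated.
struct with_const_member {
  const std::unique_ptr<impl const> _impl;
};

// Only the pointee is const: the owning class stays movable.
struct with_plain_member {
  std::unique_ptr<impl const> _impl;
};

static_assert(!std::is_move_assignable_v<with_const_member>);
static_assert(!std::is_move_constructible_v<with_const_member>);
static_assert(std::is_move_assignable_v<with_plain_member>);
static_assert(std::is_move_constructible_v<with_plain_member>);

// Move-assigns one object into another and reports whether ownership transferred.
inline bool can_reseat()
{
  with_plain_member source{std::make_unique<impl>()};
  with_plain_member target{};
  target = std::move(source);
  return target._impl != nullptr && source._impl == nullptr;
}
```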
Since we do not specify the probe table here, and we may not know anything about the probe tables in the future, should this always be set to `true`? Otherwise, if `false` is specified but the probe table has nulls, the result may be incorrect. I remember that we've dealt with this same issue before, with some previous version of this join code.

I'm following the same null check pattern used in `hash_join`, so I assume downstream users like Spark are handling this correctly. If needed, I'm happy to open a separate PR to update both `distinct_hash_join` and `hash_join` to improve this.

Oh actually you reviewed that PR. Here it is: #13120. The bug was fixed by checking `hash_nulls` on both tables, like in https://github.com/rapidsai/cudf/pull/13120/files#diff-734dd746efe774e94501b6aa986e7824656182f85bf9af8bf30036c002ca1b82R82.

I'm not sure if Spark can handle this correctly but, given that we stumbled upon the same issue before, I highly suspect that it can't.

I think Spark handles this properly, e.g. cudf/java/src/main/native/src/TableJni.cpp, lines 2904 to 2906 in e8935b9. Are you suggesting that we should remove `_has_null` as a data member, as it's always `true`?

Yes, I think so. Previously we had both tables, so we could compute `has_nulls`, but now we only have one table, so this variable should be removed and the corresponding nullate value should always be `true`.

done
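
The thread concludes that, once probe tables are known only at join time, the lookup has to assume nulls may be present. The CPU-only sketch below illustrates why, under the assumption that a null row still carries some underlying value. `nullable_key`, `probe_matches`, and the `check_nulls` flag are hypothetical and do not mirror cudf's internals. For simplicity, nulls are treated as never matching.

```cpp
#include <cassert>
#include <cstddef>
#include <unordered_set>
#include <vector>

// Hypothetical nullable key: `value` is whatever data sits under a null row.
struct nullable_key {
  int value;
  bool valid;
};

// Counts probe rows that match a build key, with nulls treated as unequal.
// When check_nulls is false the validity flag is ignored, which is only
// safe if the probe rows are known to contain no nulls.
inline std::size_t probe_matches(std::unordered_set<int> const& build_keys,
                                 std::vector<nullable_key> const& probe,
                                 bool check_nulls)
{
  std::size_t matches = 0;
  for (auto const& key : probe) {
    if (check_nulls && !key.valid) { continue; }  // a null row never matches
    if (build_keys.count(key.value) != 0) { ++matches; }
  }
  return matches;
}
```

With a probe row that is null but happens to hold a value present in the build keys, the null-unaware path reports a spurious match, which is the incorrect result the first comment warns about.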