-
Notifications
You must be signed in to change notification settings - Fork 916
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Refactor distinct hash join to handle multiple probes with the same build table #17609
Refactor distinct hash join to handle multiple probes with the same build table #17609
Conversation
Auto-sync is disabled for draft pull requests in this repository. Workflows must be run manually. Contributors can view more details about this message here. |
/ok to test |
@@ -469,20 +461,19 @@ class hash_join { | |||
rmm::device_async_resource_ref mr = cudf::get_current_device_resource_ref()) const; | |||
|
|||
private: | |||
const std::unique_ptr<impl_type const> _impl; | |||
std::unique_ptr<impl_type const> _impl; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Avoid const data member per https://quuxplusone.github.io/blog/2022/01/23/dont-const-all-the-things/#data-members-never-const
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Java approval
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This looks great to me, I have no comments. Possible future work: I think we might want to do this same kind of thing for anti/semi-joins. Those APIs are also one-step right now, but I would like a two-step API for use in Velox.
That makes sense. I have it on my radar: #13700 |
Note that this will require some meaningful refactoring of the implementation. |
cudf::detail::cuco_allocator<char>, | ||
cuco_storage_type>; | ||
|
||
bool _has_nulls; ///< True if nulls are present in either build table or probe table |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Since we do not specify the probe table, and we may not know anything about the probe tables in the future, should we set this always be true
? Otherwise, specifying false
but the probe table has nulls then the result may be incorrect.
I remember that we've dealt with this same issue before, with some previous version of this join code.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm following the same null check pattern used in hash_join
, so I assume downstream users like Spark are handling this correctly. If needed, I'm happy to open a separate PR to update both distinct_hash_join
and hash_join
to improve this.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Oh actually you reviewed that PR. Here it is: #13120.
The bug was fixed by checking hash_nulls
on both table, like in https://github.com/rapidsai/cudf/pull/13120/files#diff-734dd746efe774e94501b6aa986e7824656182f85bf9af8bf30036c002ca1b82R82.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not sure if Spark can handle this correctly but assuming that we stumbled upon the same issue before, I would highly suspect that it couldn't do it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think Spark handles this properly, e.g.:
cudf/java/src/main/native/src/TableJni.cpp
Lines 2904 to 2906 in e8935b9
auto has_nulls = cudf::has_nested_nulls(left) || cudf::has_nested_nulls(right) | |
? cudf::nullable_join::YES | |
: cudf::nullable_join::NO; |
Are you suggesting that we should remove _has_null
as a data member as it's always true
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, I think so. Previously we have both tables thus we can compute has_nulls
but now we only have one table thus this variable should be removed and the corresponding nullate value should always be true
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
done
cpp/src/join/distinct_hash_join.cu
Outdated
auto const has_nulls = _has_nulls or cudf::has_nested_nulls(probe); | ||
auto const d_probe_hasher = probe_row_hasher.device_hasher(nullate::DYNAMIC{has_nulls}); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I still see problem here. When has_nulls
is true
due to nulls in the probe table but _has_nulls
is false
, the build
table was hashed by a different way than the probe table. As such, two identical rows in two tables may be hashed to different values, which lead to incorrect join output.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I finally get what you mean. Should we do the same cleanup for hash_join
as well?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Oh so the same issue was already merged before? I didn't know that (or maybe don't remember). Yes, I think this is a real bug and should be addressed completely. If that issue was already merged before and is not related to this PR, we can fix it in a separate work.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
distinct hash join is fixed in this PR. I will address hash_join
separately.
/merge |
62d72df
into
rapidsai:branch-25.02
Description
This PR updates the distinct join implementation to allow the same build table to be reused for multiple probe operations. It also introduces several breaking changes, including removing the need for users to specify whether the input data contains nested columns. Additionally, the output order has been updated to align with the hash join behavior, with probe indices now appearing on the left and build indices on the right.
The PR leverages the new conditional query API in the cuco hash set, enabling more efficient handling of nullable data. While this optimization improves performance, it is not currently reflected in benchmarks due to the absence of a dedicated test case for this scenario.
Checklist