Refactor distinct hash join to handle multiple probes with the same build table #17609

Merged
merged 19 commits into rapidsai:branch-25.02 from refactor-distinct-join
Jan 4, 2025

Conversation

PointKernel
Member

@PointKernel PointKernel commented Dec 17, 2024

Description

This PR updates the distinct join implementation to allow the same build table to be reused for multiple probe operations. It also introduces several breaking changes, including removing the need for users to specify whether the input data contains nested columns. Additionally, the output order has been updated to align with the hash join behavior, with probe indices now appearing on the left and build indices on the right.

The PR leverages the new conditional query API in the cuco hash set, enabling more efficient handling of nullable data. While this optimization improves performance, it is not currently reflected in benchmarks due to the absence of a dedicated test case for this scenario.
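The build-once, probe-many pattern and the new output order can be illustrated with a small CPU-only sketch. This is a toy model under stated assumptions, not the libcudf API: `toy_distinct_join` and its `inner_join` member are hypothetical names, keys are plain `int`s, and a `std::unordered_map` stands in for the cuco hash set.

```cpp
#include <cassert>
#include <cstddef>
#include <unordered_map>
#include <utility>
#include <vector>

// Toy CPU model (NOT the libcudf API): a join object that owns a hash map
// built once from distinct build keys and can be probed many times.
class toy_distinct_join {
 public:
  explicit toy_distinct_join(std::vector<int> const& build_keys) {
    for (std::size_t i = 0; i < build_keys.size(); ++i) {
      bool const inserted = _map.emplace(build_keys[i], i).second;
      assert(inserted && "build keys are assumed to be distinct");
      (void)inserted;  // silence unused-variable warnings under NDEBUG
    }
  }

  // Returns {probe_indices, build_indices}: probe on the left, build on the right.
  std::pair<std::vector<std::size_t>, std::vector<std::size_t>> inner_join(
    std::vector<int> const& probe_keys) const {
    std::vector<std::size_t> probe_idx, build_idx;
    for (std::size_t i = 0; i < probe_keys.size(); ++i) {
      auto const it = _map.find(probe_keys[i]);
      if (it != _map.end()) {
        probe_idx.push_back(i);
        build_idx.push_back(it->second);
      }
    }
    return {std::move(probe_idx), std::move(build_idx)};
  }

 private:
  std::unordered_map<int, std::size_t> _map;  // key -> row index in the build table
};
```

In this sketch the map is populated once in the constructor, and each call to `inner_join` only reads it, so the same object can serve any number of probe tables.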

Checklist

  • I am familiar with the Contributing Guidelines.
  • New or existing tests cover these changes.
  • The documentation is up to date with these changes.

@PointKernel PointKernel added the feature request (New feature or request), 2 - In Progress (Currently a work in progress), libcudf (Affects libcudf (C++/CUDA) code), and breaking (Breaking change) labels Dec 17, 2024

copy-pr-bot bot commented Dec 17, 2024

Auto-sync is disabled for draft pull requests in this repository. Workflows must be run manually.

Contributors can view more details about this message here.

@PointKernel PointKernel self-assigned this Dec 17, 2024
@github-actions github-actions bot added the Java (Affects Java cuDF API) label Dec 17, 2024
@PointKernel
Member Author

/ok to test

@@ -469,20 +461,19 @@ class hash_join {
     rmm::device_async_resource_ref mr = cudf::get_current_device_resource_ref()) const;

  private:
-  const std::unique_ptr<impl_type const> _impl;
+  std::unique_ptr<impl_type const> _impl;
Member Author

@PointKernel PointKernel added the 3 - Ready for Review (Ready for review by team) label and removed the 2 - In Progress (Currently a work in progress) label Dec 18, 2024
@PointKernel PointKernel marked this pull request as ready for review December 18, 2024 00:02
@PointKernel PointKernel requested review from a team as code owners December 18, 2024 00:02
@PointKernel PointKernel added the cuco (cuCollections related issue) label Dec 18, 2024
Member

@jlowe jlowe left a comment

Java approval

Contributor

@bdice bdice left a comment

This looks great to me, I have no comments. Possible future work: I think we might want to do this same kind of thing for anti/semi-joins. Those APIs are also one-step right now, but I would like a two-step API for use in Velox.
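To make the "two-step" idea concrete for semi/anti-joins, here is a CPU-only sketch of what such an interface could look like. It is purely illustrative: `toy_filtered_join`, `semi_join`, and `anti_join` are hypothetical names, keys are plain `int`s, and nothing here reflects an existing or planned libcudf signature.

```cpp
#include <cstddef>
#include <unordered_set>
#include <vector>

// Toy CPU model (NOT a libcudf API): the right-hand keys are inserted once,
// then any number of left-hand tables can be filtered against them.
class toy_filtered_join {
 public:
  // Step 1: build the lookup structure from the right-hand keys.
  explicit toy_filtered_join(std::vector<int> const& right_keys)
    : _set(right_keys.begin(), right_keys.end()) {}

  // Step 2a: left semi join, indices of left rows that have a match on the right.
  std::vector<std::size_t> semi_join(std::vector<int> const& left_keys) const {
    return filter(left_keys, true);
  }

  // Step 2b: left anti join, indices of left rows that have no match on the right.
  std::vector<std::size_t> anti_join(std::vector<int> const& left_keys) const {
    return filter(left_keys, false);
  }

 private:
  std::vector<std::size_t> filter(std::vector<int> const& keys, bool keep_matches) const {
    std::vector<std::size_t> out;
    for (std::size_t i = 0; i < keys.size(); ++i) {
      bool const found = _set.count(keys[i]) != 0;
      if (found == keep_matches) { out.push_back(i); }
    }
    return out;
  }

  std::unordered_set<int> _set;  // distinct right-hand keys
};
```

A one-step API would rebuild `_set` on every call; splitting construction from probing lets a caller amortize the build cost across several left-hand tables.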

@ttnghia ttnghia self-requested a review December 28, 2024 17:21
@PointKernel
Member Author

> I think we might want to do this same kind of thing for anti/semi-joins. Those APIs are also one-step right now, but I would like a two-step API for use in Velox.

That makes sense. I have it on my radar: #13700

@vyasr
Contributor

vyasr commented Dec 30, 2024

Note that this will require some meaningful refactoring of the implementation.

cudf::detail::cuco_allocator<char>,
cuco_storage_type>;

bool _has_nulls; ///< True if nulls are present in either build table or probe table
Contributor

@ttnghia ttnghia Jan 1, 2025

Since we do not specify the probe table, and may not know anything about future probe tables, should we always set this to true? Otherwise, if it is set to false but the probe table has nulls, the result may be incorrect.

I remember that we've dealt with this same issue before, with some previous version of this join code.

Member Author

I'm following the same null check pattern used in hash_join, so I assume downstream users like Spark are handling this correctly. If needed, I'm happy to open a separate PR to update both distinct_hash_join and hash_join to improve this.

Contributor

Oh actually you reviewed that PR. Here it is: #13120.

The bug was fixed by checking has_nulls on both tables, as in https://github.com/rapidsai/cudf/pull/13120/files#diff-734dd746efe774e94501b6aa986e7824656182f85bf9af8bf30036c002ca1b82R82.

Contributor

I'm not sure whether Spark handles this correctly, but given that we stumbled upon the same issue before, I strongly suspect that it does not.

Member Author

I think Spark handles this properly, e.g.:

auto has_nulls = cudf::has_nested_nulls(left) || cudf::has_nested_nulls(right)
? cudf::nullable_join::YES
: cudf::nullable_join::NO;

Are you suggesting that we should remove _has_nulls as a data member, since it would always be true?

Contributor

Yes, I think so. Previously we had both tables, so we could compute has_nulls. Now we have only one table, so this variable should be removed and the corresponding nullate value should always be true.

Member Author

done

@PointKernel PointKernel requested a review from ttnghia January 2, 2025 20:24
Comment on lines 159 to 160
auto const has_nulls = _has_nulls or cudf::has_nested_nulls(probe);
auto const d_probe_hasher = probe_row_hasher.device_hasher(nullate::DYNAMIC{has_nulls});
Contributor

I still see a problem here. When has_nulls is true because of nulls in the probe table but _has_nulls is false, the build table was hashed differently from the probe table. As a result, two identical rows in the two tables may hash to different values, which leads to incorrect join output.
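A deliberately simplified model shows one way that mismatched null handling between the two sides can break lookups. Everything below is an assumption made for illustration: `slot`, `toy_hash`, and `same_hash` are hypothetical, and cudf's actual row hasher may differ in how its null-awareness flag affects the hash values.

```cpp
#include <cstddef>
#include <limits>

// Toy nullable element (NOT a cudf type): raw payload plus a validity flag.
struct slot {
  int payload;  // raw storage; its value is meaningless when valid == false
  bool valid;
};

// Hypothetical hasher: with check_nulls == false it never inspects validity
// and hashes whatever payload is stored, even for a null slot.
inline std::size_t toy_hash(slot s, bool check_nulls) {
  if (check_nulls && !s.valid) { return std::numeric_limits<std::size_t>::max(); }
  return static_cast<std::size_t>(s.payload) * 2654435761u;
}

// True when a build-side slot and a probe-side slot land on the same hash.
inline bool same_hash(slot build, bool build_check, slot probe, bool probe_check) {
  return toy_hash(build, build_check) == toy_hash(probe, probe_check);
}
```

In this toy, two null slots hash identically only when both sides check validity; if one side skips the check, the null is hashed by its stale payload and a lookup that should match (under null-equals-null semantics) misses. Valid slots are unaffected by the flag here, which is why always enabling the check on both sides is the safe default in the sketch.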

Member Author

I finally get what you mean. Should we do the same cleanup for hash_join as well?

Contributor

@ttnghia ttnghia Jan 2, 2025

Oh, so the same issue was already merged before? I didn't know that (or maybe I don't remember). Yes, I think this is a real bug and should be fully addressed. If that issue was merged earlier and is unrelated to this PR, we can fix it separately.

Member Author

Distinct hash join is fixed in this PR. I will address hash_join separately.

@PointKernel
Member Author

/merge

@rapids-bot rapids-bot bot merged commit 62d72df into rapidsai:branch-25.02 Jan 4, 2025
105 of 106 checks passed
@PointKernel PointKernel deleted the refactor-distinct-join branch January 4, 2025 05:05
Labels
3 - Ready for Review (Ready for review by team), breaking (Breaking change), cuco (cuCollections related issue), feature request (New feature or request), Java (Affects Java cuDF API), libcudf (Affects libcudf (C++/CUDA) code)
5 participants