Add multi-partition Join support to cuDF-Polars #17518

Merged
rapids-bot merged 23 commits into rapidsai:branch-25.04 from cudf-polars-multi-join on Feb 4, 2025

Conversation

rjzamora
Member

@rjzamora rjzamora commented Dec 4, 2024

Description

Adds multi-partition Join support following the same design as #17441

In order to support parallel joins, this PR also introduces a special Shuffle node.
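
As a rough illustration of what a shuffle does conceptually, here is a minimal plain-Python sketch. It is not the cuDF-Polars `Shuffle` node: the function name `shuffle`, the `key_index` parameter, and the row-tuple layout are all assumptions made for this example.

```python
def shuffle(partitions, key_index, output_count):
    """Redistribute rows so that equal keys land in the same output partition.

    Hypothetical helper for illustration only; rows are plain tuples.
    """
    outputs = [[] for _ in range(output_count)]
    for part in partitions:
        for row in part:
            # Route each row by the hash of its key column.
            dest = hash(row[key_index]) % output_count
            outputs[dest].append(row)
    return outputs


# Two input partitions; key 1 appears in both of them.
inputs = [[(1, "a"), (2, "b")], [(1, "c"), (3, "d")]]
shuffled = shuffle(inputs, key_index=0, output_count=2)
```

After the shuffle, every row with a given key lives in exactly one output partition, which is the property a partition-wise join relies on.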

Checklist

  • I am familiar with the Contributing Guidelines.
  • New or existing tests cover these changes.
  • The documentation is up to date with these changes.

@rjzamora added the feature request (New feature or request), 2 - In Progress (Currently a work in progress), and non-breaking (Non-breaking change) labels on Dec 4, 2024
@rjzamora rjzamora self-assigned this Dec 4, 2024

copy-pr-bot bot commented Dec 4, 2024

Auto-sync is disabled for draft pull requests in this repository. Workflows must be run manually.

Contributors can view more details about this message here.

@github-actions bot added the Python (Affects Python cuDF API.) and cudf.polars (Issues specific to cudf.polars) labels on Dec 4, 2024
@rjzamora
Member Author

rjzamora commented Dec 4, 2024

/ok to test

@rjzamora rjzamora marked this pull request as ready for review December 6, 2024 14:16
@rjzamora rjzamora requested a review from a team as a code owner December 6, 2024 14:16
@rjzamora rjzamora requested review from bdice and Matt711 December 6, 2024 14:16
@rjzamora rjzamora changed the title [WIP] Add multi-partition Join support to cuDF-Polars Add multi-partition Join support to cuDF-Polars Jan 11, 2025
@rjzamora rjzamora changed the base branch from branch-25.02 to branch-25.04 January 27, 2025 17:30
rapids-bot bot pushed a commit that referenced this pull request Jan 29, 2025
This PR pulls out the `Shuffle` logic from #17518 to simplify the review process.

The goal is to establish the shuffle groundwork for multi-partition `Join` and `Sort` operations.

Authors:
  - Richard (Rick) Zamora (https://github.com/rjzamora)

Approvers:
  - Lawrence Mitchell (https://github.com/wence-)

URL: #17744
Contributor

@Matt711 Matt711 left a comment

Thanks for all your work! I have some small questions/suggestions.

Contributor

@Matt711 Matt711 left a comment

Thanks @rjzamora! This PR looks good to me, but I would like to get @wence-'s thoughts too.

@rjzamora
Member Author

I would like to get @wence-'s thoughts too.

Agree - Any thoughts here @wence- ? :)

Contributor

@wence- wence- left a comment

I think this looks pretty good. It would be good to have a few comments describing the algorithm for the join cases that are not inner joins (why the shuffle of both sides is needed, etc.).
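
To make the non-inner case concrete, here is a small plain-Python sketch of a left join, not the PR's implementation. The helpers `left_join` and `hash_partition` are hypothetical. It compares broadcasting the preserved (left) side against shuffling both sides on the join key.

```python
from collections import defaultdict


def left_join(left, right):
    """Left join of (key, value) rows; unmatched left rows get None."""
    lookup = defaultdict(list)
    for key, value in right:
        lookup[key].append(value)
    out = []
    for key, value in left:
        matches = lookup.get(key)
        if matches:
            out.extend((key, value, m) for m in matches)
        else:
            out.append((key, value, None))
    return out


def hash_partition(rows, count):
    """Split rows into `count` partitions by the hash of the join key."""
    parts = [[] for _ in range(count)]
    for row in rows:
        parts[hash(row[0]) % count].append(row)
    return parts


left = [(1, "a"), (2, "b"), (3, "c")]
right = [(1, "x"), (2, "y"), (4, "z")]
expected = sorted(left_join(left, right), key=repr)

# Broadcast the preserved (left) side: every right partition sees all of
# `left`, so a left row with no match in *that* partition is emitted with None.
right_parts = hash_partition(right, 2)
broadcast_left = [row for part in right_parts for row in left_join(left, part)]

# Shuffle both sides: matching keys are co-located, and each left row is
# processed in exactly one partition.
left_parts = hash_partition(left, 2)
shuffled = [
    row
    for lp, rp in zip(left_parts, right_parts)
    for row in left_join(lp, rp)
]
```

In this toy setup the broadcast variant emits spurious null-padded rows because each partition decides "unmatched" on its own, while the shuffled variant reproduces the single-partition result. An inner join does not have this problem, since it never emits unmatched rows.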

shuffle_options,
output_count,
)
new_node = ir.reconstruct([left, right])
Contributor

nit: I think we should implement short-circuiting for equal children in reconstruct (so that ir.reconstruct(ir.children) just returns ir)
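
A minimal sketch of the suggested short-circuit, using a stand-in `Node` class rather than the real cuDF-Polars `IR` type. The class, its fields, and the choice to compare children by identity (instead of equality) are assumptions made for illustration.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class Node:
    """Hypothetical stand-in for an IR node with immutable children."""

    name: str
    children: tuple[Node, ...] = ()

    def reconstruct(self, children):
        children = tuple(children)
        # Short-circuit: if every child is the same object as before,
        # nothing changed, so return this node instead of building a copy.
        if len(children) == len(self.children) and all(
            new is old for new, old in zip(children, self.children)
        ):
            return self
        return Node(self.name, children)


left, right, other = Node("left"), Node("right"), Node("other")
join = Node("join", (left, right))

same = join.reconstruct(join.children)      # returns `join` itself
changed = join.reconstruct([left, other])   # builds a new node
```

Returning the original object avoids allocating an identical node whenever a rewrite pass leaves the children untouched.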

Comment on lines 107 to 111
# Avoid the broadcast if the "large" table is already shuffled
other_shuffled = (
    partition_info[other].partitioned_on == other_on
    and partition_info[other].count == output_count
)
Contributor

question: Is this overly restrictive? For example, suppose I have an already shuffled (large) left frame and a right frame that is 100 rows. It seems like in that scenario I would probably not want to shuffle the right frame, but rather broadcast it. Even though the left frame is already shuffled.

Member Author

Good question - I'm skeptical that this is actually restrictive.

If our large table is already shuffled, then we almost certainly want to hash-partition the smaller table and send off the necessary pieces to each partition of the large table. I don't think we gain much of a performance advantage by skipping the part where we hash-partition the small table. We do, however, lose our ability to avoid shuffling for follow-up joins on the same columns.

I suppose we may still want to perform a broadcast join if the small table is a single partition, and we are not doing any additional joins on the same columns? Even in that case, the performance is probably comparable.
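
The trade-off discussed in this thread can be sketched as a small decision function. This is an illustration under stated assumptions, not the PR's code: the `PartitionInfo` dataclass here is a stand-in that only mirrors the `count` and `partitioned_on` attributes visible in the quoted snippet, and `choose_strategy` and `broadcast_limit` are hypothetical names.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PartitionInfo:
    """Stand-in carrying only the two attributes used in the quoted check."""

    count: int
    partitioned_on: tuple = ()


def choose_strategy(small, large, large_on, output_count, broadcast_limit=1):
    """Pick "broadcast" or "shuffle" for an inner join (hypothetical helper)."""
    # Same condition as the quoted snippet: the large side is already
    # hash-partitioned on the join keys with the desired partition count.
    other_shuffled = (
        large.partitioned_on == large_on and large.count == output_count
    )
    if small.count <= broadcast_limit and not other_shuffled:
        return "broadcast"
    # Shuffling the small side keeps the output partitioned on the join
    # keys, so a follow-up join on the same columns can skip this step.
    return "shuffle"


small = PartitionInfo(count=1)
unshuffled_large = PartitionInfo(count=8)
shuffled_large = PartitionInfo(count=8, partitioned_on=("key",))

first = choose_strategy(small, unshuffled_large, ("key",), output_count=8)
second = choose_strategy(small, shuffled_large, ("key",), output_count=8)
```

Under these assumptions, `first` selects a broadcast because the large table has no useful partitioning yet, while `second` selects a shuffle of the small table to preserve the existing partitioning.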

) -> tuple[IR, MutableMapping[IR, PartitionInfo]]:
how = ir.options[0].lower()
if how != "inner":
shuffle_options: dict[str, Any] = {}
Contributor

Is this future-proofing?

Member Author

Yeah, exactly. I suppose I'm expecting that we will eventually extract shuffle options during Join/Sort/GroupBy translation to enable/disable use of the shuffle service. That said, I think a lot will change when we introduce a shuffle-service Join.

Contributor

@wence- wence- left a comment

Thanks Rick, looks good!

@rjzamora added the 5 - Ready to Merge (Testing and reviews complete, ready to merge) label and removed the 2 - In Progress (Currently a work in progress) label on Feb 3, 2025
@rjzamora
Member Author

rjzamora commented Feb 4, 2025

/merge

@rapids-bot rapids-bot bot merged commit a477a6b into rapidsai:branch-25.04 Feb 4, 2025
107 checks passed
@rjzamora rjzamora deleted the cudf-polars-multi-join branch February 4, 2025 21:25
Labels
5 - Ready to Merge (Testing and reviews complete, ready to merge); cudf.polars (Issues specific to cudf.polars); feature request (New feature or request); non-breaking (Non-breaking change); Python (Affects Python cuDF API.)
Projects
Status: Done
3 participants