-
Notifications
You must be signed in to change notification settings - Fork 955
Use stream pool for gather/scatter. #14162
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: branch-25.08
Are you sure you want to change the base?
Conversation
Gather (scatter) summary:
|
This is probably not going to do much for the listed issue. The issue here is the number of raw thrust and kernel calls stemming from nesting (think thousands of columns underneath the top level table). The fix for that is going to be smarter parallelization of what is currently the recursive cpu-side approach. We have some ideas here. |
// only a single column, the fork/join overhead should be avoided. | ||
auto streams = std::vector<rmm::cuda_stream_view>{}; | ||
if (num_columns > 1) { | ||
streams = cudf::detail::fork_streams(stream, num_columns); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I am curious how this works for per-thread default stream. For spark, we build cuDF with PTDS. Will streams
will be have number-of-columns vector of the PTDS stream?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The stream
passed in would be the PTDS stream. Then a stream pool (for that thread) would be created (or reused), the work would be executed across that stream pool, and then the join
step would insert events for all the elements of streams
to be synchronized with stream
before new work on stream
(the PTDS stream in Spark's case) would be runnable.
Auto-sync is disabled for draft pull requests in this repository. Workflows must be run manually. Contributors can view more details about this message here. |
@mythrocks Could you pick this up? I just merged some changes that pull it up-to-date with |
I'm on it, mate. Thank you for this change. I'll post here with any findings. |
// only a single column, the fork/join overhead should be avoided. | ||
auto streams = std::vector<rmm::cuda_stream_view>{}; | ||
if (num_columns > 1) { | ||
streams = cudf::detail::fork_streams(stream, num_columns); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this will emit a warning when there are more columns than streams in the pool. We can remove with warning; otherwise it's probably good to limit the number of streams to the pool size.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is there a good/recommended number? And is it tuned by hand or automatically computed?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's fine to use all streams in the pool, I don't think we need to tune this.
but getting more reuses them in round-robin fashion and we have a warning because maybe it's an unexpected behavior for users.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I knew we did round-robin, I did not know we issued a warning. I think removing the warning would be appropriate.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
opened #18236
} | ||
|
||
auto it = thrust::make_counting_iterator<size_type>(0); | ||
|
||
std::transform(it, it + num_columns, result.begin(), [&](size_type i) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
thrust::tabulate saves a bit of code here
Co-authored-by: MithunR <[email protected]>
I'll retarget this to 25.06. This will be good to have, but it didn't help NVIDIA/spark-rapids#12195. It simply wasn't the slow path, in this case. :/ |
Moved to 25.08. @mythrocks said that he should be able to get this done for the next release. |
Description
This PR uses the stream pool introduced in #13922 to gather/scatter each column in a table on a separate stream.
Related: #13509, which this might resolve (need to verify).
Checklist