-
Notifications
You must be signed in to change notification settings - Fork 914
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Performance improvements and simplifications for fixed size row-based rolling windows #17623
base: branch-25.02
Are you sure you want to change the base?
Conversation
This way, the responsibility of a producing an iterator that is in bounds falls on the caller, and we don't have to do extra work in cases where the input is guaranteed to be in bounds already.
When passing column_view objects for preceding and following windows, the caller is responsible for ensuring that any resulting indexing is in-bounds.
The caller must provide in-bounds columns for the preceding and following windows.
Now that the requirement for producing in-bounds data has moved to the caller, we must adapt the tests to produce the correct inputs.
These invariants are guaranteed by the constructor.
Now neither the ungrouped nor the grouped fixed size rolling window calculations need to materialise the preceding and following columns.
Sits on top of #17613, so don't merge until that is in and this is rebased/has trunk back-merged. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Signposts
# ################################################################################################## | ||
# * rolling benchmark | ||
# --------------------------------------------------------------------------------- | ||
ConfigureNVBench(ROLLING_NVBENCH rolling/grouped_rolling_sum.cpp rolling/rolling_sum.cpp) | ||
|
||
add_custom_target( | ||
run_benchmarks | ||
DEPENDS CUDF_BENCHMARKS |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ignore this, or review in #17613.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Part of #17613
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Part of #17613
@@ -1010,7 +1010,7 @@ class rolling_aggregation_postprocessor final : public cudf::detail::aggregation | |||
* @param[out] output_valid_count Output count of valid values | |||
* @param[in] device_operator The operator used to perform a single window operation | |||
* @param[in] preceding_window_begin Rolling window size iterator, accumulates from | |||
* in_col[i-preceding_window] to in_col[i] inclusive | |||
* in_col[i-preceding_window + 1] to in_col[i] inclusive |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Fix off by one.
|
||
// aggregate | ||
// TODO: We should explore using shared memory to avoid redundant loads. | ||
// This might require separating the kernel into a special version | ||
// for dynamic and static sizes. | ||
|
||
volatile bool output_is_valid = false; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this is a leftover from a cuda 10-era bug.
namespace utils = cudf::detail::rolling; | ||
auto groups = utils::grouped{group_labels.data(), group_offsets.data()}; | ||
auto preceding = | ||
utils::make_clamped_window_iterator<utils::direction::PRECEDING>(preceding_window, groups); | ||
auto following = | ||
utils::make_clamped_window_iterator<utils::direction::FOLLOWING>(following_window, groups); | ||
return cudf::detail::rolling_window( | ||
input, default_outputs, preceding, following, min_periods, aggr, stream, mr); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Using the new utilities, so we can delete a bunch of code.
@@ -660,7 +661,7 @@ TEST_F(RollingErrorTest, WindowArraySizeMismatch) | |||
cudf::test::fixed_width_column_wrapper<cudf::size_type> input( | |||
col_data.begin(), col_data.end(), col_valid.begin()); | |||
|
|||
std::vector<cudf::size_type> five({2, 1, 2, 1, 4}); | |||
std::vector<cudf::size_type> five({1, 1, 2, 1, 0}); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Needs adapted to conform to new caller requirement.
// this is a special test to check the volatile count variable issue (see rolling.cu for detail) | ||
TYPED_TEST(RollingTest, VolatileCount) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
All those volatiles are long-gone (cuda 10.1 bug)
auto it = thrust::make_counting_iterator<cudf::size_type>(0); | ||
std::transform(it, it + num_rows, preceding_window.begin(), [&window_rng, num_rows](auto i) { | ||
auto p = window_rng.generate(); | ||
return std::min(i + 1, std::max(p, i + 1 - num_rows)); | ||
}); | ||
std::transform(it, it + num_rows, following_window.begin(), [&window_rng, num_rows](auto i) { | ||
auto f = window_rng.generate(); | ||
return std::max(-i - 1, std::min(f, num_rows - i - 1)); | ||
}); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ensuring the randomly generated values conform to interface requirements
auto it = thrust::make_counting_iterator<cudf::size_type>(0); | ||
std::transform(it, it + num_rows, preceding_window.begin(), [&window_rng, num_rows](auto i) { | ||
auto p = window_rng.generate(); | ||
return std::min(i + 1, std::max(p, i + 1 - num_rows)); | ||
}); | ||
std::transform(it, it + num_rows, following_window.begin(), [&window_rng, num_rows](auto i) { | ||
auto f = window_rng.generate(); | ||
return std::max(-i - 1, std::min(f, num_rows - i - 1)); | ||
}); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
And again.
Note that I haven't touched the jit rolling window kernel yet (some of the same changes will apply there eventually). @mythrocks This changes the API contract for to rolling_window, so spark may need to adapt. |
I'm trying to work out what of this has impact on |
I think this is a call into cudf/cpp/src/rolling/grouped_rolling.cu Lines 47 to 55 in 030ae7a
If so, those fixed offsets are still clamped to the group bounds, via a transform iterator: cudf/cpp/src/rolling/grouped_rolling.cu Lines 127 to 133 in 030ae7a
|
That's what I thought. It seems to me like this really ought to work for I'll try have a go at integrating this with |
Description
Having introduced benchmarks of rolling window performance in #17613, we can now look at what happens when cleaning up some the implementation.
Breaking change
Previously, the most general "variable-sized" rolling window function applied a transform iterator (sometimes multiple times) to the user-provided
preceding
andfollowing
columns. We now require, and document, that the user provides values that are always in bounds.This simplifies the book-keeping, and allows us to remove some (now redundant) checks from the rolling kernel.
Performance improvements
In various places, we accept fixed size integers as row-based offsets for rolling windows. We used to materialise these into columns representing the clamped
preceding
andfollowing
columns. However, we did so inconsistently with respect to which side we would clamp (after the change to allow negative window offsets).To fix this, rationalise this clamping in one place, and just use a transform iterator. We now do this consistently for both grouped and ungrouped fixed-size windows.
Benchmark results
Comparing da4accc (the tip of #17613) and this branch:
nvbench compare output
Headline numbers:
All the data
latest/_deps/nvbench-src/scripts/nvbench_compare.py da4accc.json c3973aa36c373e013c6c90f8d7bd58fe0aed6880.json
['da4acccf6b76a93b1e9395b51950a17c45da99c3.json', 'c3973aa36c373e013c6c90f8d7bd58fe0aed6880.json']
row_grouped_rolling_sum
[0] NVIDIA RTX A6000
row_fixed_rolling_sum
[0] NVIDIA RTX A6000
row_variable_rolling_sum
[0] NVIDIA RTX A6000
Summary
Before
Benchmark Results
row_grouped_rolling_sum
[0] NVIDIA RTX A6000
row_fixed_rolling_sum
[0] NVIDIA RTX A6000
row_variable_rolling_sum
[0] NVIDIA RTX A6000
After
Benchmark Results
row_grouped_rolling_sum
[0] NVIDIA RTX A6000
row_fixed_rolling_sum
[0] NVIDIA RTX A6000
row_variable_rolling_sum
[0] NVIDIA RTX A6000
Checklist