POC: Implement HOST_UDF aggregations #17249
base: branch-24.12
Signed-off-by: Nghia Truong <[email protected]>
cpp/tests/groupby/host_udf_tests.cu
cudf::data_type{cudf::type_id::INT32}, values.size(), cudf::mask_state::UNALLOCATED, stream);
thrust::transform(rmm::exec_policy(stream),
                  thrust::make_counting_iterator(0),
                  thrust::make_counting_iterator(values.size()),
Could we have some examples that actually do an aggregation? This is producing an output for each input value, but aggregations are supposed to produce an output for each group. Right?
Yes, this is just a simple transformation. The input parameters carry enough information (group offsets and labels), so we can easily implement any computation we want on the group values.
I'm adding such examples now.
Now the examples compute values for each individual group. For example:
For each group: compute (group_idx + 1) * values^2 * 2
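To make "one output per group" concrete, here is a minimal host-only sketch, not the test code in this PR. It uses `std::vector` instead of device columns, and `aggregate_per_group` is a hypothetical name. The formula above does not say how a group's values are combined, so the sketch assumes a sum of squares.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Host-only stand-in for a grouped aggregation (hypothetical helper).
// `values` is assumed to be sorted by group, and `offsets` has
// num_groups + 1 entries, so group g owns values[offsets[g] .. offsets[g + 1]).
std::vector<std::int64_t> aggregate_per_group(std::vector<int> const& values,
                                              std::vector<int> const& offsets)
{
  assert(!offsets.empty());
  assert(static_cast<std::size_t>(offsets.back()) == values.size());

  std::size_t const num_groups = offsets.size() - 1;
  std::vector<std::int64_t> output(num_groups);

  for (std::size_t g = 0; g < num_groups; ++g) {
    std::int64_t sum_sqr = 0;
    for (int i = offsets[g]; i < offsets[g + 1]; ++i) {
      sum_sqr += static_cast<std::int64_t>(values[i]) * values[i];
    }
    // Assumed reduction: (group_idx + 1) * sum(values^2) * 2.
    output[g] = static_cast<std::int64_t>(g + 1) * sum_sqr * 2;
  }
  return output;  // one element per group, not one per input row
}
```

The output has `offsets.size() - 1` elements regardless of how many rows are in `values`, which is the property the reviewer asked the examples to demonstrate.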
using host_udf_func_type = std::function<std::unique_ptr<column>(column_view const&,
                                                                 device_span<size_type const>,
                                                                 device_span<size_type const>,
                                                                 size_type,
                                                                 rmm::cuda_stream_view,
                                                                 rmm::device_async_resource_ref)>;
This only passes the group values (the first column parameter). It seems we should also pass the group keys, so that the function has all the group information needed for generic computation.
Signed-off-by: Nghia Truong <[email protected]>
// TODO: Add a name string to the aggregation so that we can look up different host UDFs.
if (cache.has_result(values, agg)) { return; }
suggestion: Why not ask the implementer of the host udf to provide hash and equality, like the other aggregations have?
Providing a name string should be enough for hashing agg here, as we will hash a pair {aggregation::kind, udf_name_str}. That will be much simpler than providing a hash and equality functor.
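A rough sketch of what hashing such a pair could look like, using standard-library types only. `agg_kind`, `host_udf_key`, and `host_udf_key_hash` are hypothetical names for illustration, not the cudf types, and the hash-combine step is a common idiom rather than anything this PR prescribes.

```cpp
#include <cstddef>
#include <functional>
#include <string>
#include <unordered_map>

// Hypothetical stand-in for aggregation::kind.
enum class agg_kind { SUM, HOST_UDF };

// Key made of the pair {kind, udf_name}; two host UDFs are treated as the
// same aggregation only if both parts match.
struct host_udf_key {
  agg_kind kind;
  std::string name;

  bool operator==(host_udf_key const& other) const
  {
    return kind == other.kind && name == other.name;
  }
};

// Combines the hashes of both parts of the pair.
struct host_udf_key_hash {
  std::size_t operator()(host_udf_key const& key) const
  {
    std::size_t const h1 = std::hash<int>{}(static_cast<int>(key.kind));
    std::size_t const h2 = std::hash<std::string>{}(key.name);
    return h1 ^ (h2 + 0x9e3779b9 + (h1 << 6) + (h1 >> 2));
  }
};

// Example of a result cache keyed by that pair (the mapped type is a placeholder).
using host_udf_cache = std::unordered_map<host_udf_key, int, host_udf_key_hash>;
```

With this shape the implementer of a host UDF only supplies a unique name. The trade-off is that two different functions registered under the same name would be treated as the same aggregation.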
This implements HOST_UDF aggregations, which allow aggregations to call externally implemented functions. Closes #16633.
Usage
Create a HOST_UDF aggregation instance whose parameter is a function pointer to the externally implemented function:
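As a rough illustration of that pattern, the sketch below uses host-only stand-ins rather than the cudf types. The `column` alias, the `host_udf_aggregation` struct, and `group_sum` are hypothetical. The signature in this PR takes a `column_view`, two `device_span`s, a stream, and a memory resource instead.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <memory>
#include <vector>

// Hypothetical host-only stand-in for a column.
using column = std::vector<int>;

// Simplified analogue of host_udf_func_type: values, group offsets,
// group labels, and the number of groups.
using host_udf_func_type =
  std::function<std::unique_ptr<column>(column const&,
                                        std::vector<int> const&,
                                        std::vector<int> const&,
                                        int)>;

// Hypothetical aggregation object that only stores the user's function.
struct host_udf_aggregation {
  host_udf_func_type udf;
};

// Example user function: sum the values of each group using the offsets.
std::unique_ptr<column> group_sum(column const& values,
                                  std::vector<int> const& offsets,
                                  std::vector<int> const& /*labels*/,
                                  int num_groups)
{
  assert(offsets.size() == static_cast<std::size_t>(num_groups) + 1);
  auto output = std::make_unique<column>(num_groups, 0);
  for (int g = 0; g < num_groups; ++g) {
    for (int i = offsets[g]; i < offsets[g + 1]; ++i) {
      (*output)[g] += values[i];
    }
  }
  return output;
}
```

Under these assumptions, `host_udf_aggregation agg{&group_sum};` wraps the function pointer, and a groupby implementation would later invoke `agg.udf(values, offsets, labels, num_groups)` once per request.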