POC: Implement HOST_UDF aggregations #17249

Draft
wants to merge 3 commits into
base: branch-24.12

ttnghia
Contributor

@ttnghia ttnghia commented Nov 5, 2024

This implements HOST_UDF aggregations, which allow cudf aggregations to call externally implemented functions.

Closes #16633.


Usage

  1. Define a function with this signature:
std::unique_ptr<cudf::column> compute_aggregation(cudf::column_view const& values,
                                                  cudf::device_span<cudf::size_type const> group_offsets,
                                                  cudf::device_span<cudf::size_type const> group_labels,
                                                  cudf::size_type num_groups,
                                                  rmm::cuda_stream_view stream,
                                                  rmm::device_async_resource_ref mr);
  2. Create a HOST_UDF aggregation instance, passing a pointer to that function as its parameter:
auto agg = cudf::make_host_udf_aggregation<cudf::groupby_aggregation>(compute_aggregation);
  3. Perform cudf aggregation operations using the created aggregation instance.
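
The callback pattern in the steps above can be sketched on the host with standard C++ only. This is a rough analogue, not the cudf API: `toy_udf_type`, `toy_host_udf_aggregation`, `make_toy_host_udf_aggregation`, and `group_max` are hypothetical names, plain `std::vector` stands in for `cudf::column_view` and `cudf::device_span`, and the `stream` and `mr` parameters are omitted. It also assumes that `group_offsets` holds `num_groups + 1` entries and that no group is empty.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <functional>
#include <utility>
#include <vector>

// Hypothetical host-side stand-in for the UDF signature (not a cudf type).
using toy_udf_type = std::function<std::vector<int>(std::vector<int> const& values,
                                                    std::vector<int> const& group_offsets,
                                                    std::vector<int> const& group_labels,
                                                    int num_groups)>;

// Hypothetical wrapper playing the role of the aggregation instance.
struct toy_host_udf_aggregation {
  toy_udf_type udf;
};

// Hypothetical factory: stores the user function so it can be invoked later.
toy_host_udf_aggregation make_toy_host_udf_aggregation(toy_udf_type udf)
{
  assert(udf);  // an empty std::function cannot be called
  return toy_host_udf_aggregation{std::move(udf)};
}

// Example user function: the maximum of each group.
// Assumes group_offsets has num_groups + 1 entries and every group is non-empty.
std::vector<int> group_max(std::vector<int> const& values,
                           std::vector<int> const& group_offsets,
                           std::vector<int> const& /*group_labels*/,
                           int num_groups)
{
  assert(group_offsets.size() == static_cast<std::size_t>(num_groups) + 1);
  std::vector<int> result(static_cast<std::size_t>(num_groups));
  for (int g = 0; g < num_groups; ++g) {
    assert(group_offsets[g] < group_offsets[g + 1]);
    int best = values[group_offsets[g]];
    for (int i = group_offsets[g] + 1; i < group_offsets[g + 1]; ++i) {
      best = std::max(best, values[i]);
    }
    result[g] = best;
  }
  return result;
}
```

Calling `make_toy_host_udf_aggregation(group_max)` and then invoking the stored `udf` member mirrors steps 2 and 3: the function is registered once and run later by whatever code owns the aggregation.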

Signed-off-by: Nghia Truong <[email protected]>
@ttnghia ttnghia self-assigned this Nov 5, 2024
@github-actions bot added labels: libcudf (Affects libcudf (C++/CUDA) code), CMake (CMake build issue). Nov 5, 2024
@ttnghia added labels: feature request (New feature or request), 2 - In Progress (Currently a work in progress), Spark (Functionality that helps Spark RAPIDS); removed label: CMake (CMake build issue). Nov 5, 2024
@ttnghia
Contributor Author

ttnghia commented Nov 5, 2024

cudf::data_type{cudf::type_id::INT32}, values.size(), cudf::mask_state::UNALLOCATED, stream);
thrust::transform(rmm::exec_policy(stream),
thrust::make_counting_iterator(0),
thrust::make_counting_iterator(values.size()),
Contributor

Could we have some examples that actually do an aggregation? This is producing an output for each input value, but aggregations are supposed to produce an output for each group. Right?

Contributor Author

Yes, this is just a simple transformation. The input parameters carry enough information (group offsets and labels) that we can easily implement any computation we want on the group values.
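
To illustrate why offsets and labels are sufficient, here is a small host-side sketch using plain `std::vector` rather than cudf types. It assumes the values are already sorted by group, that `group_offsets` has one more entry than the number of groups, and that `group_labels[i]` is the group index of value `i`. Those layout details are assumptions about how such parameters are usually laid out, not something this thread states, and `labels_from_offsets` and `group_sum_by_labels` are hypothetical helper names.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Expand group offsets into one label per value.
// Offsets {0, 2, 5} describe two groups: rows [0, 2) and rows [2, 5).
std::vector<int> labels_from_offsets(std::vector<int> const& group_offsets)
{
  assert(!group_offsets.empty());
  std::vector<int> labels;
  labels.reserve(static_cast<std::size_t>(group_offsets.back()));
  for (std::size_t g = 0; g + 1 < group_offsets.size(); ++g) {
    for (int i = group_offsets[g]; i < group_offsets[g + 1]; ++i) {
      labels.push_back(static_cast<int>(g));
    }
  }
  return labels;
}

// Per-group sum driven only by labels: each value is added to its group's slot.
std::vector<long long> group_sum_by_labels(std::vector<int> const& values,
                                           std::vector<int> const& group_labels,
                                           int num_groups)
{
  assert(values.size() == group_labels.size());
  std::vector<long long> sums(static_cast<std::size_t>(num_groups), 0);
  for (std::size_t i = 0; i < values.size(); ++i) {
    assert(group_labels[i] >= 0 && group_labels[i] < num_groups);
    sums[static_cast<std::size_t>(group_labels[i])] += values[i];
  }
  return sums;
}
```

Offsets suit algorithms that walk one group at a time, while labels suit element-wise algorithms where each value only needs to know which output slot it belongs to.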

Contributor Author

I'm adding such examples now.

Contributor Author

@ttnghia ttnghia Nov 5, 2024

The examples now compute a value for each individual group. For example:

For each group: compute (group_idx + 1) * values^2 * 2
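
The formula above does not say how a group's values are reduced to a single output. The host-side sketch below takes one possible reading, which is an assumption rather than the PR's actual test code: sum the squares of the group's values, then multiply by `2 * (group_idx + 1)`. The function name `scaled_sum_of_squares` is hypothetical, and plain `std::vector` replaces the cudf column and span types.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Assumed reading of the formula: result[g] = (g + 1) * sum(v * v for v in group g) * 2.
// group_offsets is assumed to hold num_groups + 1 entries, with values sorted by group.
std::vector<std::int64_t> scaled_sum_of_squares(std::vector<std::int32_t> const& values,
                                                std::vector<std::int32_t> const& group_offsets)
{
  assert(!group_offsets.empty());
  std::size_t const num_groups = group_offsets.size() - 1;
  std::vector<std::int64_t> result(num_groups, 0);
  for (std::size_t g = 0; g < num_groups; ++g) {
    std::int64_t sum_sq = 0;
    for (std::int32_t i = group_offsets[g]; i < group_offsets[g + 1]; ++i) {
      auto const v = static_cast<std::int64_t>(values[static_cast<std::size_t>(i)]);
      sum_sq += v * v;
    }
    // One output per group, scaled by the 1-based group index and a constant 2.
    result[g] = static_cast<std::int64_t>(g + 1) * sum_sq * 2;
  }
  return result;
}
```

The point of the sketch is the output shape: one element per group rather than one per input value, which is what the reviewer asked the examples to demonstrate.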

Comment on lines +787 to +792
using host_udf_func_type = std::function<std::unique_ptr<column>(column_view const&,
                                                                 device_span<size_type const>,
                                                                 device_span<size_type const>,
                                                                 size_type,
                                                                 rmm::cuda_stream_view,
                                                                 rmm::device_async_resource_ref)>;
Contributor Author

@ttnghia ttnghia Nov 5, 2024

This only passes the group values (the first column parameter). We should probably pass the group keys too, so that the function has all the group information needed for generic computation.

@github-actions github-actions bot added the CMake CMake build issue label Nov 5, 2024
Comment on lines +797 to +798
// TODO: Add a name string to the aggregation so that we can look up different host UDFs.
if (cache.has_result(values, agg)) { return; }
Contributor

suggestion: Why not ask the implementer of the host udf to provide hash and equality, like the other aggregations have?

Contributor Author

Providing a name string should be enough for hashing the aggregation here, since we will hash the pair {aggregation::kind, udf_name_str}. That is much simpler than asking for hash and equality functors.
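
A minimal sketch of the name-based approach, in standard C++ only: `udf_cache_key`, `udf_cache_key_hash`, `udf_cache_key_set`, and `register_key` are hypothetical names, and an `int` stands in for `aggregation::kind`. It shows how a {kind, name} pair can supply both the hash and the equality that a result cache needs, without user-provided functors. The hash mixing constant is a common convention, not something taken from libcudf.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <string>
#include <unordered_set>
#include <utility>

// Hypothetical cache key: the aggregation kind plus the UDF's name string.
struct udf_cache_key {
  int kind;              // stand-in for aggregation::kind
  std::string udf_name;  // stand-in for udf_name_str

  bool operator==(udf_cache_key const& other) const
  {
    return kind == other.kind && udf_name == other.udf_name;
  }
};

// Combine the hashes of both members so that equal keys hash equally.
struct udf_cache_key_hash {
  std::size_t operator()(udf_cache_key const& key) const
  {
    std::size_t const h1 = std::hash<int>{}(key.kind);
    std::size_t const h2 = std::hash<std::string>{}(key.udf_name);
    return h1 ^ (h2 + 0x9e3779b9u + (h1 << 6) + (h1 >> 2));
  }
};

using udf_cache_key_set = std::unordered_set<udf_cache_key, udf_cache_key_hash>;

// Returns true if the key is new, false if an equal key was already registered.
bool register_key(udf_cache_key_set& seen, udf_cache_key key)
{
  assert(!key.udf_name.empty());  // an empty name could not tell UDFs apart
  return seen.insert(std::move(key)).second;
}
```

The trade-off is that two different functions registered under the same name would be treated as the same aggregation, so this scheme relies on callers choosing unique names.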

Labels
2 - In Progress Currently a work in progress CMake CMake build issue feature request New feature or request libcudf Affects libcudf (C++/CUDA) code. Spark Functionality that helps Spark RAPIDS
Projects
Status: In Progress
Development

Successfully merging this pull request may close these issues.

[FEA] Allow to run groupby/reduction with externally derived aggregations
3 participants