
Partition Set Algorithm Balanced Path #2318


Open
danhoeflinger wants to merge 31 commits into main from dev/dhoeflin/partition_set_algs

Conversation

danhoeflinger
Contributor

Add a partitioning kernel to the set APIs' balanced path algorithm.

Adds a partitioning phase that does a sparse pass over the input data to establish binary search boundaries for the main run. This allows the memory access pattern of the main kernels to fit within the L1 cache when performing the binary searches that establish the balanced path intersections.

This improves performance of the set algorithms at large input sizes. (When combined with #2317, it provides performance improvements at both large and small input sizes for the set algorithms.)
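
To illustrate the idea, here is a minimal host-side sketch (hypothetical names, not the actual kernels): the sparse pass visits only one diagonal per partition and records where the merge path crosses it, so the main pass can restrict each per-diagonal binary search to the window between two neighboring recorded crossings.

#include <algorithm>
#include <cstddef>
#include <vector>

// Sparse partitioning pass over two sorted inputs: for every coarse diagonal
// (one per partition of partition_size fine diagonals, each diagonal_spacing
// elements apart) record how many elements of `a` lie below the merge-path
// crossing. The main pass then searches only between neighboring crossings,
// a window small enough to stay resident in L1.
std::vector<std::size_t>
partition_merge_path(const std::vector<int>& a, const std::vector<int>& b,
                     std::size_t diagonal_spacing, std::size_t partition_size)
{
    const std::size_t total = a.size() + b.size();
    const std::size_t stride = diagonal_spacing * partition_size;
    const std::size_t num_partitions = (total + stride - 1) / stride;
    std::vector<std::size_t> bounds(num_partitions + 1);
    for (std::size_t p = 0; p <= num_partitions; ++p)
    {
        const std::size_t diag = std::min(p * stride, total);
        // Find the split i on this diagonal: a[0..i) and b[0..diag - i) are
        // exactly the elements that precede the crossing in merge order.
        std::size_t lo = diag > b.size() ? diag - b.size() : 0;
        std::size_t hi = std::min(diag, a.size());
        while (lo < hi)
        {
            const std::size_t mid = lo + (hi - lo) / 2;
            if (a[mid] <= b[diag - mid - 1])
                lo = mid + 1;
            else
                hi = mid;
        }
        bounds[p] = lo;
    }
    return bounds;
}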

@danhoeflinger danhoeflinger requested a review from Copilot June 20, 2025 18:42

@Copilot Copilot AI left a comment


Pull Request Overview

This PR introduces a partitioning phase to the set algorithms, enhancing performance for large input sizes by establishing binary search boundaries that optimize cache usage. Key changes include updating the __gen_set_balanced_path template to accept an additional bounds provider parameter, adding new helper functions (__decode_balanced_path_temp_data, __encode_balanced_path_temp_data) for balanced path processing, and integrating a new partition kernel for the balanced path phase.
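
As a rough illustration of what such encode/decode helpers can look like (hypothetical free functions, not the actual oneDPL helpers), the temporary value can carry the balanced-path position in the positive range of the type while the sign bit doubles as the "star" flag:

#include <cstdint>
#include <utility>

// Hypothetical packing scheme: the balanced-path position is non-negative and
// fits in the positive range of the type, leaving the sign bit free for the
// star flag.
inline std::int64_t
encode_balanced_path(std::int64_t position, bool star)
{
    return star ? -(position + 1) : position;
}

inline std::pair<std::int64_t, bool>
decode_balanced_path(std::int64_t packed)
{
    return packed < 0 ? std::pair<std::int64_t, bool>{-packed - 1, true}
                      : std::pair<std::int64_t, bool>{packed, false};
}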

Reviewed Changes

Copilot reviewed 4 out of 4 changed files in this pull request and generated no comments.

File / Description
test/general/implementation_details/device_copyable.pass.cpp: Updated static_asserts to include the new bounds provider parameter
include/oneapi/dpl/pstl/hetero/dpcpp/sycl_traits.h: Modified the __gen_set_balanced_path specialization to include _BoundsProvider
include/oneapi/dpl/pstl/hetero/dpcpp/parallel_backend_sycl_reduce_then_scan.h: Added helper functions and modified the balanced path computation to include partitioning support and to safeguard against out-of-bounds element access
include/oneapi/dpl/pstl/hetero/dpcpp/parallel_backend_sycl.h: Updated the __parallel_set_reduce_then_scan integration with the new bounds provider and partitioning kernel
Comments suppressed due to low confidence (4)

include/oneapi/dpl/pstl/hetero/dpcpp/sycl_traits.h:458

  • The specialization of __gen_set_balanced_path now includes the _BoundsProvider parameter; please verify that all downstream usages are updated accordingly to maintain consistent API behavior.
template <typename _SetOpCount, typename _BoundsProvider, typename _Compare>

include/oneapi/dpl/pstl/hetero/dpcpp/parallel_backend_sycl.h:1086

  • [nitpick] When constructing _GenReduceInput with the new _BoundsProvider, it would be helpful to document the role of __diagonal_spacing and __partition_size in determining partition sizes, ensuring that readers understand how these values impact performance.
                                       _BoundsProvider{__diagonal_spacing, __partition_size}, __comp};

include/oneapi/dpl/pstl/hetero/dpcpp/parallel_backend_sycl_reduce_then_scan.h:698

  • The change from returning 0 to clamping __i_elem to __rng1.size() + __rng2.size() - 1 may affect the algorithm's edge-case handling; please confirm that this adjusted behavior correctly reflects the intended semantics.
        if (__i_elem >= __rng1.size() + __rng2.size())

include/oneapi/dpl/pstl/hetero/dpcpp/parallel_backend_sycl_reduce_then_scan.h:821

  • [nitpick] Retrieving __tile_size from __gen_input.__get_bounds is critical for partitioning; please ensure that __tile_size is always correctly initialized and consistent for various input sizes to avoid unexpected partition boundaries.
        std::size_t __tile_size = __gen_input.__get_bounds.__tile_size;
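
Regarding the clamping comment above (parallel_backend_sycl_reduce_then_scan.h:698), the adjusted edge-case handling amounts to something like the following sketch: rather than returning 0 for a diagonal index past the end of the combined input, the index is clamped to the last valid position so the final, partial diagonal still yields a usable balanced-path bound.

// Sketch of the clamp (not the literal diff):
if (__i_elem >= __rng1.size() + __rng2.size())
    __i_elem = __rng1.size() + __rng2.size() - 1;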

__parallel_set_balanced_path_partition(sycl::queue& __q, _InRng&& __in_rng, std::size_t __num_diagonals,
_GenReduceInput __gen_reduce_input)
Contributor


Suggested change
__parallel_set_balanced_path_partition(sycl::queue& __q, _InRng&& __in_rng, std::size_t __num_diagonals,
_GenReduceInput __gen_reduce_input)
__parallel_set_balanced_path_partition(sycl::queue& __q, const _InRng& __in_rng, std::size_t __num_diagonals,
_GenReduceInput __gen_reduce_input)

Contributor Author


I'm not sure I understand why you want to change this to const lvalue ref rather than forwarding.

Contributor

@SergeyKopienko SergeyKopienko Jul 15, 2025


As far as I understand, __in_rng describes source data which is const and can't be modified inside this function. So I think it is better to express this semantic with a const reference rather than a universal reference.
As evidence of this point, let's take a look at how you call __parallel_set_balanced_path_partition:

    if (__total_size >= __partition_threshold)
    {
        __partition_event =
            __parallel_set_balanced_path_partition<_CustomName>(__q, __in_rng, __num_diagonals, __gen_reduce_input);
    }

What happens inside:

template <typename _CustomName, typename _InRng, typename _GenReduceInput>
sycl::event
__parallel_set_balanced_path_partition(sycl::queue& __q, _InRng&& __in_rng, std::size_t __num_diagonals,
                                       _GenReduceInput __gen_reduce_input)
{
    //...
    return __partition_submitter(__q, std::forward<_InRng>(__in_rng), __num_diagonals);
}
template <typename _GenInput, typename _KernelName>
struct __partition_set_balanced_path_submitter;
template <typename _GenInput, typename... _KernelName>
struct __partition_set_balanced_path_submitter<_GenInput, __internal::__optional_kernel_name<_KernelName...>>
{
    template <typename _InRng>
    sycl::event
    operator()(sycl::queue& __q, _InRng&& __in_rng, std::size_t __num_diagonals) const
    {
        //...
                    __gen_input.__calc_partition_bounds(__in_rng, __id);
        //...
    }
    //...
};

And then we end up in

    //Entry point for partitioning phase
    template <typename _InRng, typename _IndexT>
    void
    __calc_partition_bounds(const _InRng& __in_rng, _IndexT __id) const
    {
        calc_and_store_balanced_path(__in_rng, __id, oneapi::dpl::__par_backend_hetero::__get_bounds_simple{});
    }

So I think this chain of calls shows that a universal reference for __in_rng doesn't make sense, and that const _InRng& __in_rng is correct throughout this chain of calls.
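
As a small generic illustration of the point (not the oneDPL code): when a function only reads from the range, a const lvalue reference expresses that contract and still binds to both lvalue and rvalue arguments, so perfect forwarding buys nothing here.

#include <cstddef>
#include <vector>

// Read-only use: const& binds to lvalues and rvalues alike and documents that
// the callee never modifies or moves from the range.
template <typename Range>
std::size_t
read_only_size(const Range& rng)
{
    return rng.size();
}

int main()
{
    std::vector<int> v{1, 2, 3};
    std::size_t a = read_only_size(v);                      // lvalue argument
    std::size_t b = read_only_size(std::vector<int>{4, 5}); // rvalue argument
    return static_cast<int>(a + b);
}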

Contributor


If you want, you may take a look at PR #2346 and merge it into your branch.

Contributor

@mmichel11 mmichel11 left a comment


I have made a first pass over the implementation. I like how the patch unifies handling between the partitioned and non-partitioned bounds.
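
To make that unification concrete, here is a minimal sketch (hypothetical types, not the oneDPL classes): both providers hand the balanced-path search a [lower, upper) window into the first input, with the simple provider covering the whole range and the partitioned provider covering only the slice between two coarse bounds recorded by the partition kernel.

#include <cstddef>
#include <vector>

struct search_window { std::size_t lower, upper; };

// Non-partitioned case: every diagonal searches the whole first input.
struct simple_bounds
{
    std::size_t size1;
    search_window operator()(std::size_t) const { return {0, size1}; }
};

// Partitioned case: each diagonal searches only between the two coarse bounds
// recorded for its partition by the sparse partitioning pass.
struct partitioned_bounds
{
    const std::vector<std::size_t>& coarse; // one entry per partition edge
    std::size_t diagonals_per_partition;
    search_window operator()(std::size_t diagonal) const
    {
        const std::size_t p = diagonal / diagonals_per_partition;
        return {coarse[p], coarse[p + 1]};
    }
};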

// Should be safe to use the type of the range size as the temporary type. Diagonal index will fit in the positive
// portion of the range so star flag can use sign bit.
using _TemporaryType = decltype(__rng1.size());
//TODO: limit to diagonals per block, and only write to a block based index of temporary data
oneapi::dpl::__par_backend_hetero::__buffer<_TemporaryType> __temp_diags(__num_diagonals);

constexpr std::uint32_t __average_input_ele_size = (sizeof(_In1ValueT) + sizeof(_In2ValueT)) / 2;
// Partition into blocks of half SLM size
Contributor


Is this an empirically determined value and did you experiment with larger partition sizes?

My understanding is that on most architectures SLM is carved out of L1. Since we are trying to cache in L1, it may be possible to partition larger than the max SLM size and still have good caching, so long as the L1 is large enough. Of course, this is device-specific, and we do not want to overtune for a single device.
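
For reference, the half-SLM heuristic in the quoted code can be pictured roughly like this (hypothetical helper built on the standard SYCL 2020 device query):

#include <cstddef>
#include <sycl/sycl.hpp>

// Rough illustration: derive a partition size in elements from the device's
// local memory capacity, targeting half of SLM so the binary-search working
// set has a good chance of staying L1-resident.
std::size_t
approx_partition_elements(const sycl::queue& q, std::size_t average_input_ele_size)
{
    const std::size_t slm_bytes =
        q.get_device().get_info<sycl::info::device::local_mem_size>();
    return (slm_bytes / 2) / average_input_ele_size;
}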

Contributor Author


I played with a few options here, and this seemed to be the best option empirically. I can revisit and confirm with a more final version, though.

using _GenScanInput =
oneapi::dpl::__par_backend_hetero::__gen_set_op_from_known_balanced_path<_SetOperation, _TempData, _Compare>;
using _ScanInputTransform = oneapi::dpl::__par_backend_hetero::__get_zeroth_element;
using _WriteOp = oneapi::dpl::__par_backend_hetero::__write_multiple_to_id<oneapi::dpl::__internal::__pstl_assign>;

const std::int32_t __num_diagonals =
oneapi::dpl::__internal::__dpl_ceiling_div(__rng1.size() + __rng2.size(), __diagonal_spacing);

const std::uint32_t __work_group_size = __get_reduce_then_scan_workgroup_size(__q);
Contributor


Why do we calculate the work group size here instead of within __parallel_transform_reduce_then_scan? It looks to me like we could do this in __parallel_transform_reduce_then_scan and remove its __work_group_size parameter.

Contributor Author


It's a good question. I originally pulled this out to use in the calculation of the partitioning size.
I'm going to explore partition size a bit more. If it is not useful there, then I will fold it back into __parallel_transform_reduce_then_scan.
Thanks for calling this out.

@danhoeflinger danhoeflinger force-pushed the dev/dhoeflin/partition_set_algs branch from 4e4ca84 to b3c9236 on July 14, 2025 20:24