
Partition Set Algorithm Balanced Path #2318


Open
danhoeflinger wants to merge 31 commits into main from dev/dhoeflin/partition_set_algs

Conversation

danhoeflinger
Contributor

Add a partitioning kernel to the set APIs' balanced path algorithm.

Adds a partitioning phase that does a sparse pass over the input data to establish binary search boundaries for the main run. This allows the memory access pattern of the main kernels to fit within the L1 cache when performing the binary searches that establish the balanced path intersections.

This improves performance of the set algorithms at large input sizes. (When combined with #2317, it provides performance improvements at both large and small input sizes for the set algorithms.)
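
To illustrate the idea, here is a minimal host-side sketch (hypothetical names, not the actual kernels): the sparse pass visits only one diagonal per partition and records where the merge path crosses it, so the main pass can restrict each per-diagonal binary search to the window between two neighboring recorded crossings.

#include <algorithm>
#include <cstddef>
#include <vector>

// Sparse partitioning pass over two sorted inputs: for every coarse diagonal
// (one per partition of partition_size fine diagonals, each diagonal_spacing
// elements apart) record how many elements of `a` lie below the merge-path
// crossing. The main pass then searches only between neighboring crossings,
// a window small enough to stay resident in L1.
std::vector<std::size_t>
partition_merge_path(const std::vector<int>& a, const std::vector<int>& b,
                     std::size_t diagonal_spacing, std::size_t partition_size)
{
    const std::size_t total = a.size() + b.size();
    const std::size_t stride = diagonal_spacing * partition_size;
    const std::size_t num_partitions = (total + stride - 1) / stride;
    std::vector<std::size_t> bounds(num_partitions + 1);
    for (std::size_t p = 0; p <= num_partitions; ++p)
    {
        const std::size_t diag = std::min(p * stride, total);
        // Find the split i on this diagonal: a[0..i) and b[0..diag - i) are
        // exactly the elements that precede the crossing in merge order.
        std::size_t lo = diag > b.size() ? diag - b.size() : 0;
        std::size_t hi = std::min(diag, a.size());
        while (lo < hi)
        {
            const std::size_t mid = lo + (hi - lo) / 2;
            if (a[mid] <= b[diag - mid - 1])
                lo = mid + 1;
            else
                hi = mid;
        }
        bounds[p] = lo;
    }
    return bounds;
}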

@danhoeflinger danhoeflinger requested a review from Copilot June 20, 2025 18:42

@Copilot Copilot AI left a comment


Pull Request Overview

This PR introduces a partitioning phase to the set algorithms, enhancing performance for large input sizes by establishing binary search boundaries that optimize cache usage. Key changes include updating the __gen_set_balanced_path template to accept an additional bounds provider parameter, adding new helper functions (__decode_balanced_path_temp_data, __encode_balanced_path_temp_data) for balanced path processing, and integrating a new partition kernel for the balanced path phase.
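
As a rough illustration of what such encode/decode helpers can look like (hypothetical free functions, not the actual oneDPL helpers), the temporary value can carry the balanced-path position in the positive range of the type while the sign bit doubles as the "star" flag:

#include <cstdint>
#include <utility>

// Hypothetical packing scheme: the balanced-path position is non-negative and
// fits in the positive range of the type, leaving the sign bit free for the
// star flag.
inline std::int64_t
encode_balanced_path(std::int64_t position, bool star)
{
    return star ? -(position + 1) : position;
}

inline std::pair<std::int64_t, bool>
decode_balanced_path(std::int64_t packed)
{
    return packed < 0 ? std::pair<std::int64_t, bool>{-packed - 1, true}
                      : std::pair<std::int64_t, bool>{packed, false};
}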

Reviewed Changes

Copilot reviewed 4 out of 4 changed files in this pull request and generated no comments.

File / Description
test/general/implementation_details/device_copyable.pass.cpp: Updated static_asserts to include the new bounds provider parameter
include/oneapi/dpl/pstl/hetero/dpcpp/sycl_traits.h: Modified the __gen_set_balanced_path specialization to include _BoundsProvider
include/oneapi/dpl/pstl/hetero/dpcpp/parallel_backend_sycl_reduce_then_scan.h: Added helper functions and modified the balanced path computation to include partitioning support and to safeguard against out-of-bounds element access
include/oneapi/dpl/pstl/hetero/dpcpp/parallel_backend_sycl.h: Updated the __parallel_set_reduce_then_scan integration with the new bounds provider and partitioning kernel
Comments suppressed due to low confidence (4)

include/oneapi/dpl/pstl/hetero/dpcpp/sycl_traits.h:458

  • The specialization of __gen_set_balanced_path now includes the _BoundsProvider parameter; please verify that all downstream usages are updated accordingly to maintain consistent API behavior.
template <typename _SetOpCount, typename _BoundsProvider, typename _Compare>

include/oneapi/dpl/pstl/hetero/dpcpp/parallel_backend_sycl.h:1086

  • [nitpick] When constructing _GenReduceInput with the new _BoundsProvider, it would be helpful to document the role of __diagonal_spacing and __partition_size in determining partition sizes, ensuring that readers understand how these values impact performance.
                                       _BoundsProvider{__diagonal_spacing, __partition_size}, __comp};

include/oneapi/dpl/pstl/hetero/dpcpp/parallel_backend_sycl_reduce_then_scan.h:698

  • The change from returning 0 to clamping __i_elem to __rng1.size() + __rng2.size() - 1 may affect the algorithm's edge-case handling; please confirm that this adjusted behavior correctly reflects the intended semantics.
        if (__i_elem >= __rng1.size() + __rng2.size())

include/oneapi/dpl/pstl/hetero/dpcpp/parallel_backend_sycl_reduce_then_scan.h:821

  • [nitpick] Retrieving __tile_size from __gen_input.__get_bounds is critical for partitioning; please ensure that __tile_size is always correctly initialized and consistent for various input sizes to avoid unexpected partition boundaries.
        std::size_t __tile_size = __gen_input.__get_bounds.__tile_size;
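
Regarding the clamping comment above (parallel_backend_sycl_reduce_then_scan.h:698), the adjusted edge-case handling amounts to something like the following sketch: rather than returning 0 for a diagonal index past the end of the combined input, the index is clamped to the last valid position so the final, partial diagonal still yields a usable balanced-path bound.

// Sketch of the clamp (not the literal diff):
if (__i_elem >= __rng1.size() + __rng2.size())
    __i_elem = __rng1.size() + __rng2.size() - 1;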

__parallel_set_balanced_path_partition(sycl::queue& __q, _InRng&& __in_rng, std::size_t __num_diagonals,
_GenReduceInput __gen_reduce_input)
Contributor


Suggested change
__parallel_set_balanced_path_partition(sycl::queue& __q, _InRng&& __in_rng, std::size_t __num_diagonals,
_GenReduceInput __gen_reduce_input)
__parallel_set_balanced_path_partition(sycl::queue& __q, const _InRng& __in_rng, std::size_t __num_diagonals,
_GenReduceInput __gen_reduce_input)

Contributor Author


I'm not sure I understand why you want to change this to const lvalue ref rather than forwarding.

Contributor

@SergeyKopienko SergeyKopienko Jul 15, 2025


As far as I understand, __in_rng describes source data which is const and can't be modified inside this function. So I think it is better to express this semantic with a const reference rather than a universal reference.
As evidence of this point, let's take a look at how you call __parallel_set_balanced_path_partition:

    if (__total_size >= __partition_threshold)
    {
        __partition_event =
            __parallel_set_balanced_path_partition<_CustomName>(__q, __in_rng, __num_diagonals, __gen_reduce_input);
    }

What happens inside:

template <typename _CustomName, typename _InRng, typename _GenReduceInput>
sycl::event
__parallel_set_balanced_path_partition(sycl::queue& __q, _InRng&& __in_rng, std::size_t __num_diagonals,
                                       _GenReduceInput __gen_reduce_input)
{
    //...
    return __partition_submitter(__q, std::forward<_InRng>(__in_rng), __num_diagonals);
}
template <typename _GenInput, typename _KernelName>
struct __partition_set_balanced_path_submitter;
template <typename _GenInput, typename... _KernelName>
struct __partition_set_balanced_path_submitter<_GenInput, __internal::__optional_kernel_name<_KernelName...>>
{
    template <typename _InRng>
    sycl::event
    operator()(sycl::queue& __q, _InRng&& __in_rng, std::size_t __num_diagonals) const
    {
        //...
                    __gen_input.__calc_partition_bounds(__in_rng, __id);
        //...
    }
    //...
};

And then we end up in

    //Entry point for partitioning phase
    template <typename _InRng, typename _IndexT>
    void
    __calc_partition_bounds(const _InRng& __in_rng, _IndexT __id) const
    {
        calc_and_store_balanced_path(__in_rng, __id, oneapi::dpl::__par_backend_hetero::__get_bounds_simple{});
    }

So I think this chain of calls shows that a universal reference for __in_rng doesn't make sense, and that const _InRng& __in_rng is correct throughout this chain of calls.
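
As a small generic illustration of the point (not the oneDPL code): when a function only reads from the range, a const lvalue reference expresses that contract and still binds to both lvalue and rvalue arguments, so perfect forwarding buys nothing here.

#include <cstddef>
#include <vector>

// Read-only use: const& binds to lvalues and rvalues alike and documents that
// the callee never modifies or moves from the range.
template <typename Range>
std::size_t
read_only_size(const Range& rng)
{
    return rng.size();
}

int main()
{
    std::vector<int> v{1, 2, 3};
    std::size_t a = read_only_size(v);                      // lvalue argument
    std::size_t b = read_only_size(std::vector<int>{4, 5}); // rvalue argument
    return static_cast<int>(a + b);
}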

Contributor


If you want, you may take a look at PR #2346 and merge it into your branch.

Contributor

@mmichel11 mmichel11 left a comment


I have made a first pass over the implementation. I like how the patch unifies handling between the partitioned and non-partitioned bounds.
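
To make that unification concrete, here is a minimal sketch (hypothetical types, not the oneDPL classes): both providers hand the balanced-path search a [lower, upper) window into the first input, with the simple provider covering the whole range and the partitioned provider covering only the slice between two coarse bounds recorded by the partition kernel.

#include <cstddef>
#include <vector>

struct search_window { std::size_t lower, upper; };

// Non-partitioned case: every diagonal searches the whole first input.
struct simple_bounds
{
    std::size_t size1;
    search_window operator()(std::size_t) const { return {0, size1}; }
};

// Partitioned case: each diagonal searches only between the two coarse bounds
// recorded for its partition by the sparse partitioning pass.
struct partitioned_bounds
{
    const std::vector<std::size_t>& coarse; // one entry per partition edge
    std::size_t diagonals_per_partition;
    search_window operator()(std::size_t diagonal) const
    {
        const std::size_t p = diagonal / diagonals_per_partition;
        return {coarse[p], coarse[p + 1]};
    }
};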

// Should be safe to use the type of the range size as the temporary type. Diagonal index will fit in the positive
// portion of the range so star flag can use sign bit.
using _TemporaryType = decltype(__rng1.size());
//TODO: limit to diagonals per block, and only write to a block based index of temporary data
oneapi::dpl::__par_backend_hetero::__buffer<_TemporaryType> __temp_diags(__num_diagonals);

constexpr std::uint32_t __average_input_ele_size = (sizeof(_In1ValueT) + sizeof(_In2ValueT)) / 2;
// Partition into blocks of half SLM size
Contributor


Is this an empirically determined value and did you experiment with larger partition sizes?

My understanding is that on most architectures SLM is carved out of L1. Since we are trying to cache in L1, it may be possible to partition larger than the max SLM size and still have good caching, so long as the L1 is large enough. Of course, this is device-specific, and we do not want to overtune for a single device.
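
For reference, the half-SLM heuristic in the quoted code can be pictured roughly like this (hypothetical helper built on the standard SYCL 2020 device query):

#include <cstddef>
#include <sycl/sycl.hpp>

// Rough illustration: derive a partition size in elements from the device's
// local memory capacity, targeting half of SLM so the binary-search working
// set has a good chance of staying L1-resident.
std::size_t
approx_partition_elements(const sycl::queue& q, std::size_t average_input_ele_size)
{
    const std::size_t slm_bytes =
        q.get_device().get_info<sycl::info::device::local_mem_size>();
    return (slm_bytes / 2) / average_input_ele_size;
}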

Contributor Author


I played with a few options here, and this seemed to be the best option empirically. I can revisit and confirm with a more final version, though.

using _GenScanInput =
oneapi::dpl::__par_backend_hetero::__gen_set_op_from_known_balanced_path<_SetOperation, _TempData, _Compare>;
using _ScanInputTransform = oneapi::dpl::__par_backend_hetero::__get_zeroth_element;
using _WriteOp = oneapi::dpl::__par_backend_hetero::__write_multiple_to_id<oneapi::dpl::__internal::__pstl_assign>;

const std::int32_t __num_diagonals =
oneapi::dpl::__internal::__dpl_ceiling_div(__rng1.size() + __rng2.size(), __diagonal_spacing);

const std::uint32_t __work_group_size = __get_reduce_then_scan_workgroup_size(__q);
Contributor


Why do we calculate the work group size here instead of within __parallel_transform_reduce_then_scan? It looks to me like we could do this in __parallel_transform_reduce_then_scan and remove its __work_group_size parameter.

Contributor Author


It's a good question. I originally pulled this out to use in the calculation of the partitioning size.
I'm going to explore partition size a bit more. If it is not useful there, then I will fold it back into __parallel_transform_reduce_then_scan.
Thanks for calling this out.

@danhoeflinger danhoeflinger force-pushed the dev/dhoeflin/partition_set_algs branch from 4e4ca84 to b3c9236 on July 14, 2025 20:24