Partition Set Algorithm Balanced Path #2318
base: main
Conversation
Pull Request Overview
This PR introduces a partitioning phase to the set algorithms, enhancing performance for large input sizes by establishing binary search boundaries that optimize cache usage. Key changes include updating the __gen_set_balanced_path template to accept an additional bounds provider parameter, adding new helper functions (__decode_balanced_path_temp_data, __encode_balanced_path_temp_data) for balanced path processing, and integrating a new partition kernel for the balanced path phase.
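The new `__encode_balanced_path_temp_data`/`__decode_balanced_path_temp_data` helpers are not shown in this overview. As a hedged sketch of the packing scheme the PR describes elsewhere (the diagonal position fits in the positive portion of the temporary type, so the "star" flag can use the sign bit), here is an illustrative pair of helpers; the names, the signed type, and the offset-by-one encoding are all assumptions for clarity, not the actual oneDPL implementation:

```cpp
#include <cassert>
#include <cstdint>

// Illustrative temporary type; the PR uses the type of the range size.
using _TemporaryType = std::int64_t;

// Pack a balanced-path position and a "star" flag into one value.
// Negative values mark starred diagonals; positions are offset by 1 so
// position 0 remains representable in both signs.
inline _TemporaryType
encode_balanced_path(_TemporaryType __pos, bool __star)
{
    return __star ? -(__pos + 1) : __pos + 1;
}

// Recover the position and star flag from an encoded value.
inline void
decode_balanced_path(_TemporaryType __enc, _TemporaryType& __pos, bool& __star)
{
    __star = __enc < 0;
    __pos = (__star ? -__enc : __enc) - 1;
}
```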
Reviewed Changes
Copilot reviewed 4 out of 4 changed files in this pull request and generated no comments.
| File | Description |
|---|---|
| test/general/implementation_details/device_copyable.pass.cpp | Updated static_asserts to include the new bounds provider parameter |
| include/oneapi/dpl/pstl/hetero/dpcpp/sycl_traits.h | Modified the __gen_set_balanced_path specialization to include _BoundsProvider |
| include/oneapi/dpl/pstl/hetero/dpcpp/parallel_backend_sycl_reduce_then_scan.h | Added helper functions and modified the balanced path computation to include partitioning support and a safeguard against out-of-bounds element access |
| include/oneapi/dpl/pstl/hetero/dpcpp/parallel_backend_sycl.h | Updated the __parallel_set_reduce_then_scan integration with the new bounds provider and partitioning kernel |
Comments suppressed due to low confidence (4)
include/oneapi/dpl/pstl/hetero/dpcpp/sycl_traits.h:458
- The specialization of __gen_set_balanced_path now includes the _BoundsProvider parameter; please verify that all downstream usages are updated accordingly to maintain consistent API behavior.
```cpp
template <typename _SetOpCount, typename _BoundsProvider, typename _Compare>
```
include/oneapi/dpl/pstl/hetero/dpcpp/parallel_backend_sycl.h:1086
- [nitpick] When constructing _GenReduceInput with the new _BoundsProvider, it would be helpful to document the role of __diagonal_spacing and __partition_size in determining partition sizes, ensuring that readers understand how these values impact performance.
```cpp
_BoundsProvider{__diagonal_spacing, __partition_size}, __comp};
```
include/oneapi/dpl/pstl/hetero/dpcpp/parallel_backend_sycl_reduce_then_scan.h:698
- The change from returning 0 to clamping __i_elem to __rng1.size() + __rng2.size() - 1 may affect the algorithm's edge-case handling; please confirm that this adjusted behavior correctly reflects the intended semantics.
```cpp
if (__i_elem >= __rng1.size() + __rng2.size())
```
include/oneapi/dpl/pstl/hetero/dpcpp/parallel_backend_sycl_reduce_then_scan.h:821
- [nitpick] Retrieving __tile_size from __gen_input.__get_bounds is critical for partitioning; please ensure that __tile_size is always correctly initialized and consistent for various input sizes to avoid unexpected partition boundaries.
```cpp
std::size_t __tile_size = __gen_input.__get_bounds.__tile_size;
```
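The clamping behavior flagged in the third comment above can be sketched in isolation. This is a standalone illustration of the semantics under discussion (the function name is hypothetical): an out-of-range combined element index is clamped to the last valid position of the two ranges instead of being mapped to 0.

```cpp
#include <cassert>
#include <cstddef>

// Clamp a combined element index to the valid range [0, n1 + n2 - 1].
// Returning the last valid index (rather than 0) keeps out-of-range
// diagonals anchored at the end of the merged domain.
inline std::size_t
clamp_i_elem(std::size_t __i_elem, std::size_t __n1, std::size_t __n2)
{
    if (__i_elem >= __n1 + __n2)
        return __n1 + __n2 - 1; // clamp instead of returning 0
    return __i_elem;
}
```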
Original:
```cpp
__parallel_set_balanced_path_partition(sycl::queue& __q, _InRng&& __in_rng, std::size_t __num_diagonals,
                                       _GenReduceInput __gen_reduce_input)
```
Suggested change:
```cpp
__parallel_set_balanced_path_partition(sycl::queue& __q, const _InRng& __in_rng, std::size_t __num_diagonals,
                                       _GenReduceInput __gen_reduce_input)
```
I'm not sure I understand why you want to change this to const lvalue ref rather than forwarding.
As far as I understand, `__in_rng` describes some source data, which is const and can't be modified inside this function. So I think it is better to express this semantic with a const reference instead of a universal reference. As evidence of this point, let's take a look at how you call `__parallel_set_balanced_path_partition`:
```cpp
if (__total_size >= __partition_threshold)
{
    __partition_event =
        __parallel_set_balanced_path_partition<_CustomName>(__q, __in_rng, __num_diagonals, __gen_reduce_input);
}
```
What happens inside:
```cpp
template <typename _CustomName, typename _InRng, typename _GenReduceInput>
sycl::event
__parallel_set_balanced_path_partition(sycl::queue& __q, _InRng&& __in_rng, std::size_t __num_diagonals,
                                       _GenReduceInput __gen_reduce_input)
{
    //...
    return __partition_submitter(__q, std::forward<_InRng>(__in_rng), __num_diagonals);
}

template <typename _GenInput, typename _KernelName>
struct __partition_set_balanced_path_submitter;

template <typename _GenInput, typename... _KernelName>
struct __partition_set_balanced_path_submitter<_GenInput, __internal::__optional_kernel_name<_KernelName...>>
{
    template <typename _InRng>
    sycl::event
    operator()(sycl::queue& __q, _InRng&& __in_rng, std::size_t __num_diagonals) const
    {
        //...
        __gen_input.__calc_partition_bounds(__in_rng, __id);
        //...
    }
    //...
};
```
And then we are in:
```cpp
// Entry point for partitioning phase
template <typename _InRng, typename _IndexT>
void
__calc_partition_bounds(const _InRng& __in_rng, _IndexT __id) const
{
    calc_and_store_balanced_path(__in_rng, __id, oneapi::dpl::__par_backend_hetero::__get_bounds_simple{});
}
```
So this chain of calls shows us that a universal reference for `__in_rng` doesn't make sense, and `const _InRng& __in_rng` is quite correct throughout this whole chain of calls.
If you want, you may take a look at PR #2346 and merge it into your branch.
I have made a first pass over the implementation. I like how the patch unifies handling between the partitioned and non-partitioned bounds.
```cpp
// Should be safe to use the type of the range size as the temporary type. Diagonal index will fit in the positive
// portion of the range so star flag can use sign bit.
using _TemporaryType = decltype(__rng1.size());
//TODO: limit to diagonals per block, and only write to a block based index of temporary data
oneapi::dpl::__par_backend_hetero::__buffer<_TemporaryType> __temp_diags(__num_diagonals);

constexpr std::uint32_t __average_input_ele_size = (sizeof(_In1ValueT) + sizeof(_In2ValueT)) / 2;
// Partition into blocks of half SLM size
```
Is this an empirically determined value and did you experiment with larger partition sizes?
My understanding is that on most architectures SLM is carved out of L1. Since we are trying to cache in L1, it may be possible to actually partition larger than the max SLM size and still have good caching so long as the L1 is large enough. Of course this is device specific and we do not want to overtune for a single device.
I played with a few options here and this seemed to be the best option empirically. I can revisit and confirm now with a more final version though.
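The half-SLM heuristic under discussion can be stated as a small formula. This is a sketch of the shape of the calculation only; the function name, parameters, and the example byte values are assumptions, not the PR's actual tuning code:

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Choose a partition size (in elements) so that one block of input data
// occupies roughly half of the device's shared local memory, leaving the
// remainder of L1/SLM for other working data.
inline std::size_t
partition_size_from_slm(std::size_t __slm_bytes, std::uint32_t __avg_ele_size)
{
    return (__slm_bytes / 2) / __avg_ele_size; // elements per partition block
}
```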
```cpp
using _GenScanInput =
    oneapi::dpl::__par_backend_hetero::__gen_set_op_from_known_balanced_path<_SetOperation, _TempData, _Compare>;
using _ScanInputTransform = oneapi::dpl::__par_backend_hetero::__get_zeroth_element;
using _WriteOp = oneapi::dpl::__par_backend_hetero::__write_multiple_to_id<oneapi::dpl::__internal::__pstl_assign>;

const std::int32_t __num_diagonals =
    oneapi::dpl::__internal::__dpl_ceiling_div(__rng1.size() + __rng2.size(), __diagonal_spacing);

const std::uint32_t __work_group_size = __get_reduce_then_scan_workgroup_size(__q);
```
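`__dpl_ceiling_div` in the hunk above has integer ceiling-division semantics: the number of diagonals is the combined input size divided by the diagonal spacing, rounded up. A standalone equivalent for illustration (not the oneDPL internal itself):

```cpp
#include <cassert>
#include <cstddef>

// Integer ceiling division: ceil(a / b) without floating point.
// Assumes b > 0 and a + b - 1 does not overflow.
inline std::size_t
ceiling_div(std::size_t __a, std::size_t __b)
{
    return (__a + __b - 1) / __b;
}
```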
Why do we calculate the work group size here instead of within `__parallel_transform_reduce_then_scan`? It looks to me like we can do this in `__parallel_transform_reduce_then_scan` and remove its `__work_group_size` parameter.
It's a good question. I originally pulled this out to use in the calculation of the partitioning size.
I'm going to explore partition size a bit more. If it is not useful, then I will fold it back into `__parallel_transform_reduce_then_scan`.
Thanks for calling this out.
Signed-off-by: Dan Hoeflinger <[email protected]>
Add partitioning kernel to set APIs balanced path algorithm.
Adds a partitioning phase that makes a sparse pass over the input data to establish binary search boundaries for the main run. This allows the memory access pattern to fit within the L1 cache for the main kernels when they perform the binary searches that establish balanced path intersections.
This improves performance for large sizes of the set algorithms. (When combined with #2317, it provides a nice combination of performance improvements for both large and small sizes of the set algorithms.)
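The two-phase idea described above can be sketched on the host. This is an illustration of the concept only, not the oneDPL kernels: the function name, data, and partition size are hypothetical, and the merge-path search stands in for the balanced-path search. A sparse pass records merge-path crossings at coarse diagonal boundaries; the later dense per-diagonal searches then only need to probe within one partition, keeping their working set small enough to stay cache-resident.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Sparse pass: for every coarse diagonal (one per __partition_size combined
// elements), binary-search the merge-path crossing and record how many
// elements of __a lie before it. These positions bound the later searches.
std::vector<std::size_t>
sparse_partition_bounds(const std::vector<int>& __a, const std::vector<int>& __b,
                        std::size_t __partition_size)
{
    std::vector<std::size_t> __bounds;
    const std::size_t __total = __a.size() + __b.size();
    for (std::size_t __diag = 0; __diag <= __total; __diag += __partition_size)
    {
        // Valid range of "elements taken from __a" on this diagonal.
        std::size_t __lo = __diag > __b.size() ? __diag - __b.size() : 0;
        std::size_t __hi = std::min(__diag, __a.size());
        while (__lo < __hi)
        {
            std::size_t __mid = (__lo + __hi) / 2;
            if (__a[__mid] <= __b[__diag - __mid - 1])
                __lo = __mid + 1; // crossing lies to the right
            else
                __hi = __mid;
        }
        __bounds.push_back(__lo);
    }
    return __bounds;
}
```

A dense kernel handling diagonals inside partition `k` would then search only between `__bounds[k]` and `__bounds[k + 1]` instead of over the full inputs.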