Commit 09a34a0

timmiesmith, mmichel11, Copilot, and dmitriy-sobolev authored
cherry-pick release notes to release branch (#2494)
Signed-off-by: Matthew Michel <[email protected]>
Co-authored-by: Copilot <[email protected]>
Co-authored-by: Dmitriy Sobolev <[email protected]>

* Update a known issue for range-based count_if (#2344)

---------

Signed-off-by: Matthew Michel <[email protected]>
Co-authored-by: Matthew Michel <[email protected]>
Co-authored-by: Copilot <[email protected]>
Co-authored-by: Dmitriy Sobolev <[email protected]>
1 parent f0862e1 commit 09a34a0

2 files changed: +89 -21 lines changed

documentation/library_guide/kernel_templates/single_pass_scan.rst

Lines changed: 32 additions & 15 deletions
@@ -12,9 +12,13 @@ is an implementation of the Decoupled Look-back [#fnote1]_ scan algorithm.
1212

1313
The algorithm is designed to be compatible with a variety of devices that provide at least parallel
1414
forward progress guarantees between work-groups, due to cross-work-group communication. Additionally, it
15-
requires support for device USM (Unified Shared Memory). It has been verified to be compatible
16-
with `Intel® Data Center GPU Max Series
17-
<https://www.intel.com/content/www/us/en/products/details/discrete-gpus/data-center-gpu/max-series/products.html>`_.
15+
requires support for device USM (Unified Shared Memory) and a sub-group size of 32. It has been verified to be compatible
16+
with `Intel® Data Center GPU Max 1100
17+
<https://www.intel.com/content/www/us/en/products/sku/232876/intel-data-center-gpu-max-1100/specifications.html>`_
18+
, `Intel® Data Center GPU Max 1550
19+
<https://www.intel.com/content/www/us/en/products/sku/232873/intel-data-center-gpu-max-1550/specifications.html>`_
20+
, and `Intel® Arc™ B580 Graphics
21+
<https://www.intel.com/content/www/us/en/products/sku/241598/intel-arc-b580-graphics/specifications.html>`_.
1822

1923
A synopsis of the ``inclusive_scan`` function is provided below:
2024

@@ -69,7 +73,8 @@ Parameters
6973

7074
**Type Requirements**:
7175

72-
- The element type of sequence to scan must be a 32-bit or 64-bit bit C++ integral or floating-point type.
76+
- The element type of sequence to scan must be an 8-bit, 16-bit, 32-bit, or 64-bit C++ integral or floating-point
77+
type.
7378
- The result is non-deterministic if the binary operator is non-associative (such as in floating-point addition)
7479
or non-commutative.
7580

@@ -81,9 +86,6 @@ Parameters
8186
- The function is intended to be asynchronous, but in some cases, the function will not return until the algorithm fully completes.
8287
Although intended in the future to be an asynchronous call, the algorithm is currently synchronous.
8388
- The SYCL device associated with the provided queue must support 64-bit atomic operations if the element type is 64-bits.
84-
- There must be a known identity value for the provided combination of the element type and the binary operation. That is,
85-
``sycl::has_known_identity_v`` must evaluate to true. Such operators are listed in
86-
the `SYCL 2020 specification <https://registry.khronos.org/SYCL/specs/sycl-2020/html/sycl-2020.html#table.identities>`_.
8789

8890
Return Value
8991
------------
@@ -145,18 +147,19 @@ inclusive_scan Example
145147
Memory Requirements
146148
-------------------
147149

148-
The algorithm uses global and local device memory (see `SYCL 2020 Specification
150+
The algorithm uses global, local, and private device memory (see `SYCL 2020 Specification
149151
<https://registry.khronos.org/SYCL/specs/sycl-2020/html/sycl-2020.html#_sycl_device_memory_model>`__)
150152
for intermediate data storage. For the algorithm to operate correctly, there must be enough memory on the device.
151153
If there is not enough global device memory, a ``std::bad_alloc`` exception is thrown.
152-
The behavior is undefined if there is not enough local memory.
153-
The amount of memory that is required depends on input data and configuration parameters, as described below.
154+
The behavior is undefined if there is not enough local memory. If there is insufficient private register memory, then
155+
algorithmic performance will degrade. The amount of memory that is required depends on input data and configuration
156+
parameters, as described below.
154157

155158
Global Memory Requirements
156159
--------------------------
157160

158161
Global memory is used for copying the input sequence and storing internal data such as status flags.
159-
The used amount depends on many parameters; below is an approximation in bytes:
162+
The used amount depends on many parameters; below is an upper bound approximation in bytes:
160163

161164
2 * V * N \ :sub:`flags` + 4 * N \ :sub:`flags`
162165

@@ -174,11 +177,19 @@ It can be approximated by dividing the number of input elements N by the product
174177
Local Memory Requirements
175178
-------------------------
176179

177-
Local memory is used for storing elements of the input that are to be scanned by a single work-group.
178-
The used amount is denoted as N\ :sub:`elems_per_workgroup`, which equals to ``sizeof(key_type) * param.data_per_workitem * param.workgroup_size``.
180+
Local memory is used for storing partial scan computations per sub-group in a work-group.
181+
The used amount is denoted as N\ :sub:`sub_group_carries`, which equals ``sizeof(key_type) * param.workgroup_size / sub_group_size``
182+
where ``sub_group_size`` is the sub-group size, currently fixed at 32.
179183

180-
Some amount of local memory is also used by the calls to SYCL's group reduction and group scan. The amount of memory used particularly
181-
for these calls is implementation dependent.
184+
Private Memory Requirements
185+
---------------------------
186+
187+
The implementation is most performant when all private memory is allocated to registers and does not spill into global
188+
memory scratch space reserved for the kernel. The amount of private memory used per work-group is ``V * W * D + ε``
189+
where V is the number of bytes needed to store the input value type, W is ``param.workgroup_size``, D is
190+
``param.data_per_workitem``, and ε is the remaining private memory used by local variables and the binary operation. ε
191+
is expected to carry a small footprint in most common use cases. If the binary operation uses many registers, then the
192+
impact of ε may be of greater significance.
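
The memory formulas quoted in this file can be evaluated with a small helper when comparing candidate parameters. The following is only a sketch: ``estimate_scan_memory`` and ``ScanMemoryEstimate`` are hypothetical names that are not part of oneDPL, ``n_flags`` must be supplied by the caller, V is assumed to be the same value-type size in every formula, and the ε term is ignored.

```cpp
#include <cstddef>

// Hypothetical helper, not a oneDPL API: evaluates the approximations quoted in the documentation.
struct ScanMemoryEstimate
{
    std::size_t global_bytes;  // upper bound: 2 * V * N_flags + 4 * N_flags
    std::size_t local_bytes;   // V * workgroup_size / sub_group_size (sub-group carries)
    std::size_t private_bytes; // V * W * D per work-group, excluding the epsilon term
};

constexpr ScanMemoryEstimate
estimate_scan_memory(std::size_t value_bytes,       // V: bytes per element (assumed equal to sizeof(key_type))
                     std::size_t n_flags,           // N_flags: supplied by the caller
                     std::size_t workgroup_size,    // W: param.workgroup_size
                     std::size_t data_per_workitem, // D: param.data_per_workitem
                     std::size_t sub_group_size = 32)
{
    return {2 * value_bytes * n_flags + 4 * n_flags,
            value_bytes * workgroup_size / sub_group_size,
            value_bytes * workgroup_size * data_per_workitem};
}

// Worked example: 4-byte elements, 1024 flags, work-group size 256, 8 elements per work-item.
static_assert(estimate_scan_memory(4, 1024, 256, 8).global_bytes == 12288, "2*4*1024 + 4*1024");
static_assert(estimate_scan_memory(4, 1024, 256, 8).local_bytes == 32, "4*256/32");
static_assert(estimate_scan_memory(4, 1024, 256, 8).private_bytes == 8192, "4*256*8");
```

The private-memory figure is a per-work-group total. Whether it fits in registers depends on the device and on ε, so the compiler's spill warnings remain the authoritative check.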
182193

183194
-----------------------------------------
184195
Recommended Settings for Best Performance
@@ -195,6 +206,12 @@ The initial configuration may be selected according to these high-level guidelin
195206
compute cores is key for better performance. To allow sufficient work to satisfy all
196207
X\ :sup:`e`-cores [#fnote2]_ on a GPU, use ``param.data_per_workitem * param.workgroup_size ≈ N / xe_core_count``.
197208

209+
- For large inputs that fully saturate compute cores, maximizing ``param.workgroup_size`` and ``param.data_per_workitem``
210+
without spilling out of register memory results in best performance. The Intel® oneAPI DPC++ Compiler reports warnings
211+
when register spillage occurs. This may be used alongside guidance provided in the
212+
`oneAPI GPU Optimization Guide <https://www.intel.com/content/www/us/en/docs/oneapi/optimization-guide-gpu/2025-2/registers-and-performance.html>`_
213+
and benchmarking parameter sweeps to determine performant kernel template parameters for your use case.
214+
198215
- On devices with multiple tiles, it may prove beneficial to experiment with different tile hierarchies as described
199216
in `Options for using a GPU Tile Hierarchy <https://www.intel.com/content/www/us/en/developer/articles/technical/flattening-gpu-tile-hierarchy.html>`_.
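
As a rough illustration of the guideline ``param.data_per_workitem * param.workgroup_size ≈ N / xe_core_count``, a starting value for ``data_per_workitem`` can be derived from the input size. This is a sketch under stated assumptions: ``suggest_data_per_workitem`` is a hypothetical helper, not a oneDPL function, and the upper limit ``max_data_per_workitem`` is a caller-chosen stand-in for the point at which register spilling begins.

```cpp
#include <algorithm>
#include <cstddef>

// Hypothetical helper, not a oneDPL API: picks data_per_workitem so that
// data_per_workitem * workgroup_size is close to n / xe_core_count.
inline std::size_t
suggest_data_per_workitem(std::size_t n,              // number of input elements
                          std::size_t xe_core_count,  // Xe-cores on the target device
                          std::size_t workgroup_size, // chosen param.workgroup_size
                          std::size_t max_data_per_workitem) // caller-chosen cap (assumption)
{
    const std::size_t work_items = xe_core_count * workgroup_size;
    if (work_items == 0 || max_data_per_workitem == 0)
        return 1; // degenerate input: fall back to the smallest valid value
    // Round n / work_items to the nearest integer, then clamp to [1, max_data_per_workitem].
    const std::size_t rounded = (n + work_items / 2) / work_items;
    return std::min(std::max<std::size_t>(rounded, 1), max_data_per_workitem);
}
```

For example, with 262144 elements, 32 Xe-cores, and a work-group size of 256, the helper returns 32. Treat the result as the first point of a parameter sweep rather than a final setting.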
200217

documentation/release_notes.rst

Lines changed: 57 additions & 6 deletions
@@ -11,14 +11,67 @@ creating efficient heterogeneous applications.
1111
New in 2022.10.0
1212
================
1313

14+
Deprecation Notices
15+
-------------------
16+
The ``ONEDPL_USE_AOT_COMPILATION`` and ``ONEDPL_AOT_ARCH`` CMake options are deprecated and will be removed in a future
17+
release. Please use the relevant compiler flags to enable this feature.
18+
19+
New Features
20+
------------
21+
- Added parallel range algorithms in ``namespace oneapi::dpl::ranges``: ``set_intersection``, ``set_union``,
22+
``set_difference``, ``set_symmetric_difference``, ``includes``, ``unique``, ``unique_copy``, ``destroy``,
23+
``uninitialized_fill``, ``uninitialized_move``, ``uninitialized_copy``, ``uninitialized_value_construct``,
24+
``uninitialized_default_construct``, ``reverse``, ``reverse_copy``, ``swap_ranges``. These algorithms operate with
25+
C++20 random access ranges.
26+
- Improved performance of ``gpu::inclusive_scan`` kernel template and added support for binary operator and type
27+
combinations which do not have a SYCL known identity.
28+
- Improved performance of ``inclusive_scan_by_segment``, ``exclusive_scan_by_segment``, ``set_union``,
29+
``set_difference``, ``set_intersection``, and ``set_symmetric_difference`` when using device policies.
30+
- Improved performance of search operations (e.g., ``find``, ``all_of``, ``equal``, ``search``, etc.), ``is_heap`` and
31+
``is_heap_until`` algorithms on Intel® Arc™ B-series GPU devices.
32+
33+
Fixed Issues
34+
------------
35+
- Removed requirement of GPU double precision support to use ``set_union``, ``set_difference``, ``set_intersection``,
36+
and ``set_symmetric_difference`` on Windows operating systems.
37+
- Removed default-constructible requirements from the value type for ``reduce`` and ``transform_reduce`` algorithms,
38+
as well as copy-constructible requirements when these algorithms are used with a native ("host") policy.
39+
- Fixed an issue with ``ranges::merge`` when projections of the two input ranges were not the same.
40+
- Fixed ``equal`` returning ``false`` for empty input sequences; now it returns ``true``.
41+
- Fixed a compilation error **SYCL kernel cannot use exceptions** occurring with libstdc++ version 10 when calling
42+
``adjacent_find``, ``is_sorted`` and ``is_sorted_until`` range algorithms with device policies.
43+
- Fixed an issue with ``PSTL_USE_NONTEMPORAL_STORES`` macro having no effect.
44+
- Fixed a bug where ``unique`` called with a device policy returned an incorrect result iterator.
45+
- Fixed a bug in ``exclusive_scan``, ``inclusive_scan``, ``transform_exclusive_scan``, ``transform_inclusive_scan``,
46+
``exclusive_scan_by_segment``, and ``inclusive_scan_by_segment`` algorithms when using device policies with different
47+
input and output value types.
48+
- Fixed a bug in return value types of ``minmax_element`` and ``mismatch`` range algorithms.
49+
- Fixed compile errors in ``set_union`` and ``set_symmetric_difference`` when using device policies
50+
with different second-input and output value types.
51+
1452
Known Issues and Limitations
1553
----------------------------
1654
New in This Release
1755
^^^^^^^^^^^^^^^^^^^
18-
- Calling ``histogram`` algorithm with a device execution policy may cause a segmentation fault in
19-
Intel® oneAPI DPC++/C++ Compiler 2025.3 when compiling SYCL kernels for CPU devices.
20-
To avoid this, define ``ONEDPL_DISABLE_HISTOGRAM_REGISTER_REDUCTION`` macro to a non-zero value
21-
prior to including oneDPL header files.
56+
- ``copy_if``, ``unique_copy``, ``set_union``, ``set_intersection``, ``set_difference``, ``set_symmetric_difference``
57+
range algorithms require the output range to have sufficient size to hold all resulting elements.
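
One way to satisfy the output-size requirement is to allocate for the worst case and trim afterwards. The sketch below uses the serial ``std::copy_if`` as a stand-in for the parallel range algorithm, so it shows only the sizing pattern and not the oneDPL call itself; ``copy_even`` is a hypothetical helper.

```cpp
#include <algorithm>
#include <vector>

// Hypothetical helper: std::copy_if stands in for the parallel range algorithm.
inline std::vector<int> copy_even(const std::vector<int>& in)
{
    // Worst case for copy_if: every input element satisfies the predicate.
    std::vector<int> out(in.size());
    auto last = std::copy_if(in.begin(), in.end(), out.begin(),
                             [](int x) { return x % 2 == 0; });
    // Drop the unused tail so the result holds only the copied elements.
    out.erase(last, out.end());
    return out;
}
```

The same pattern applies to the set algorithms with a different bound. For ``set_union``, for instance, the result can hold as many elements as both inputs combined.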
58+
59+
Existing Issues
60+
^^^^^^^^^^^^^^^
61+
See oneDPL Guide for other `restrictions and known limitations`_.
62+
63+
- ``histogram`` algorithm requires the output value type to be an integral type no larger than four bytes
64+
when used with a device policy on hardware that does not support 64-bit atomic operations.
65+
- For ``transform_exclusive_scan`` and ``exclusive_scan`` to run in-place (that is, with the same data
66+
used for both input and destination) and with an execution policy of ``unseq`` or ``par_unseq``,
67+
it is required that the provided input and destination iterators are equality comparable.
68+
Furthermore, the equality comparison of the input and destination iterator must evaluate to true.
69+
If these conditions are not met, the result of these algorithm calls is undefined.
70+
- Incorrect results may be produced by ``exclusive_scan``, ``inclusive_scan``, ``transform_exclusive_scan``,
71+
``transform_inclusive_scan``, ``exclusive_scan_by_segment``, ``inclusive_scan_by_segment``, ``reduce_by_segment``
72+
with ``unseq`` or ``par_unseq`` policy when compiled by Intel® oneAPI DPC++/C++ Compiler 2024.1 or earlier
73+
with ``-fiopenmp``, ``-fiopenmp-simd``, ``-qopenmp``, ``-qopenmp-simd`` options on Linux.
74+
To avoid the issue, pass ``-fopenmp`` or ``-fopenmp-simd`` option instead.
2275

2376
New in 2022.9.0
2477
===============
@@ -71,8 +124,6 @@ See oneDPL Guide for other `restrictions and known limitations`_.
71124
To avoid the issue, pass ``-fopenmp`` or ``-fopenmp-simd`` option instead.
72125
- With libstdc++ version 10, the compilation error *SYCL kernel cannot use exceptions* occurs
73126
when calling the range-based ``adjacent_find``, ``is_sorted`` or ``is_sorted_until`` algorithms with device policies.
74-
- The range-based ``count_if`` may produce incorrect results on Intel® Data Center GPU Max Series when the driver version
75-
is "Rolling 2507.12" and newer.
76127

77128
New in 2022.8.0
78129
===============
