cherry-pick release notes to release branch (#2494)

timmiesmith · mmichel11 · Copilot · web-flow · commit 09a34a0b31ad · 2025-10-08T09:02:01.000-05:00
Signed-off-by: Matthew Michel <matthew.michel@intel.com>
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
Co-authored-by: Dmitriy Sobolev <Dmitriy.Sobolev@intel.com>

* Update a known issue for range-based count_if (#2344)

---------

Signed-off-by: Matthew Michel <matthew.michel@intel.com>
Co-authored-by: Matthew Michel <matthew.michel@intel.com>
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
Co-authored-by: Dmitriy Sobolev <Dmitriy.Sobolev@intel.com>
diff --git a/documentation/library_guide/kernel_templates/single_pass_scan.rst b/documentation/library_guide/kernel_templates/single_pass_scan.rst
@@ -12,9 +12,13 @@ is an implementation of the Decoupled Look-back [#fnote1]_ scan algorithm.
 
 The algorithm is designed to be compatible with a variety of devices that provide at least parallel
 forward progress guarantees between work-groups, due to cross-work-group communication. Additionally, it
-requires support for device USM (Unified Shared Memory). It has been verified to be compatible
-with `Intel® Data Center GPU Max Series
-<https://www.intel.com/content/www/us/en/products/details/discrete-gpus/data-center-gpu/max-series/products.html>`_.
+requires support for device USM (Unified Shared Memory) and a sub-group size of 32. It has been verified to be compatible
+with `Intel® Data Center GPU Max 1100
+<https://www.intel.com/content/www/us/en/products/sku/232876/intel-data-center-gpu-max-1100/specifications.html>`_
+, `Intel® Data Center GPU Max 1550
+<https://www.intel.com/content/www/us/en/products/sku/232873/intel-data-center-gpu-max-1550/specifications.html>`_
+, and `Intel® Arc™ B580 Graphics
+<https://www.intel.com/content/www/us/en/products/sku/241598/intel-arc-b580-graphics/specifications.html>`_.
 
 A synopsis of the ``inclusive_scan`` function is provided below:
 
@@ -69,7 +73,8 @@ Parameters
 
 **Type Requirements**:
 
-- The element type of sequence to scan must be a 32-bit or 64-bit bit C++ integral or floating-point type.
+- The element type of sequence to scan must be an 8-bit, 16-bit, 32-bit, or 64-bit C++ integral or floating-point
+  type.
 - The result is non-deterministic if the binary operator is non-associative (such as in floating-point addition)
   or non-commutative.
 
@@ -81,9 +86,6 @@ Parameters
   - The function is intended to be asynchronous, but in some cases, the function will not return until the algorithm fully completes.
     Although intended in the future to be an asynchronous call, the algorithm is currently synchronous.
   - The SYCL device associated with the provided queue must support 64-bit atomic operations if the element type is 64-bits.
-  - There must be a known identity value for the provided combination of the element type and the binary operation. That is,
-    ``sycl::has_known_identity_v`` must evaluate to true. Such operators are listed in
-    the `SYCL 2020 specification <https://registry.khronos.org/SYCL/specs/sycl-2020/html/sycl-2020.html#table.identities>`_.
 
 Return Value
 ------------
@@ -145,18 +147,19 @@ inclusive_scan Example
 Memory Requirements
 -------------------
 
-The algorithm uses global and local device memory (see `SYCL 2020 Specification
+The algorithm uses global, local, and private device memory (see `SYCL 2020 Specification
 <https://registry.khronos.org/SYCL/specs/sycl-2020/html/sycl-2020.html#_sycl_device_memory_model>`__)
 for intermediate data storage. For the algorithm to operate correctly, there must be enough memory on the device.
 If there is not enough global device memory, a ``std::bad_alloc`` exception is thrown.
-The behavior is undefined if there is not enough local memory.
-The amount of memory that is required depends on input data and configuration parameters, as described below.
+The behavior is undefined if there is not enough local memory. If there is insufficient private register memory, then
+algorithmic performance will degrade. The amount of memory that is required depends on input data and configuration
+parameters, as described below.
 
 Global Memory Requirements
 --------------------------
 
 Global memory is used for copying the input sequence and storing internal data such as status flags.
-The used amount depends on many parameters; below is an approximation in bytes:
+The used amount depends on many parameters; below is an upper bound approximation in bytes:
 
 2 * V * N \ :sub:`flags` + 4 * N \ :sub:`flags`
 
@@ -174,11 +177,19 @@ It can be approximated by dividing the number of input elements N by the product
 Local Memory Requirements
 -------------------------
 
-Local memory is used for storing elements of the input that are to be scanned by a single work-group.
-The used amount is denoted as N\ :sub:`elems_per_workgroup`, which equals to ``sizeof(key_type) * param.data_per_workitem * param.workgroup_size``.
+Local memory is used for storing partial scan computations per sub-group in a work-group.
+The used amount is denoted as N\ :sub:`sub_group_carries`, which equals ``sizeof(key_type) * param.workgroup_size / sub_group_size``,
+where ``sub_group_size`` is the sub-group size, currently fixed at 32.
 
-Some amount of local memory is also used by the calls to SYCL's group reduction and group scan. The amount of memory used particularly
-for these calls is implementation dependent.
+Private Memory Requirements
+---------------------------
+
+The implementation is most performant when all private memory is allocated to registers and does not spill into global
+memory scratch space reserved for the kernel. The amount of private memory used per work-group is ``V * W * D + ε``
+where V is the number of bytes needed to store the input value type, W is ``param.workgroup_size``, D is
+``param.data_per_workitem``, and ε is the remaining private memory used by local variables and the binary operation. ε
+is expected to be small in most common use cases. If the binary operation uses many registers, then ε may be
+more significant.
 
 -----------------------------------------
 Recommended Settings for Best Performance
@@ -195,6 +206,12 @@ The initial configuration may be selected according to these high-level guidelin
   compute cores is key for better performance. To allow sufficient work to satisfy all
   X\ :sup:`e`-cores [#fnote2]_ on a GPU, use ``param.data_per_workitem * param.workgroup_size ≈ N / xe_core_count``.
 
+- For large inputs that fully saturate compute cores, maximizing ``param.workgroup_size`` and ``param.data_per_workitem``
+  without spilling out of register memory gives the best performance. The Intel® oneAPI DPC++ Compiler reports warnings
+  when register spilling occurs. These warnings may be used alongside guidance provided in the
+  `oneAPI GPU Optimization Guide <https://www.intel.com/content/www/us/en/docs/oneapi/optimization-guide-gpu/2025-2/registers-and-performance.html>`_
+  and benchmarking parameter sweeps to determine performant kernel template parameters for your use case.
+
 - On devices with multiple tiles, it may prove beneficial to experiment with different tile hierarchies as described
   in `Options for using a GPU Tile Hierarchy <https://www.intel.com/content/www/us/en/developer/articles/technical/flattening-gpu-tile-hierarchy.html>`_.
 
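The memory formulas documented in ``single_pass_scan.rst`` above can be sketched as a small compile-time calculator. This is only an illustration under stated assumptions, not part of oneDPL: the names ``scan_memory_estimate`` and ``estimate_scan_memory`` are hypothetical, N\ :sub:`flags` is assumed to be N divided by ``param.workgroup_size * param.data_per_workitem`` (rounded up), and the private memory figure omits ε.

```cpp
#include <cstddef>

// Hypothetical helper type for illustration; not a oneDPL API.
struct scan_memory_estimate
{
    std::size_t global_bytes_upper_bound; // 2 * V * N_flags + 4 * N_flags
    std::size_t local_bytes;              // V * W / sub_group_size
    std::size_t private_bytes_per_group;  // V * W * D, excluding epsilon
};

// The documentation states the sub-group size is currently fixed to 32.
constexpr std::size_t sub_group_size = 32;

// value_bytes       : V, bytes needed to store one input value
// n                 : N, number of input elements
// workgroup_size    : W, corresponds to param.workgroup_size
// data_per_workitem : D, corresponds to param.data_per_workitem
constexpr scan_memory_estimate
estimate_scan_memory(std::size_t value_bytes, std::size_t n,
                     std::size_t workgroup_size, std::size_t data_per_workitem)
{
    // Assumption: one status flag per work-group-sized chunk of the input.
    const std::size_t elems_per_group = workgroup_size * data_per_workitem;
    const std::size_t n_flags = (n + elems_per_group - 1) / elems_per_group;

    return scan_memory_estimate{
        2 * value_bytes * n_flags + 4 * n_flags,
        value_bytes * workgroup_size / sub_group_size,
        value_bytes * workgroup_size * data_per_workitem};
}

// Worked example: 2^20 elements of a 4-byte type, W = 256, D = 8.
constexpr scan_memory_estimate example =
    estimate_scan_memory(4, std::size_t{1} << 20, 256, 8);

static_assert(example.global_bytes_upper_bound == 6144, "512 flags: 4096 + 2048 bytes");
static_assert(example.local_bytes == 32, "8 sub-groups, 4 bytes each");
static_assert(example.private_bytes_per_group == 8192, "4 * 256 * 8 bytes");
```

Such an estimate may help narrow a parameter sweep before benchmarking; whether a given configuration actually spills registers still has to be checked against the compiler's warnings.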
diff --git a/documentation/release_notes.rst b/documentation/release_notes.rst
@@ -11,14 +11,67 @@ creating efficient heterogeneous applications.
 New in 2022.10.0
 ================
 
+Deprecation Notices
+-------------------
+The ``ONEDPL_USE_AOT_COMPILATION`` and ``ONEDPL_AOT_ARCH`` CMake options are deprecated and will be removed in a future
+release. Please use the relevant compiler flags to enable this feature.
+
+New Features
+------------
+- Added parallel range algorithms in ``namespace oneapi::dpl::ranges``: ``set_intersection``, ``set_union``,
+  ``set_difference``, ``set_symmetric_difference``, ``includes``, ``unique``, ``unique_copy``, ``destroy``,
+  ``uninitialized_fill``, ``uninitialized_move``, ``uninitialized_copy``, ``uninitialized_value_construct``,
+  ``uninitialized_default_construct``, ``reverse``, ``reverse_copy``, ``swap_ranges``. These algorithms operate with
+  C++20 random access ranges.
+- Improved performance of ``gpu::inclusive_scan`` kernel template and added support for binary operator and type
+  combinations which do not have a SYCL known identity.
+- Improved performance of ``inclusive_scan_by_segment``, ``exclusive_scan_by_segment``, ``set_union``,
+  ``set_difference``, ``set_intersection``, and ``set_symmetric_difference`` when using device policies.
+- Improved performance of search operations (e.g., ``find``, ``all_of``, ``equal``, ``search``), ``is_heap``, and
+  ``is_heap_until`` algorithms on Intel® Arc™ B-series GPU devices.
+
+Fixed Issues
+------------
+- Removed requirement of GPU double precision support to use ``set_union``, ``set_difference``, ``set_intersection``,
+  and ``set_symmetric_difference`` on Windows operating systems.
+- Removed default-constructible requirements from the value type for ``reduce`` and ``transform_reduce`` algorithms,
+  as well as copy-constructible requirements when these algorithms are used with a native ("host") policy.
+- Fixed an issue with ``ranges::merge`` when projections of the two input ranges were not the same.
+- Fixed ``equal`` returning ``false`` for empty input sequences; it now returns ``true``.
+- Fixed a compilation error **SYCL kernel cannot use exceptions** occurring with libstdc++ version 10 when calling
+  ``adjacent_find``, ``is_sorted`` and ``is_sorted_until`` range algorithms with device policies.
+- Fixed an issue with the ``PSTL_USE_NONTEMPORAL_STORES`` macro having no effect.
+- Fixed a bug where ``unique`` called with a device policy returned an incorrect result iterator.
+- Fixed a bug in ``exclusive_scan``, ``inclusive_scan``, ``transform_exclusive_scan``, ``transform_inclusive_scan``,
+  ``exclusive_scan_by_segment``, and ``inclusive_scan_by_segment`` algorithms when using device policies with different
+  input and output value types.
+- Fixed a bug in the return value types of ``minmax_element`` and ``mismatch`` range algorithms.
+- Fixed compile errors in ``set_union`` and ``set_symmetric_difference`` when using device policies
+  with different second-input and output value types.
+
 Known Issues and Limitations
 ----------------------------
 New in This Release
 ^^^^^^^^^^^^^^^^^^^
-- Calling ``histogram`` algorithm with a device execution policy may cause a segmentation fault in
-  Intel® oneAPI DPC++/C++ Compiler 2025.3 when compiling SYCL kernels for CPU devices.
-  To avoid this, define ``ONEDPL_DISABLE_HISTOGRAM_REGISTER_REDUCTION`` macro to a non-zero value
-  prior to including oneDPL header files.
+- ``copy_if``, ``unique_copy``, ``set_union``, ``set_intersection``, ``set_difference``, ``set_symmetric_difference``
+  range algorithms require the output range to have sufficient size to hold all resulting elements.
+
+Existing Issues
+^^^^^^^^^^^^^^^
+See oneDPL Guide for other `restrictions and known limitations`_.
+
+- ``histogram`` algorithm requires the output value type to be an integral type no larger than four bytes
+  when used with a device policy on hardware that does not support 64-bit atomic operations.
+- For ``transform_exclusive_scan`` and ``exclusive_scan`` to run in-place (that is, with the same data
+  used for both input and destination) and with an execution policy of ``unseq`` or ``par_unseq``,
+  it is required that the provided input and destination iterators are equality comparable.
+  Furthermore, the equality comparison of the input and destination iterator must evaluate to true.
+  If these conditions are not met, the result of these algorithm calls is undefined.
+- Incorrect results may be produced by ``exclusive_scan``, ``inclusive_scan``, ``transform_exclusive_scan``,
+  ``transform_inclusive_scan``, ``exclusive_scan_by_segment``, ``inclusive_scan_by_segment``, ``reduce_by_segment``
+  with ``unseq`` or ``par_unseq`` policy when compiled by Intel® oneAPI DPC++/C++ Compiler 2024.1 or earlier
+  with ``-fiopenmp``, ``-fiopenmp-simd``, ``-qopenmp``, ``-qopenmp-simd`` options on Linux.
+  To avoid the issue, pass ``-fopenmp`` or ``-fopenmp-simd`` option instead.
 
 New in 2022.9.0
 ===============
@@ -71,8 +124,6 @@ See oneDPL Guide for other `restrictions and known limitations`_.
   To avoid the issue, pass ``-fopenmp`` or ``-fopenmp-simd`` option instead.
 - With libstdc++ version 10, the compilation error *SYCL kernel cannot use exceptions* occurs
   when calling the range-based ``adjacent_find``, ``is_sorted`` or ``is_sorted_until`` algorithms with device policies.
-- The range-based ``count_if`` may produce incorrect results on Intel® Data Center GPU Max Series when the driver version
-  is "Rolling 2507.12" and newer.
 
 New in 2022.8.0
 ===============