@@ -12,9 +12,13 @@ is an implementation of the Decoupled Look-back [#fnote1]_ scan algorithm.
1212
1313The algorithm is designed to be compatible with a variety of devices that provide at least parallel
1414forward progress guarantees between work-groups, due to cross-work-group communication. Additionally, it
15- requires support for device USM (Unified Shared Memory). It has been verified to be compatible
16- with `Intel® Data Center GPU Max Series
17- <https://www.intel.com/content/www/us/en/products/details/discrete-gpus/data-center-gpu/max-series/products.html> `_.
15+ requires support for device USM (Unified Shared Memory) and sub-group size of 32. It has been verified to be compatible
16+ with `Intel® Data Center GPU Max 1100
17+ <https://www.intel.com/content/www/us/en/products/sku/232876/intel-data-center-gpu-max-1100/specifications.html> `_
18+ , `Intel® Data Center GPU Max 1550
19+ <https://www.intel.com/content/www/us/en/products/sku/232873/intel-data-center-gpu-max-1550/specifications.html> `_
20+ , and `Intel® Arc™ B580 Graphics
21+ <https://www.intel.com/content/www/us/en/products/sku/241598/intel-arc-b580-graphics/specifications.html> `_.
1822
1923A synopsis of the ``inclusive_scan `` function is provided below:
2024
@@ -69,7 +73,8 @@ Parameters
6973
7074**Type Requirements **:
7175
72- - The element type of sequence to scan must be a 32-bit or 64-bit bit C++ integral or floating-point type.
76+ - The element type of sequence to scan must be an 8-bit, 16-bit, 32-bit, or 64-bit C++ integral or floating-point
77+ type.
7378- The result is non-deterministic if the binary operator is non-associative (such as in floating-point addition)
7479 or non-commutative.
7580
@@ -81,9 +86,6 @@ Parameters
8186 - The function is intended to be asynchronous, but in some cases, the function will not return until the algorithm fully completes.
8287 Although intended in the future to be an asynchronous call, the algorithm is currently synchronous.
8388 - The SYCL device associated with the provided queue must support 64-bit atomic operations if the element type is 64-bits.
84- - There must be a known identity value for the provided combination of the element type and the binary operation. That is,
85- ``sycl::has_known_identity_v `` must evaluate to true. Such operators are listed in
86- the `SYCL 2020 specification <https://registry.khronos.org/SYCL/specs/sycl-2020/html/sycl-2020.html#table.identities >`_.
8789
8890Return Value
8991------------
@@ -145,18 +147,19 @@ inclusive_scan Example
145147Memory Requirements
146148-------------------
147149
148- The algorithm uses global and local device memory (see `SYCL 2020 Specification
150+ The algorithm uses global, local, and private device memory (see `SYCL 2020 Specification
149151<https://registry.khronos.org/SYCL/specs/sycl-2020/html/sycl-2020.html#_sycl_device_memory_model> `__)
150152for intermediate data storage. For the algorithm to operate correctly, there must be enough memory on the device.
151153If there is not enough global device memory, a ``std::bad_alloc `` exception is thrown.
152- The behavior is undefined if there is not enough local memory.
153- The amount of memory that is required depends on input data and configuration parameters, as described below.
154+ The behavior is undefined if there is not enough local memory. If there is insufficient private register memory, then
155+ algorithmic performance will degrade. The amount of memory that is required depends on input data and configuration
156+ parameters, as described below.
154157
155158Global Memory Requirements
156159--------------------------
157160
158161Global memory is used for copying the input sequence and storing internal data such as status flags.
159- The used amount depends on many parameters; below is an approximation in bytes:
162+ The used amount depends on many parameters; below is an upper bound approximation in bytes:
160163
1611642 * V * N \ :sub: `flags` + 4 * N \ :sub: `flags`
162165
@@ -174,11 +177,19 @@ It can be approximated by dividing the number of input elements N by the product
174177Local Memory Requirements
175178-------------------------
176179
177- Local memory is used for storing elements of the input that are to be scanned by a single work-group.
178- The used amount is denoted as N\ :sub: `elems_per_workgroup`, which equals to ``sizeof(key_type) * param.data_per_workitem * param.workgroup_size ``.
180+ Local memory is used for storing partial scan computations per sub-group in a work-group.
181+ The used amount is denoted as N\ :sub: `sub_group_carries`, which equals ``sizeof(key_type) * param.workgroup_size / sub_group_size ``
182+ where ``sub_group_size `` is the size of the sub-group currently fixed to 32.
179183
180- Some amount of local memory is also used by the calls to SYCL's group reduction and group scan. The amount of memory used particularly
181- for these calls is implementation dependent.
184+ Private Memory Requirements
185+ ---------------------------
186+
187+ The implementation is most performant when all private memory is allocated to registers and does not spill into global
188+ memory scratch space reserved for the kernel. The amount of private memory used per work-group is ``V * W * D + ε ``
189+ where V is the number of bytes needed to store the input value type, W is ``param.workgroup_size ``, D is
190+ ``param.data_per_workitem ``, and ε is the remaining private memory used by local variables and the binary operation. ε
191+ is expected to carry a small footprint in most common use cases. If the binary operation uses many registers, then the
192+ impact of ε may be of greater significance.
182193
183194-----------------------------------------
184195Recommended Settings for Best Performance
@@ -195,6 +206,12 @@ The initial configuration may be selected according to these high-level guidelin
195206 compute cores is key for better performance. To allow sufficient work to satisfy all
196207 X\ :sup: `e`-cores [#fnote2 ]_ on a GPU, use ``param.data_per_workitem * param.workgroup_size ≈ N / xe_core_count ``.
197208
209+ - For large inputs that fully saturate compute cores, maximizing ``param.workgroup_size `` and ``param.data_per_workitem ``
210+ without spilling out of register memory results in best performance. The Intel® oneAPI DPC++ Compiler reports warnings
211+ when register spillage occurs. This may be used alongside guidance provided in the
212+ `oneAPI GPU Optimization Guide <https://www.intel.com/content/www/us/en/docs/oneapi/optimization-guide-gpu/2025-2/registers-and-performance.html >`_
213+ and benchmarking parameter sweeps to determine performant kernel template parameters for your use case.
214+
198215- On devices with multiple tiles, it may prove beneficial to experiment with different tile hierarchies as described
199216 in `Options for using a GPU Tile Hierarchy <https://www.intel.com/content/www/us/en/developer/articles/technical/flattening-gpu-tile-hierarchy.html >`_.
200217
0 commit comments