
Commit 0a1f0f1

vgvozdeva, Copilot, vpirogov, TaoLv, and ElaineBao authored

doc: oneDNN v3.10 release notes

Signed-off-by: Siddhartha Menon <[email protected]>
Co-authored-by: Copilot <[email protected]>
Co-authored-by: Vadim Pirogov <[email protected]>
Co-authored-by: Tao Lv <[email protected]>
Co-authored-by: YixinBao <[email protected]>
Co-authored-by: Primak, Tatyana <[email protected]>
Co-authored-by: Siddhartha Menon <[email protected]>

1 parent c111beb commit 0a1f0f1

1 file changed: RELEASE_NOTES.md (+108, -0)
# Performance Optimizations
## Intel Architecture Processors
* Improved performance on future Intel Xeon processors with Intel AVX10.2 and Intel AMX instruction sets support.
  This functionality is not dispatched by default and requires opt-in with the environment
  variable `ONEDNN_MAX_CPU_ISA=AVX10_2_512_AMX_2` (see the opt-in sketch after this list).
* Improved performance on future Intel Core processors with Intel AVX10.2 instruction set support. This functionality
  is not dispatched by default and requires opt-in with the environment variable `ONEDNN_MAX_CPU_ISA=AVX10_2_512`.
* Improved performance of matmul primitive on processors with Intel AMX support.
* Improved performance of `f32` matmul primitive for GEMV cases on processors with Intel AVX2 instruction
  set support.
* Improved matmul performance with `int4` and `int8` compressed weights and per-channel zero-points.
* Improved `f32` matmul performance with `int4` and `int8` compressed weights on processors with Intel AVX2 and
  Intel AVX-512 instruction set support (see the compressed-weights sketch after this list).
* Improved `bf16` matmul performance with `int4` and `int8` compressed weights on processors with Intel AVX-512,
  Intel DL Boost, and bfloat16 instruction set support.
* Improved performance of `int8` convolution primitive when using zero-points.
* Improved performance of `int8` matmul and inner product primitives with `fp16` destination.
* Improved performance of `f32` and `bf16` convolution primitives with `int8` destination.
* Improved performance of RNN primitive on processors with Intel AVX2 instruction set support when using the OpenMP runtime.
* Improved performance of subgraphs containing a sequence of multiple binary ops with Graph API.
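
As referenced in the first two items above, here is a minimal sketch of the AVX10.2 opt-in done programmatically rather than from the shell. It assumes a POSIX environment (for `setenv`); the variable must be set before the first oneDNN call, since the dispatched ISA is fixed at first use, and on hardware without AVX10.2 support the setting is ignored in favor of the best available ISA.

```cpp
#include <cstdlib>
#include <iostream>

#include "oneapi/dnnl/dnnl.hpp"

int main() {
    // Equivalent to `ONEDNN_MAX_CPU_ISA=AVX10_2_512_AMX_2 ./app`.
    // Must happen before any oneDNN functionality is used.
    setenv("ONEDNN_MAX_CPU_ISA", "AVX10_2_512_AMX_2", /*overwrite=*/1);

    // The first use of the library locks in the effective ISA.
    dnnl::engine eng(dnnl::engine::kind::cpu, 0);

    // Report what was actually dispatched (numeric enum value).
    std::cout << "effective CPU ISA: "
              << static_cast<int>(dnnl::get_effective_cpu_isa()) << "\n";
    return 0;
}
```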
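Likewise, a minimal sketch of the compressed-weights case referenced above: an `f32` matmul consuming `int8` weights with per-output-channel scales. The shapes and the scale mask are illustrative assumptions; the relevant attributes are `set_fpmath_mode` with `apply_to_int = true` (enables integer weight decompression) and `set_scales_mask` on the weights.

```cpp
#include "oneapi/dnnl/dnnl.hpp"

using tag = dnnl::memory::format_tag;
using dt = dnnl::memory::data_type;

int main() {
    dnnl::engine eng(dnnl::engine::kind::cpu, 0);
    dnnl::stream strm(eng);

    const dnnl::memory::dim M = 4, K = 512, N = 256; // illustrative shapes

    // f32 activations and destination; int8 compressed weights.
    dnnl::memory::desc src_md({M, K}, dt::f32, tag::ab);
    dnnl::memory::desc wei_md({K, N}, dt::s8, tag::ab);
    dnnl::memory::desc dst_md({M, N}, dt::f32, tag::ab);
    dnnl::memory::desc scale_md({N}, dt::f32, tag::a);

    dnnl::primitive_attr attr;
    // Keep f32 math but allow it to consume integer weights (decompression).
    attr.set_fpmath_mode(dnnl::fpmath_mode::f32, /*apply_to_int=*/true);
    // One scale per output channel: the mask selects the N dimension of {K, N}.
    attr.set_scales_mask(DNNL_ARG_WEIGHTS, 1 << 1);

    dnnl::matmul::primitive_desc pd(eng, src_md, wei_md, dst_md, attr);
    dnnl::matmul mm(pd);

    dnnl::memory src(src_md, eng), wei(wei_md, eng), dst(dst_md, eng);
    dnnl::memory scales(scale_md, eng);

    mm.execute(strm,
            {{DNNL_ARG_SRC, src}, {DNNL_ARG_WEIGHTS, wei},
                    {DNNL_ARG_DST, dst},
                    {DNNL_ARG_ATTR_SCALES | DNNL_ARG_WEIGHTS, scales}});
    strm.wait();
    return 0;
}
```
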
## Intel Graphics Products
* Improved GEMM performance for small batch sizes on Intel Core Ultra processors (Series 2) (formerly Lunar Lake).
* Improved matmul performance for Qwen2-7B shapes on Intel Arc graphics (formerly Alchemist) and
  Intel Arc Graphics for Intel Core Ultra processors (formerly Arrow Lake-H).
* Improved `int8` matmul performance with `int4` weights and per-tensor zero-points.
* Improved `bf16` matmul performance with `fp8` weights.
* Graph API optimizations:
  * Improved [Scaled Dot Product Attention (SDPA)] subgraph performance for inference when relaxed accumulation mode
    is enabled on Intel Core Ultra processors (formerly Meteor Lake).
  * Improved SDPA and GQA subgraph performance when using host-side scalars.
  * Improved performance of GQA subgraphs for 2nd token scenarios.
  * Improved performance of subgraphs containing a sequence of multiple binary ops.
  * Improved performance of [Grouped Query Attention (GQA)] subgraphs for training forward and backward propagation.

[Grouped Query Attention (GQA)]: https://uxlfoundation.github.io/oneDNN/v3.10/dev_guide_graph_gqa.html#gqa-for-training-forward-propagation
[Scaled Dot Product Attention (SDPA)]: https://uxlfoundation.github.io/oneDNN/v3.10/dev_guide_graph_sdpa.html

## AArch64-based Processors
* Improved reorder primitive performance.
* Improved `bf16` convolution performance.
* Improved convolution performance on CPUs with 128-bit SVE support.
* Improved eltwise primitive performance on Arm(R) Neoverse(TM) N1 processors.

# Functionality
## Functional API
* Introduced [host-side scalar memory objects]. This functionality allows passing host-side scalars instead of device
  memory objects when using oneDNN with OpenCL or SYCL runtimes. Host-side scalars are currently supported in matmul
  and convolution primitives on Intel GPUs.
* Introduced support for pre-computed reductions in matmul primitive. This functionality is intended to improve
  performance in the case of `int8` activations and `int8` weights with zero-points.

[host-side scalar memory objects]: https://uxlfoundation.github.io/oneDNN/v3.10/dev_guide_host_side_scalars.html

## Graph API
* Introduced [`host_scalar` property] for logical tensors. This functionality allows passing host-side scalars instead
  of device memory objects when using oneDNN with OpenCL or SYCL runtimes. Host-side scalars are currently supported to
  define the attention scale, sequence length, and negative infinity value in SDPA/GQA subgraphs (see the first
  sketch after this list).
* Introduced [accumulation mode attribute] support in `Matmul` op. This attribute allows relaxing `fp32` accumulation
  requirements to achieve performance benefits on some platforms (see the second sketch after this list).

[`host_scalar` property]: https://uxlfoundation.github.io/oneDNN/v3.10/enum_dnnl_graph_logical_tensor_property_type.html
[accumulation mode attribute]: https://uxlfoundation.github.io/oneDNN/v3.10/dev_guide_op_matmul.html
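
First, a minimal Graph API sketch of a `host_scalar` logical tensor used as the SDPA attention scale, as referenced above. This is a sketch under assumptions: the tensor IDs and shapes are illustrative, the scalar is represented here as a 0-dimensional tensor, and the `property_type::host_scalar` enum value follows the property-type documentation linked above.

```cpp
#include "oneapi/dnnl/dnnl_graph.hpp"

using namespace dnnl::graph;

int main() {
    const auto dt = logical_tensor::data_type::f32;
    const auto strided = logical_tensor::layout_type::strided;

    // Q*K^T scores feeding the scale multiplication (shapes illustrative).
    logical_tensor score(0, dt, {1, 16, 384, 384}, strided);
    // Attention scale passed as a host-side scalar instead of a device buffer.
    logical_tensor scale(1, dt, logical_tensor::dims{}, strided,
            logical_tensor::property_type::host_scalar);
    logical_tensor scaled(2, dt, {1, 16, 384, 384}, strided);

    op mul(0, op::kind::Multiply, {score, scale}, {scaled}, "scale_mul");

    graph g(dnnl::engine::kind::gpu);
    g.add_op(mul);
    g.finalize(); // ready for partitioning and compilation
    return 0;
}
```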
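Second, the relaxed-accumulation sketch referenced above, shown through the primitive-level attribute rather than the Graph API op attribute (the exact op-attribute spelling is in the `Matmul` op documentation linked above). Shapes and data types are illustrative assumptions.

```cpp
#include "oneapi/dnnl/dnnl.hpp"

using tag = dnnl::memory::format_tag;
using dt = dnnl::memory::data_type;

int main() {
    dnnl::engine eng(dnnl::engine::kind::cpu, 0);

    const dnnl::memory::dim M = 128, K = 1024, N = 1024; // illustrative

    dnnl::memory::desc src_md({M, K}, dt::bf16, tag::ab);
    dnnl::memory::desc wei_md({K, N}, dt::bf16, tag::ab);
    dnnl::memory::desc dst_md({M, N}, dt::bf16, tag::ab);

    dnnl::primitive_attr attr;
    // Let the implementation relax the default f32 accumulation
    // where the platform offers a faster lower-precision path.
    attr.set_accumulation_mode(dnnl::accumulation_mode::relaxed);

    // Creation may still pick strict accumulation (or fail on hardware
    // without bf16 support); this only grants the implementation leeway.
    dnnl::matmul::primitive_desc pd(eng, src_md, wei_md, dst_md, attr);
    dnnl::matmul mm(pd);
    return 0;
}
```
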
## Intel Graphics Products
* Introduced support for `fp4` weights in matmul primitive.
* Introduced support for weight scales and zero-points with group size 16 in matmul with compressed weights.

## Intel Architecture Processors
* Introduced `fp4` weights support for `fp32` matmul and convolution for future Intel Xeon processors with
  Intel AVX10.2 instruction set support.

# Usability
* Extended diagnostics available in verbose mode for primitive descriptor creation issues.
* Extended dispatch diagnostics in verbose mode output for primitive implementations on Intel GPUs.

# Known Limitations
* Convolution primitive may require an excessive amount of scratchpad memory for shapes with large input width values on Intel CPUs.
* `bf16` convolution primitive has a performance regression on Intel Arc B-series graphics.
* Reduction primitive may produce incorrect results for tensors exceeding 4 GB on Intel Arc graphics (formerly DG2) and Intel Arc Graphics for Intel Core Ultra processors (formerly Arrow Lake-H).
* Concat primitive may produce incorrect results for certain shapes on Intel Arc A-series graphics.
* `fp16` matmul primitive has a performance regression on Intel GPUs based on Xe2 architecture.
* `f32` matmul primitive may sporadically produce incorrect results on Intel Arc B-series graphics.
* `int8` inner product primitive with tensors exceeding 4 GB in size may produce incorrect results on Intel Datacenter GPU Max Series.
* `bf16` layer normalization backpropagation may produce incorrect results on Intel Datacenter GPU Max Series.

# Deprecated Functionality
* [BLAS-like API] including `dnnl::sgemm`, `dnnl::gemm_u8s8s32`, and `dnnl::gemm_s8s8s32` functions is deprecated
  and will be removed in future releases. If you are using this API, consider switching to the [matmul primitive]
  (see the migration sketch after the links below).

[BLAS-like API]: https://uxlfoundation.github.io/oneDNN/v3.10/group_dnnl_api_blas.html
[matmul primitive]: https://uxlfoundation.github.io/oneDNN/v3.10/dev_guide_matmul.html
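
As referenced above, a minimal sketch of migrating an `sgemm` call to the matmul primitive. The shapes are illustrative, row-major, with no transposes, alpha = 1, and beta = 0; alpha/beta scaling would need primitive attributes and post-ops, which this sketch omits.

```cpp
#include <vector>

#include "oneapi/dnnl/dnnl.hpp"

using tag = dnnl::memory::format_tag;
using dt = dnnl::memory::data_type;

int main() {
    const dnnl::memory::dim M = 64, K = 128, N = 32; // illustrative shapes
    std::vector<float> A(M * K, 1.f), B(K * N, 1.f), C(M * N, 0.f);

    // Deprecated BLAS-like call: C = 1.0 * A * B + 0.0 * C (row-major).
    dnnl::sgemm('N', 'N', M, N, K, 1.f, A.data(), K, B.data(), N, 0.f,
            C.data(), N);

    // Equivalent matmul primitive on a CPU engine.
    dnnl::engine eng(dnnl::engine::kind::cpu, 0);
    dnnl::stream strm(eng);

    dnnl::memory::desc a_md({M, K}, dt::f32, tag::ab);
    dnnl::memory::desc b_md({K, N}, dt::f32, tag::ab);
    dnnl::memory::desc c_md({M, N}, dt::f32, tag::ab);

    // Wrap the existing buffers without copying.
    dnnl::memory a_m(a_md, eng, A.data()), b_m(b_md, eng, B.data()),
            c_m(c_md, eng, C.data());

    dnnl::matmul::primitive_desc pd(eng, a_md, b_md, c_md);
    dnnl::matmul(pd).execute(strm,
            {{DNNL_ARG_SRC, a_m}, {DNNL_ARG_WEIGHTS, b_m},
                    {DNNL_ARG_DST, c_m}});
    strm.wait();
    return 0;
}
```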

# Breaking Changes
## AArch64-based Processors
* Bumped the minimum required [Arm(R) Compute Library](https://github.com/ARM-software/ComputeLibrary) version to 52.4.0.

# Thanks to our Contributors
This release contains contributions from the [project core team] as well as Andrei Hutu @Anndrey24,
Anna Sztukowska @asztukow, Arseniy Obolenskiy @aobolensk, Avanish Tiwari @Tiwari-Avanish, Daniel Kuts @apach301,
Daniel Whittaker @danwhittaker-arm, Deeksha Kasture @kasturedeeksha, George Nash @georgen117,
Henry Gardiner @henry-gar, Keanu Czirjak @keanucz, Krishna Sai @krishnasai-mcw,
Marek Michalowski @michalowski-arm, Sheldon Robinson @sheldonrobinson, @Shreyas-fuj, Viktoriia Gvozdeva @vgvozdeva,
Xiang1 Guo, Yejing Lai @Yejing-Lai, Yonghao Gu, Yusuf Butt @UseTheForce007, Zhibo Li @zhili03, @almayne, @co63oc,
@focusunsink, @gassan-arm, @jstachowintel, @pmanczak, @puneetmatharu, @raistefintel, @vishwascm, @vyevtyus, @zhangfeiv0,
@zhangjian29, and @xiazhuozhao.

[project core team]: https://github.com/uxlfoundation/oneDNN/blob/rls-v3.10/MAINTAINERS.md
