Commit d2a5702

vgvozdeva, Rbiessy, vpirogov, Copilot, and ElaineBao authored
doc: oneDNN v3.9 release notes
Co-authored-by: Romain Biessy <[email protected]>
Co-authored-by: Vadim Pirogov <[email protected]>
Co-authored-by: Copilot <[email protected]>
Co-authored-by: YixinBao <[email protected]>
Co-authored-by: Primak, Tatyana <[email protected]>
Co-authored-by: Chereshnev, Eugene <[email protected]>
Co-authored-by: Lv, Tao A <[email protected]>
Co-authored-by: Dmitry Zarukin <[email protected]>
1 parent fb9f3e5 commit d2a5702

1 file changed: 94 additions, 0 deletions
RELEASE_NOTES.md

# Performance Optimizations

## Intel Architecture Processors

* Introduced initial support for future Intel Xeon processors with Intel AVX 10.2 and Intel AMX instruction set support. This functionality is not dispatched by default and requires opt-in with the environment variable `ONEDNN_MAX_CPU_ISA=AVX10_2_512_AMX_2` (see the sketch after this list).
* Introduced initial support for future Intel Core processors with Intel AVX 10.2 instruction set support. This functionality is not dispatched by default and requires opt-in with the environment variable `ONEDNN_MAX_CPU_ISA=AVX10_2_512`.
* Improved initialization time of the convolution primitive when a large number of threads is used by introducing a new thread partition estimation and adjusting several blocking parameters.
* Improved performance of `fp8` convolution primitive with scales and `bf16` output.
* Improved performance of matmul primitive with post-ops on processors with Intel AMX support.
* Improved performance of RNN primitive for LBR_GRU and VANILLA_LSTM cell types on processors with Intel AVX2 instruction set support.
* Improved performance of the following subgraphs with Graph API:
  * [Scaled Dot Product Attention (SDPA)] with implicit causal mask.
  * [Grouped Query Attention (GQA)] flavor specific to GEMMA models.

[Scaled Dot Product Attention (SDPA)]: https://uxlfoundation.github.io/oneDNN/v3.9/dev_guide_graph_sdpa.html
[Grouped Query Attention (GQA)]: https://uxlfoundation.github.io/oneDNN/v3.9/dev_guide_graph_gqa.html

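A minimal sketch of the opt-in, assuming a POSIX environment (`setenv`; on Windows use `_putenv_s`). The ISA limit is consulted when the library dispatches its first primitive, so it must be set before any oneDNN call:

```cpp
#include <cstdlib> // setenv (POSIX)

int main() {
    // Opt in to the AVX 10.2 + Intel AMX code paths before the first
    // oneDNN call; the limit is read once, at initial dispatch.
    setenv("ONEDNN_MAX_CPU_ISA", "AVX10_2_512_AMX_2", /*overwrite=*/1);

    // ... create dnnl::engine and primitives as usual ...
    return 0;
}
```

The same limit can also be set programmatically through `dnnl::set_max_cpu_isa()`; the exact `cpu_isa` enumerators for the new ISAs are best checked against the v3.9 headers.
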
## Intel Graphics Products

* Improved performance on Intel GPUs based on Xe3 architecture.
* Improved matmul performance on Intel Arc Graphics for Intel Core Ultra processors (Series 2) (formerly code-named Lunar Lake).
* Improved RNN primitive performance with LBR_GRU cell type.
* Improved `int8` convolution performance with plain weights and trivial filter.
* Improved convolution performance with `NCHW` activations with 1x1 filter and unit strides.
* Improved `fp32` softmax performance.
* Improved performance of reorder when used with USM host memory (see the sketch after this list).
* Improved performance of the following subgraphs with Graph API:
  * `fp32` SDPA with implicit causal mask.
  * `fp16` SDPA on Intel GPUs without Intel XMX cores.

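For context, a minimal sketch of wrapping a USM host allocation in a oneDNN memory object and reordering it to device memory through the SYCL interop API; the shape and `f32` type are illustrative:

```cpp
#include <sycl/sycl.hpp>
#include "oneapi/dnnl/dnnl.hpp"
#include "oneapi/dnnl/dnnl_sycl.hpp"
using namespace dnnl;

int main() {
    engine eng(engine::kind::gpu, 0);
    stream s(eng);
    auto dev = sycl_interop::get_device(eng);
    auto ctx = sycl_interop::get_context(eng);

    memory::desc md({1024, 1024}, memory::data_type::f32,
            memory::format_tag::ab);

    // USM host memory: visible to the GPU, resident in system RAM.
    float *host = sycl::malloc_host<float>(1024 * 1024, ctx);
    // USM device memory as the reorder destination.
    float *dev_buf = sycl::malloc_device<float>(1024 * 1024, dev, ctx);

    memory src = sycl_interop::make_memory(md, eng,
            sycl_interop::memory_kind::usm, host);
    memory dst = sycl_interop::make_memory(md, eng,
            sycl_interop::memory_kind::usm, dev_buf);

    reorder(reorder::primitive_desc(src, dst)).execute(s, src, dst);
    s.wait();

    sycl::free(host, ctx);
    sycl::free(dev_buf, ctx);
    return 0;
}
```
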
## AArch64-based Processors

* Improved `int8` convolution performance.
* Improved `bf16` depthwise convolution performance.
* Improved `f16` matmul performance with Arm Compute Library (ACL).

# Functionality

## Functional API

* Introduced [Root Mean Square Normalization (RMSNorm) mode] for the layer normalization primitive. This functionality is optimized for Intel CPUs and Intel GPUs (see the sketch after this list).
* Sparse memory objects and sparse matmul are promoted to production status.

[Root Mean Square Normalization (RMSNorm) mode]: https://uxlfoundation.github.io/oneDNN/v3.9/dev_guide_layer_normalization.html#root-mean-square-normalization-mode

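A minimal sketch of requesting the new mode, assuming it is selected through a dedicated `normalization_flags` bit (spelled `rms_norm` below for illustration; check the v3.9 headers and the linked guide for the exact enumerator):

```cpp
#include "oneapi/dnnl/dnnl.hpp"
using namespace dnnl;

int main() {
    engine eng(engine::kind::cpu, 0);

    // 32 rows normalized over the last axis of size 512.
    memory::desc src_md({32, 512}, memory::data_type::f32,
            memory::format_tag::ab);

    // RMSNorm is layer normalization without mean subtraction; the flag
    // name below is an assumption -- consult dnnl.hpp in v3.9.
    auto flags = normalization_flags::rms_norm | normalization_flags::use_scale;

    layer_normalization_forward::primitive_desc pd(eng,
            prop_kind::forward_inference, src_md, src_md, 1e-5f, flags);

    // ... create memory objects (including DNNL_ARG_SCALE) and execute ...
    return 0;
}
```
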
## Graph API

* Introduced support for tanh approximation in the [`GELU`] operation (see the sketch after this list).
* Extended the Graph API [`SoftMax`] operation to support an optional `stats` output.
* Introduced fusion support for SDPA training forward and backward propagation.
* Introduced fusion support for SDPA with bottom-right implicit causal mask.
* Introduced the `make_scalar_tensor()` API for engine-agnostic scalar tensor creation.

[`GELU`]: https://uxlfoundation.github.io/oneDNN/v3.9/dev_guide_op_gelu.html
[`SoftMax`]: https://uxlfoundation.github.io/oneDNN/v3.9/dev_guide_op_softmax.html

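A minimal Graph API sketch of a standalone GELU op; the attribute spelling `op::attr::mode` / `"gelu_tanh"` used to select the tanh approximation is an assumption, so check the linked [`GELU`] documentation for the exact attribute name and values:

```cpp
#include <string>
#include "oneapi/dnnl/dnnl_graph.hpp"
using namespace dnnl::graph;

int main() {
    // f32 activation of shape {8, 1024}, strided layout.
    logical_tensor src {0, logical_tensor::data_type::f32, {8, 1024},
            logical_tensor::layout_type::strided};
    logical_tensor dst {1, logical_tensor::data_type::f32, {8, 1024},
            logical_tensor::layout_type::strided};

    op gelu {2, op::kind::GELU, {src}, {dst}, "gelu"};
    // Assumed attribute for the tanh approximation -- verify against the
    // v3.9 GELU op documentation.
    gelu.set_attr<std::string>(op::attr::mode, "gelu_tanh");

    graph g {engine::kind::cpu};
    g.add_op(gelu);
    g.finalize();
    auto partitions = g.get_partitions(); // compile and execute per partition
    return 0;
}
```
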
## Microkernel API

* Introduced support for `fp8` data type.

## Intel Architecture Processors

* Introduced support for the select algorithm in binary post-op.
* Introduced source, destination, and weight scales support in `fp8` convolution and deconvolution primitives (see the sketch after this list).

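A minimal sketch of attaching the scales to an `fp8` convolution via primitive attributes; the shapes, the e4m3 variant, and the `bf16` destination are illustrative:

```cpp
#include "oneapi/dnnl/dnnl.hpp"
using namespace dnnl;

int main() {
    engine eng(engine::kind::cpu, 0);

    // fp8 (e4m3) convolution with bf16 destination; shapes are illustrative.
    memory::desc src_md({1, 64, 28, 28}, memory::data_type::f8_e4m3,
            memory::format_tag::any);
    memory::desc wei_md({128, 64, 3, 3}, memory::data_type::f8_e4m3,
            memory::format_tag::any);
    memory::desc dst_md({1, 128, 28, 28}, memory::data_type::bf16,
            memory::format_tag::any);

    primitive_attr attr;
    attr.set_scales_mask(DNNL_ARG_SRC, 0);     // one scale for the whole tensor
    attr.set_scales_mask(DNNL_ARG_WEIGHTS, 1); // one scale per output channel

    convolution_forward::primitive_desc pd(eng, prop_kind::forward_inference,
            algorithm::convolution_direct, src_md, wei_md, memory::desc(),
            dst_md, /*strides=*/{1, 1}, /*padding_l=*/{1, 1},
            /*padding_r=*/{1, 1}, attr);

    // Scale values are passed at execution time via
    // DNNL_ARG_ATTR_SCALES | DNNL_ARG_SRC and
    // DNNL_ARG_ATTR_SCALES | DNNL_ARG_WEIGHTS.
    return 0;
}
```
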
## Intel Graphics Products

* Introduced support for the select algorithm in binary primitive (see the sketch after this list).

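A minimal sketch of the ternary select, which computes `dst[i] = cond[i] ? src0[i] : src1[i]`; the `u8` condition type and the three-source constructor shown here are assumptions to verify against the binary primitive documentation:

```cpp
#include "oneapi/dnnl/dnnl.hpp"
using namespace dnnl;

int main() {
    engine eng(engine::kind::gpu, 0); // select in binary primitive on Intel GPUs

    memory::desc src0_md({8, 128}, memory::data_type::f32, memory::format_tag::ab);
    memory::desc src1_md({8, 128}, memory::data_type::f32, memory::format_tag::ab);
    // Condition tensor; the u8 element type is an assumption -- see the
    // binary primitive documentation for supported condition data types.
    memory::desc cond_md({8, 128}, memory::data_type::u8, memory::format_tag::ab);
    memory::desc dst_md({8, 128}, memory::data_type::f32, memory::format_tag::ab);

    binary::primitive_desc pd(eng, algorithm::binary_select,
            src0_md, src1_md, cond_md, dst_md);
    // Execute with DNNL_ARG_SRC_0, DNNL_ARG_SRC_1, DNNL_ARG_SRC_2, DNNL_ARG_DST.
    return 0;
}
```
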
## Generic GPU Vendor

* Introduced support for RNN Vanilla backward propagation.

# Usability

* Enabled build with the `-Wundef` compiler flag.
* [Experimental] Introduced support for kernel compilation with the [SYCL kernel compiler] extension.

[SYCL kernel compiler]: https://github.com/intel/llvm/blob/sycl/sycl/doc/extensions/experimental/sycl_ext_oneapi_kernel_compiler.asciidoc

# Validation

* Improved benchdnn performance by optimizing the input data filling and test results comparison steps.
* Improved benchdnn graph driver performance mode by adding a CPU memory pool to the allocator.

# Known Limitations

* Group normalization with `normalization_flags::use_scale` specified produces incorrect results for backward propagation in oneDNN v3.9 and earlier.
* Binary primitive with certain shapes and Graph API SDPA with bottom-right causal mask may hang with the SYCL debug runtime on Windows.
* `fp8` matmul primitive may sporadically produce incorrect results on Intel Arc B-series graphics.
* `int8` inner product primitive with tensors exceeding 4 GB in size may produce incorrect results on Intel Datacenter GPU Max Series.
* `bf16` pooling with tensors exceeding 4 GB in size may produce incorrect results on Intel Datacenter GPU Max Series.
* `bf16`/`fp16` matmul with large inner dimension has a performance regression on Intel Datacenter GPU Max Series.
* `bf16`/`fp16` convolution with `NCHW` activations has a performance regression on Intel Datacenter GPU Max Series.
* Softmax with non-trivial strides and blocked format may produce incorrect results.
* `bf16` layer normalization backpropagation may produce incorrect results on Intel Datacenter GPU Max Series.

# Deprecated Functionality

* The [BLAS-like API], including the `dnnl::sgemm`, `dnnl::gemm_u8s8s32`, and `dnnl::gemm_s8s8s32` functions, is deprecated and will be removed in a future release. If you are using this API, consider switching to the [matmul primitive] (see the migration sketch below).

[BLAS-like API]: https://uxlfoundation.github.io/oneDNN/v3.8/group_dnnl_api_blas.html
[matmul primitive]: https://uxlfoundation.github.io/oneDNN/v3.8/dev_guide_matmul.html

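A minimal migration sketch for the common `dnnl::sgemm` case (row-major, no transposes, `alpha = 1`, `beta = 0`) on the CPU engine; the `sgemm_via_matmul` helper name is hypothetical:

```cpp
#include "oneapi/dnnl/dnnl.hpp"
using namespace dnnl;

// Replaces dnnl::sgemm('N', 'N', M, N, K, 1.f, A, K, B, N, 0.f, C, N):
// C[MxN] = A[MxK] * B[KxN], all row-major f32.
void sgemm_via_matmul(engine &eng, stream &s, memory::dim M, memory::dim N,
        memory::dim K, float *A, float *B, float *C) {
    memory::desc a_md({M, K}, memory::data_type::f32, memory::format_tag::ab);
    memory::desc b_md({K, N}, memory::data_type::f32, memory::format_tag::ab);
    memory::desc c_md({M, N}, memory::data_type::f32, memory::format_tag::ab);

    // Wrap the user buffers without copying (CPU engine).
    memory a_mem(a_md, eng, A), b_mem(b_md, eng, B), c_mem(c_md, eng, C);

    matmul::primitive_desc pd(eng, a_md, b_md, c_md);
    matmul(pd).execute(s, {{DNNL_ARG_SRC, a_mem}, {DNNL_ARG_WEIGHTS, b_mem},
            {DNNL_ARG_DST, c_mem}});
    s.wait();
}
```

Non-default `alpha` and `beta` map to primitive attributes: a scale on the source and a sum post-op on the destination, respectively.
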
# Thanks to our Contributors

This release contains contributions from the [project core team] as well as Aditya Tewari @aditew01, Alexander Simonov @asimonov1, @Anallear, Anna Sztukowska @asztukow, Avanish Tiwari @Tiwari-Avanish, Dmitriy Ovchinnikov @inteldimitrius, Kasture Deeksha, Krishna Sai @krishnasai-mcw, Manaal @manaalmj, Marek Michalowski @michalowski-arm, Orel Yehuda @yehudaorel, Ruqiu Cao @rcao8, Tsao Zhong @CaoZhongZ, Viktoriia Gvozdeva @vgvozdeva, Yair Obodovsky @yair-obodovsky, Ye Tao @taoye9, Yuanyuan Chen @cyyever, @gausah-arm, @karmeh01, @pmanczak, and @zhangfeiv0. We would also like to thank everyone who asked questions and reported issues.

[project core team]: https://github.com/uxlfoundation/oneDNN/blob/rls-v3.9/MAINTAINERS.md
