# Performance Optimizations
## Intel Architecture Processors
* Introduced initial support for future Intel Xeon processors with support for the Intel AVX 10.2 and Intel AMX instruction sets.
  This functionality is not dispatched by default and requires opt-in with environment variable `ONEDNN_MAX_CPU_ISA=AVX10_2_512_AMX_2`.
* Introduced initial support for future Intel Core processors with Intel AVX 10.2 instruction set support. This functionality is not dispatched by default and requires opt-in with environment variable `ONEDNN_MAX_CPU_ISA=AVX10_2_512`.
* Improved initialization time of the convolution primitive when a large number of threads is used by introducing a new thread partition estimation and adjusting several blocking parameters.
* Improved performance of `fp8` convolution primitive with scales and `bf16` output.
* Improved performance of matmul primitive with post-ops on processors with Intel AMX support.
* Improved performance of RNN primitive for LBR_GRU and VANILLA_LSTM cell types on processors with Intel AVX2 instruction set support.
* Improved performance of the following subgraphs with Graph API:
  * [Scaled Dot Product Attention (SDPA)] with implicit causal mask.
  * [Grouped Query Attention (GQA)] flavor specific to GEMMA models.

[Scaled Dot Product Attention (SDPA)]: https://uxlfoundation.github.io/oneDNN/v3.9/dev_guide_graph_sdpa.html
[Grouped Query Attention (GQA)]: https://uxlfoundation.github.io/oneDNN/v3.9/dev_guide_graph_gqa.html
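
Since the new ISA support is not dispatched by default, the opt-in is done through the environment, for example (`my_app` is a placeholder for any application linked against oneDNN):

```shell
# Opt in to AVX 10.2 + AMX dispatch on future Intel Xeon processors
ONEDNN_MAX_CPU_ISA=AVX10_2_512_AMX_2 ./my_app

# Opt in to AVX 10.2 dispatch on future Intel Core processors
ONEDNN_MAX_CPU_ISA=AVX10_2_512 ./my_app
```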

## Intel Graphics Products
* Improved performance on Intel GPUs based on Xe3 architecture.
* Improved matmul performance for Intel Arc Graphics for Intel Core Ultra processors (Series 2) (formerly Lunar Lake).
* Improved RNN primitive performance with LBR_GRU cell type.
* Improved `int8` convolution performance with plain weights and trivial filter.
* Improved convolution performance with `NCHW` activations, 1x1 filter, and unit strides.
* Improved `fp32` softmax performance.
* Improved performance of reorder when used with USM host memory.
* Improved performance of the following subgraphs with Graph API:
  * `fp32` SDPA with implicit causal mask.
  * `fp16` SDPA on Intel GPUs without Intel XMX cores.

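For reference, the SDPA pattern targeted by these fusions computes `softmax(Q·Kᵀ/√d)·V`, where an implicit causal mask restricts position `i` to attend only to positions `j ≤ i`. A minimal pure-Python sketch of that computation (a reference of the math only, not oneDNN API code) might look like:

```python
import math

def sdpa_causal(Q, K, V):
    # Scaled dot-product attention with an implicit causal mask:
    # position i may only attend to positions j <= i.
    n, d = len(Q), len(Q[0])
    out = []
    for i in range(n):
        # Scores only over the causally visible positions j <= i.
        scores = [sum(Q[i][k] * K[j][k] for k in range(d)) / math.sqrt(d)
                  for j in range(i + 1)]
        m = max(scores)  # subtract the max for numerical stability
        w = [math.exp(s - m) for s in scores]
        z = sum(w)
        out.append([sum(w[j] * V[j][k] for j in range(i + 1)) / z
                    for k in range(d)])
    return out
```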
## AArch64-based Processors
* Improved `int8` convolution performance.
* Improved `bf16` depthwise convolution performance.
* Improved `fp16` matmul performance with Arm Compute Library (ACL).

# Functionality
## Functional API
* Introduced [Root Mean Square Normalization (RMSNorm) mode] for the layer normalization primitive. This functionality is optimized for Intel CPUs and Intel GPUs.
* Sparse memory objects and sparse matmul are promoted to production status.

[Root Mean Square Normalization (RMSNorm) mode]: https://uxlfoundation.github.io/oneDNN/v3.9/dev_guide_layer_normalization.html#root-mean-square-normalization-mode
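RMSNorm differs from standard layer normalization in that it scales by the root mean square of the elements without subtracting the mean. A minimal pure-Python reference of that computation (the common RMSNorm definition, not oneDNN API code; the `eps` placement is an assumption):

```python
import math

def rms_norm(x, scale, eps=1e-5):
    # Normalize by the root mean square of x; unlike standard
    # layer normalization, the mean is not subtracted.
    rms = math.sqrt(sum(v * v for v in x) / len(x) + eps)
    return [scale[i] * x[i] / rms for i in range(len(x))]
```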

## Graph API
* Introduced support for tanh approximation in the [`GELU`] operation.
* Extended Graph API [`Softmax`] operation to support optional `stats` output.
* Introduced fusion support for SDPA training forward and backward propagation.
* Introduced fusion support for SDPA with bottom-right implicit causal mask.
* Introduced `make_scalar_tensor()` API for engine-agnostic scalar tensor creation.

[`GELU`]: https://uxlfoundation.github.io/oneDNN/v3.9/dev_guide_op_gelu.html
[`Softmax`]: https://uxlfoundation.github.io/oneDNN/v3.9/dev_guide_op_softmax.html

## Microkernel API
* Introduced support for `fp8` data type.

## Intel Architecture Processors
* Introduced support for select algorithm in binary post-op.
* Introduced source, destination, and weight scales support in `fp8` convolution and deconvolution primitives.

## Intel Graphics Products
* Introduced support for select algorithm in binary primitive.

## Generic GPU Vendor
* Introduced support for RNN Vanilla backward propagation.

# Usability
* Enabled build with the `-Wundef` compiler flag.
* [Experimental] Introduced support for kernel compilation with the [SYCL kernel compiler] extension.

[SYCL kernel compiler]: https://github.com/intel/llvm/blob/sycl/sycl/doc/extensions/experimental/sycl_ext_oneapi_kernel_compiler.asciidoc

# Validation
* Improved benchdnn performance by optimizing the input data filling and result comparison steps.
* Improved benchdnn graph driver performance mode by adding a CPU memory pool to the allocator.

# Known Limitations
* Group normalization with `normalization_flags::use_scale` specified produces incorrect results for the backward propagation kind in oneDNN v3.9 and earlier.
* Binary primitive with certain shapes and Graph API SDPA with bottom-right causal mask may hang with the SYCL debug runtime on Windows.
* `fp8` matmul primitive may sporadically produce incorrect results on Intel Arc B-series graphics.
* `int8` inner product primitive with tensors exceeding 4 GB in size may produce incorrect results on Intel Datacenter GPU Max Series.
* `bf16` pooling with tensors exceeding 4 GB in size may produce incorrect results on Intel Datacenter GPU Max Series.
* `bf16`/`fp16` matmul with a large inner dimension has a performance regression on Intel Datacenter GPU Max Series.
* `bf16`/`fp16` convolution with `NCHW` activations has a performance regression on Intel Datacenter GPU Max Series.
* Softmax with non-trivial strides and blocked format may produce incorrect results.
* `bf16` layer normalization backpropagation may produce incorrect results on Intel Datacenter GPU Max Series.

# Deprecated Functionality
* [BLAS-like API] including `dnnl::sgemm`, `dnnl::gemm_u8s8s32`, and `dnnl::gemm_s8s8s32` functions is deprecated and will be removed in future releases. If you are using this API, consider switching to the [matmul primitive].

[BLAS-like API]: https://uxlfoundation.github.io/oneDNN/v3.8/group_dnnl_api_blas.html
[matmul primitive]: https://uxlfoundation.github.io/oneDNN/v3.8/dev_guide_matmul.html
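
For migration reference, the deprecated GEMM functions compute `C := alpha * op(A) * op(B) + beta * C`, which is the semantics a matmul-primitive replacement must reproduce. A pure-Python sketch of that computation (without transposition; a reference of the math only, not oneDNN API code) can help validate the migration:

```python
def gemm_ref(alpha, A, B, beta, C):
    # Reference semantics of the deprecated BLAS-like GEMM
    # (no transposes): C := alpha * A @ B + beta * C
    m, k, n = len(A), len(B), len(B[0])
    return [[alpha * sum(A[i][p] * B[p][j] for p in range(k)) + beta * C[i][j]
             for j in range(n)] for i in range(m)]
```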

# Thanks to our Contributors
This release contains contributions from the [project core team] as well as Aditya Tewari @aditew01, Alexander Simonov @asimonov1, @Anallear, Anna Sztukowska @asztukow, Avanish Tiwari @Tiwari-Avanish, Dmitriy Ovchinnikov @inteldimitrius, Kasture Deeksha, Krishna Sai @krishnasai-mcw, Manaal @manaalmj, Marek Michalowski @michalowski-arm, Orel Yehuda @yehudaorel, Ruqiu Cao @rcao8, Tsao Zhong @CaoZhongZ, Viktoriia Gvozdeva @vgvozdeva, Yair Obodovsky @yair-obodovsky, Ye Tao @taoye9, Yuanyuan Chen @cyyever, @gausah-arm, @karmeh01, @pmanczak, and @zhangfeiv0. We would also like to thank everyone who asked questions and reported issues.

[project core team]: https://github.com/uxlfoundation/oneDNN/blob/rls-v3.9/MAINTAINERS.md