# Performance Optimizations
## Intel Architecture Processors
* Improved performance on future Intel Xeon processors with Intel AVX 10.2 and Intel AMX instruction sets support.
  This functionality is not dispatched by default and requires opt-in with the environment
  variable `ONEDNN_MAX_CPU_ISA=AVX10_2_512_AMX_2` (see the sketch after this list).
* Improved performance on future Intel Core processors with Intel AVX 10.2 instruction set support. This functionality
  is not dispatched by default and requires opt-in with the environment variable `ONEDNN_MAX_CPU_ISA=AVX10_2_512`.
* Improved performance of matmul primitive on processors with Intel AMX support.
* Improved performance of `f32` matmul primitive for GEMV cases on processors with Intel AVX2 instruction
  set support.
* Improved matmul performance with `int4` and `int8` compressed weights and per-channel zero-points.
* Improved `f32` matmul performance with `int4` and `int8` compressed weights on processors with Intel AVX2 and
  Intel AVX-512 instruction set support.
* Improved `bf16` matmul performance with `int4` and `int8` compressed weights on processors with Intel AVX-512,
  Intel DL Boost, and bfloat16 instruction set support.
* Improved performance of `int8` convolution primitive when using zero points.
* Improved performance of `int8` matmul and inner product primitives with `fp16` destination.
* Improved performance of `f32` and `bf16` convolution primitive with `int8` destination.
* Improved performance of RNN primitive on processors with Intel AVX2 instruction set support when using the OpenMP runtime.
* Improved performance of subgraphs containing a sequence of multiple binary ops with Graph API.

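The ISA opt-in can be applied either through the environment variable above or programmatically before the first
primitive is created. A minimal sketch: `dnnl::set_max_cpu_isa` is the long-standing service function, while the
`dnnl::cpu_isa::avx10_2_512` enumerator name is an assumption inferred from the documented environment variable value.

```cpp
#include <dnnl.hpp>

int main() {
    // Opt in to the AVX 10.2 code paths before any primitive is created;
    // this mirrors exporting ONEDNN_MAX_CPU_ISA=AVX10_2_512 in the
    // environment. NOTE: cpu_isa::avx10_2_512 is an assumed enumerator
    // name matching the environment variable value documented above.
    dnnl::set_max_cpu_isa(dnnl::cpu_isa::avx10_2_512);

    dnnl::engine eng(dnnl::engine::kind::cpu, 0);
    // ... create and execute primitives as usual ...
    return 0;
}
```
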
## Intel Graphics Products
* Improved GEMM performance for small batch sizes on Intel Core Ultra processors (Series 2) (formerly Lunar Lake).
* Improved matmul performance for Qwen2-7B shapes on Intel Arc graphics (formerly Alchemist) and
  Intel Arc Graphics for Intel Core Ultra processors (formerly Arrow Lake-H).
* Improved `int8` matmul performance with `int4` weights and per-tensor zero-points.
* Improved `bf16` matmul performance with `fp8` weights.
* Graph API optimizations:
  * Improved [Scaled Dot Product Attention (SDPA)] subgraph performance for inference when relaxed accumulation mode
    is enabled on Intel Core Ultra processors (formerly Meteor Lake).
  * Improved SDPA and GQA subgraph performance when using host-side scalars.
  * Improved performance of GQA subgraph for 2nd token scenarios.
  * Improved performance of subgraphs containing a sequence of multiple binary ops.
  * Improved performance of [Grouped Query Attention (GQA)] subgraphs for training forward and backward propagation.

[Grouped Query Attention (GQA)]: https://uxlfoundation.github.io/oneDNN/v3.10/dev_guide_graph_gqa.html#gqa-for-training-forward-propagation
[Scaled Dot Product Attention (SDPA)]: https://uxlfoundation.github.io/oneDNN/v3.10/dev_guide_graph_sdpa.html

## AArch64-based Processors
* Improved reorder primitive performance.
* Improved `bf16` convolution performance.
* Improved convolution performance on CPUs with 128-bit SVE support.
* Improved eltwise primitive performance on Arm(R) Neoverse(TM) N1 processors.

# Functionality
## Functional API
* Introduced [host-side scalar memory objects]. This functionality allows passing host-side scalars instead of device
  memory objects when using oneDNN with OpenCL or SYCL runtimes. Host-side scalars are currently supported in matmul
  and convolution primitives on Intel GPUs.
* Introduced support for pre-computed reductions in matmul primitive. This functionality is intended to improve
  performance for `int8` activations and `int8` weights with zero-points (see the sketch below).

[host-side scalar memory objects]: https://uxlfoundation.github.io/oneDNN/v3.10/dev_guide_host_side_scalars.html

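A minimal sketch of the quantization setup the pre-computed reductions target, using only the established attribute
API; how the pre-computed reduction itself is supplied at execution time is described in the v3.10 matmul
documentation and is deliberately not spelled out here.

```cpp
#include <dnnl.hpp>
using namespace dnnl;

int main() {
    engine eng(engine::kind::cpu, 0);

    const memory::dim M = 128, K = 512, N = 256;
    memory::desc src_md({M, K}, memory::data_type::s8, memory::format_tag::ab);
    memory::desc wei_md({K, N}, memory::data_type::s8, memory::format_tag::any);
    memory::desc dst_md({M, N}, memory::data_type::s32, memory::format_tag::ab);

    // int8 activations and int8 weights with a per-tensor source zero-point:
    // the case the new pre-computed reductions are intended to speed up.
    // With a src zero-point the library needs per-column reductions of the
    // weights over K, which can now be supplied pre-computed instead of
    // being recomputed inside the primitive.
    primitive_attr attr;
    attr.set_zero_points_mask(DNNL_ARG_SRC, /*mask=*/0);

    matmul::primitive_desc pd(eng, src_md, wei_md, dst_md, attr);
    // At execution, the zero-point itself is passed via the runtime
    // argument DNNL_ARG_ATTR_ZERO_POINTS | DNNL_ARG_SRC.
    return 0;
}
```
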
## Graph API
* Introduced the [`host_scalar` property] for logical tensors. This functionality allows passing host-side scalars
  instead of device memory objects when using oneDNN with OpenCL or SYCL runtimes. Host-side scalars can currently be
  used to define the attention scale, sequence length, and negative infinity value in SDPA/GQA subgraphs.
* Introduced [accumulation mode attribute] support in the `Matmul` op. This attribute allows relaxing `f32`
  accumulation requirements to achieve performance benefits on some platforms (see the sketch below).

[`host_scalar` property]: https://uxlfoundation.github.io/oneDNN/v3.10/enum_dnnl_graph_logical_tensor_property_type.html
[accumulation mode attribute]: https://uxlfoundation.github.io/oneDNN/v3.10/dev_guide_op_matmul.html

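The new `Matmul` op attribute mirrors the accumulation mode control already exposed through the primitive attribute
API. A minimal primitive-level sketch of the equivalent setting (the Graph API spelling is given in the linked op
documentation):

```cpp
#include <dnnl.hpp>
using namespace dnnl;

int main() {
    engine eng(engine::kind::cpu, 0);

    const memory::dim M = 32, K = 64, N = 32;
    memory::desc src_md({M, K}, memory::data_type::bf16, memory::format_tag::ab);
    memory::desc wei_md({K, N}, memory::data_type::bf16, memory::format_tag::ab);
    memory::desc dst_md({M, N}, memory::data_type::bf16, memory::format_tag::ab);

    // Allow the library to deviate from strict f32 accumulation when a
    // faster lower-precision accumulation path is available.
    primitive_attr attr;
    attr.set_accumulation_mode(accumulation_mode::relaxed);

    matmul::primitive_desc pd(eng, src_md, wei_md, dst_md, attr);
    return 0;
}
```
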
## Intel Graphics Products
* Introduced support for `fp4` weights in matmul primitive.
* Introduced support for weight scales and zero-points with group size 16 in matmul with compressed weights
  (see the sketch below).

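A sketch of requesting group-size-16 weight scales and zero-points through the existing grouped quantization
attributes. Shapes and data types are illustrative, and the exact supported combinations are listed in the matmul
documentation; `s4` weights follow the int4 compressed-weights path.

```cpp
#include <dnnl.hpp>
using namespace dnnl;

int main() {
    engine eng(engine::kind::gpu, 0); // this feature targets Intel GPUs

    const memory::dim M = 1, K = 4096, N = 4096;
    memory::desc src_md({M, K}, memory::data_type::f16, memory::format_tag::ab);
    memory::desc wei_md({K, N}, memory::data_type::s4, memory::format_tag::any);
    memory::desc dst_md({M, N}, memory::data_type::f16, memory::format_tag::ab);

    primitive_attr attr;
    // Decompress the int4 weights in f16 arithmetic.
    attr.set_fpmath_mode(fpmath_mode::f16, /*apply_to_int=*/true);
    // One f16 scale and one s8 zero-point per group of 16 elements along K.
    attr.set_scales(DNNL_ARG_WEIGHTS, (1 << 0) + (1 << 1), {16, 1},
            memory::data_type::f16);
    attr.set_zero_points(DNNL_ARG_WEIGHTS, (1 << 0) + (1 << 1), {16, 1},
            memory::data_type::s8);

    matmul::primitive_desc pd(eng, src_md, wei_md, dst_md, attr);
    return 0;
}
```
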
## Intel Architecture Processors
* Introduced `fp4` weights support for `f32` matmul and convolution for future Intel Xeon processors with
  Intel AVX 10.2 instruction set support.

# Usability
* Extended diagnostics available in verbose mode for primitive descriptor creation issues.
* Extended dispatch diagnostics in verbose mode output for primitive implementations on Intel GPUs (see the sketch
  below).

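Both kinds of diagnostics surface through the existing verbose controls, so no rebuild is needed. A minimal sketch;
the `dispatch` filter value of `ONEDNN_VERBOSE` is assumed to be available in this build.

```cpp
#include <dnnl.hpp>

int main() {
    // Equivalent to running with ONEDNN_VERBOSE=2; primitive creation and
    // dispatching diagnostics are printed to stdout. Narrower filters such
    // as ONEDNN_VERBOSE=dispatch are set through the environment variable.
    dnnl::set_verbose(2);

    dnnl::engine eng(dnnl::engine::kind::cpu, 0);
    // ... primitive creation attempts are now logged ...
    return 0;
}
```
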
# Known Limitations
* Convolution primitive may require an excessive amount of scratchpad memory for shapes with a large input width value
  on Intel CPUs.
* `bf16` convolution primitive has a performance regression on Intel Arc B-series graphics.
* Reduction primitive may produce incorrect results for tensors exceeding 4 GB on Intel Arc graphics (formerly DG2)
  and Intel Arc Graphics for Intel Core Ultra processors (formerly Arrow Lake-H).
* Concat primitive may produce incorrect results for certain shapes on Intel Arc A-series graphics.
* `fp16` matmul primitive has a performance regression on Intel GPUs based on the Xe2 architecture.
* `f32` matmul primitive may sporadically produce incorrect results on Intel Arc B-series graphics.
* `int8` inner product primitive with tensors exceeding 4 GB in size may produce incorrect results on Intel Data Center
  GPU Max Series.
* `bf16` layer normalization backpropagation may produce incorrect results on Intel Data Center GPU Max Series.

# Deprecated Functionality
* The [BLAS-like API], including the `dnnl::sgemm`, `dnnl::gemm_u8s8s32`, and `dnnl::gemm_s8s8s32` functions, is
  deprecated and will be removed in a future release. If you are using this API, consider switching to the
  [matmul primitive] (see the migration sketch below).

[BLAS-like API]: https://uxlfoundation.github.io/oneDNN/v3.10/group_dnnl_api_blas.html
[matmul primitive]: https://uxlfoundation.github.io/oneDNN/v3.10/dev_guide_matmul.html

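A migration sketch from the deprecated `dnnl::sgemm` to the matmul primitive for a plain row-major `f32` GEMM;
shapes are illustrative.

```cpp
#include <vector>
#include <dnnl.hpp>
using namespace dnnl;

// C = A * B, replacing the deprecated BLAS-like call with the matmul primitive.
int main() {
    const memory::dim M = 64, K = 128, N = 32;
    std::vector<float> A(M * K, 1.f), B(K * N, 1.f), C(M * N, 0.f);

    // Deprecated call being replaced:
    // dnnl::sgemm('N', 'N', M, N, K, 1.f, A.data(), K, B.data(), N,
    //             0.f, C.data(), N);

    engine eng(engine::kind::cpu, 0);
    stream s(eng);

    // Row-major layouts matching the sgemm call above.
    memory::desc a_md({M, K}, memory::data_type::f32, memory::format_tag::ab);
    memory::desc b_md({K, N}, memory::data_type::f32, memory::format_tag::ab);
    memory::desc c_md({M, N}, memory::data_type::f32, memory::format_tag::ab);

    // CPU memory objects can wrap the existing buffers directly.
    memory A_m(a_md, eng, A.data());
    memory B_m(b_md, eng, B.data());
    memory C_m(c_md, eng, C.data());

    matmul::primitive_desc pd(eng, a_md, b_md, c_md);
    matmul(pd).execute(s, {{DNNL_ARG_SRC, A_m},
                           {DNNL_ARG_WEIGHTS, B_m},
                           {DNNL_ARG_DST, C_m}});
    s.wait();
    return 0;
}
```
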
# Breaking Changes
## AArch64-based Processors
* Bumped the minimum required [Arm(R) Compute Library](https://github.com/ARM-software/ComputeLibrary) version
  to 52.4.0.

# Thanks to our Contributors
This release contains contributions from the [project core team] as well as Andrei Hutu @Anndrey24,
Anna Sztukowska @asztukow, Arseniy Obolenskiy @aobolensk, Avanish Tiwari @Tiwari-Avanish, Daniel Kuts @apach301,
Daniel Whittaker @danwhittaker-arm, Deeksha Kasture @kasturedeeksha, George Nash @georgen117,
Henry Gardiner @henry-gar, Keanu Czirjak @keanucz, Krishna Sai @krishnasai-mcw,
Marek Michalowski @michalowski-arm, Sheldon Robinson @sheldonrobinson, @Shreyas-fuj, Viktoriia Gvozdeva @vgvozdeva,
Xiang1 Guo, Yejing Lai @Yejing-Lai, Yonghao Gu, Yusuf Butt @UseTheForce007, Zhibo Li @zhili03, @almayne, @co63oc,
@focusunsink, @gassan-arm, @jstachowintel, @pmanczak, @puneetmatharu, @raistefintel, @vishwascm, @vyevtyus, @zhangfeiv0,
@zhangjian29, and @xiazhuozhao.

[project core team]: https://github.com/uxlfoundation/oneDNN/blob/rls-v3.10/MAINTAINERS.md