v3.10-rc
## Performance Optimizations
### Intel Architecture Processors
- Improved performance on future Intel Xeon processors with Intel AVX 10.2 and Intel AMX instruction sets support. This functionality is not dispatched by default and requires opt-in with the environment variable `ONEDNN_MAX_CPU_ISA=AVX10_2_512_AMX_2` (see the sketch after this list).
- Improved performance on future Intel Core processors with Intel AVX 10.2 instruction set support. This functionality is not dispatched by default and requires opt-in with the environment variable `ONEDNN_MAX_CPU_ISA=AVX10_2_512`.
- Improved performance of matmul primitive on processors with Intel AMX support.
- Improved performance of `f32` matmul primitive for GEMV cases on processors with Intel AVX2 instruction set support.
- Improved matmul performance with `int4` and `int8` compressed weights and per-channel zero-points.
- Improved `f32` matmul performance with `int4` and `int8` compressed weights on processors with Intel AVX2 and Intel AVX-512 instruction set support.
- Improved `bf16` matmul performance with `int4` and `int8` compressed weights on processors with Intel AVX-512, Intel DL Boost, and bfloat16 instruction set support.
- Improved performance of `int8` convolution primitive when using zero points.
- Improved performance of `int8` matmul and inner product primitives with `fp16` destination.
- Improved performance of `f32` and `bf16` convolution primitives with `int8` destination.
- Improved performance of RNN primitive on processors with Intel AVX2 instruction set support when using OpenMP runtime.
- Improved performance of subgraphs containing a sequence of multiple binary ops with Graph API.
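
For reference, the opt-in can also be done from code as long as it happens before the first oneDNN call, since `ONEDNN_MAX_CPU_ISA` is read when dispatch is first resolved. A minimal sketch, assuming a POSIX environment (`setenv`); exporting the variable in the shell is equivalent:

```cpp
// Sketch: opt into AVX10.2 + AMX dispatch before any oneDNN call is made.
#include <cstdlib>
#include <iostream>

#include "oneapi/dnnl/dnnl.hpp"

int main() {
    // Must run before the first oneDNN API call: the effective ISA limit
    // is resolved once, when dispatch first happens.
    setenv("ONEDNN_MAX_CPU_ISA", "AVX10_2_512_AMX_2", /*overwrite=*/1);

    // Query the effective limit to confirm the opt-in took effect.
    dnnl::cpu_isa isa = dnnl::get_effective_cpu_isa();
    std::cout << "effective ISA enum value: " << static_cast<int>(isa) << "\n";
    return 0;
}
```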
### Intel Graphics Products
- Improved GEMM performance for small batch sizes on Intel Core Ultra processors (Series 2) (formerly Lunar Lake).
- Improved matmul performance for Qwen2-7B shapes on Intel Arc graphics (formerly DG2) and Intel Arc Graphics for Intel Core Ultra processors (formerly Arrow Lake-H).
- Improved `int8` matmul performance with `int4` weights and per-tensor zero-points (a related attribute sketch appears under Functionality below).
- Improved `bf16` matmul performance with `fp8` weights.
- Graph API optimizations:
  - Improved Scaled Dot Product Attention (SDPA) subgraph performance for inference when relaxed accumulation mode is enabled on Intel Core Ultra processors (formerly Meteor Lake).
  - Improved SDPA and Grouped Query Attention (GQA) subgraph performance when using host-side scalars.
  - Improved performance of GQA subgraph for 2nd token scenarios.
  - Improved performance of subgraphs containing a sequence of multiple binary ops.
  - Improved performance of GQA subgraphs for training forward and backward propagation.
### AArch64-based Processors
- Improved performance of reorder primitive.
- Improved performance of `bf16` convolutions.
- Improved performance of convolutions on 128-bit SVE platforms.
- Improved performance of eltwise primitive on Arm(R) Neoverse(TM) N1.
## Functionality
### Functional API
- Introduced host-side scalar memory objects. This functionality allows passing host-side scalars instead of device memory objects when using oneDNN with OpenCL or SYCL runtimes. Host-side scalars are currently supported in matmul and convolution primitives on Intel GPUs.
- Introduced support for pre-computed reductions in matmul primitive. This functionality is intended to improve performance in the case of `int8` activations and `int8` weights with zero-point (see the note after this list).
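
As a gloss on why a pre-computed reduction helps here (my reading, not text from the notes): when `int8` weights carry a zero point $z_w$, the product against `int8` activations $A$ decomposes so that the zero-point correction needs only the row sums of $A$:

$$
C = A\,(W_q - z_w \mathbf{1}\mathbf{1}^{\top}) = A\,W_q - z_w\, r\,\mathbf{1}^{\top},
\qquad r_m = \sum_k A_{mk}.
$$

Supplying $r$ pre-computed lets one reduction over the activations be shared by several matmuls over the same input (e.g., Q/K/V projections) instead of being recomputed inside each call.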
### Graph API
- Introduced `host_scalar` property for logical tensors. This functionality allows passing host-side scalars instead of device memory objects when using oneDNN with OpenCL or SYCL runtimes. Host-side scalars are currently supported to define the attention scale, sequence length, and negative infinity value in SDPA/GQA subgraphs.
- Introduced accumulation mode attribute support in `MatMul` op. This attribute allows relaxing `f32` accumulation requirements to achieve performance benefits on some platforms (see the sketch after this list).
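
A sketch of how these two additions fit together in code. The exact spellings `property_type::host_scalar`, `op::attr::accumulation_mode`, and the `"relaxed"` value are inferred from the feature names above and should be checked against `dnnl_graph.hpp`; shapes are illustrative:

```cpp
// Sketch: host-scalar attention scale plus a relaxed-accumulation MatMul
// in a Graph API program (enum spellings inferred from the release notes).
#include "oneapi/dnnl/dnnl_graph.hpp"
using namespace dnnl::graph;

void build_scaled_matmul(graph &g) {
    using lt = logical_tensor;

    lt q {0, lt::data_type::bf16, {1, 32, 128, 64}, lt::layout_type::strided};
    lt k {1, lt::data_type::bf16, {1, 32, 64, 128}, lt::layout_type::strided};
    lt qk {2, lt::data_type::bf16, {1, 32, 128, 128}, lt::layout_type::strided};

    // Attention scale passed from the host at execution time rather than
    // through a device memory object (0-dimensional host-side scalar).
    lt scale {3, lt::data_type::f32, lt::dims {}, lt::layout_type::strided,
            lt::property_type::host_scalar};
    lt scaled {4, lt::data_type::bf16, {1, 32, 128, 128},
            lt::layout_type::strided};

    op mm(5, op::kind::MatMul, {q, k}, {qk}, "qk_matmul");
    // Allow the implementation to relax f32 accumulation for speed.
    mm.set_attr<std::string>(op::attr::accumulation_mode, "relaxed");

    op mul(6, op::kind::Multiply, {qk, scale}, {scaled}, "apply_scale");

    g.add_op(mm);
    g.add_op(mul);
}
```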
### Intel Graphics Products
- Introduced support for `fp4` weights in matmul primitive.
- Introduced support for grouped quantization with group size 16 in matmul with `int8` compressed weights (see the sketch after this list).
- Introduced support for group size 16 for `int8` weights with regular weights decompression.
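
At the primitive level, compressed weights and grouped quantization surface through the quantization attributes. A minimal sketch, assuming `bf16` activations, group-size-16 `f16` scales along K, and an `s8` per-tensor zero-point (the mask/groups layout follows my reading of the attribute docs; all data types here are illustrative):

```cpp
// Sketch: matmul with int8 compressed weights, group-size-16 scales,
// and a per-tensor weights zero-point.
#include "oneapi/dnnl/dnnl.hpp"
using namespace dnnl;

matmul::primitive_desc make_compressed_matmul(
        const engine &eng, memory::dim M, memory::dim K, memory::dim N) {
    memory::desc src_md({M, K}, memory::data_type::bf16, memory::format_tag::ab);
    memory::desc wei_md({K, N}, memory::data_type::s8, memory::format_tag::ab);
    memory::desc dst_md({M, N}, memory::data_type::bf16, memory::format_tag::ab);

    primitive_attr attr;
    // Decompress the integer weights on the fly using bf16 math.
    attr.set_fpmath_mode(fpmath_mode::bf16, /*apply_to_int=*/true);
    // One f16 scale per group of 16 elements along K, per column along N.
    attr.set_scales(DNNL_ARG_WEIGHTS, /*mask=*/(1 << 0) | (1 << 1),
            /*groups=*/{16, 1}, memory::data_type::f16);
    // A single zero-point shared by the whole weights tensor (mask 0).
    attr.set_zero_points(DNNL_ARG_WEIGHTS, /*mask=*/0, /*groups=*/{},
            memory::data_type::s8);

    return matmul::primitive_desc(eng, src_md, wei_md, dst_md, attr);
}
```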
### Intel Architecture Processors
- Introduced `fp4` weights support for `f32` matmul and convolution for future Intel Xeon processors with Intel AVX 10.2 instruction set support.
## Usability
- Extended diagnostics available in verbose mode for primitive descriptor creation issues.
- Extended dispatch diagnostics in verbose mode output for primitive implementations on Intel GPUs (see the sketch after this list).
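
Both sets of diagnostics are part of the existing verbose facility, so no code changes are needed to get them; dispatch tracing for a run, for instance (a sketch assuming POSIX `setenv`; `ONEDNN_VERBOSE=dispatch ./app` in the shell is equivalent):

```cpp
// Sketch: enable dispatch diagnostics before the first oneDNN call.
#include <cstdlib>
#include "oneapi/dnnl/dnnl.hpp"

int main() {
    setenv("ONEDNN_VERBOSE", "dispatch", /*overwrite=*/1);
    // ... create and execute primitives: implementations that were
    // considered and skipped are reported along with the reason ...
    return 0;
}
```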
## Deprecated Functionality
- BLAS-like API including `dnnl::sgemm`, `dnnl::gemm_u8s8s32`, and `dnnl::gemm_s8s8s32` functions is deprecated and will be removed in future releases. If you are using this API, consider switching to the matmul primitive (a migration sketch follows below).
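
For migration, a `dnnl::sgemm` call maps onto a one-shot `f32` matmul roughly as below. A sketch assuming row-major layouts, `alpha = 1`, and `beta = 0`; persistent applications should create the primitive once and reuse it:

```cpp
// Sketch: C = A * B via the matmul primitive instead of dnnl::sgemm.
#include <unordered_map>

#include "oneapi/dnnl/dnnl.hpp"
using namespace dnnl;

void gemm_via_matmul(memory::dim M, memory::dim N, memory::dim K,
        const float *A, const float *B, float *C) {
    engine eng(engine::kind::cpu, 0);
    stream strm(eng);

    // Plain row-major ("ab") f32 tensors, matching sgemm's conventions.
    memory::desc a_md({M, K}, memory::data_type::f32, memory::format_tag::ab);
    memory::desc b_md({K, N}, memory::data_type::f32, memory::format_tag::ab);
    memory::desc c_md({M, N}, memory::data_type::f32, memory::format_tag::ab);

    // Wrap the user buffers without copying.
    memory a_m(a_md, eng, const_cast<float *>(A));
    memory b_m(b_md, eng, const_cast<float *>(B));
    memory c_m(c_md, eng, C);

    matmul::primitive_desc pd(eng, a_md, b_md, c_md);
    matmul(pd).execute(strm, {{DNNL_ARG_SRC, a_m},
            {DNNL_ARG_WEIGHTS, b_m}, {DNNL_ARG_DST, c_m}});
    strm.wait();
}
```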
## Breaking Changes
### AArch64-based Processors
- Bumped the minimum required Arm(R) Compute Library version to 52.4.0.
## Thanks to our Contributors
This release contains contributions from the project core team as well as Andrei Hutu @Anndrey24,
Anna Sztukowska @asztukow, Arseniy Obolenskiy @aobolensk, Avanish Tiwari @Tiwari-Avanish, Daniel Kuts @apach301, Daniel Whittaker @danwhittaker-arm, Deeksha Kasture @kasturedeeksha, George Nash @georgen117, Henry Gardiner @henry-gar, Keanu Czirjak @keanucz, Krishna Sai @krishnasai-mcw, Marek Michalowski @michalowski-arm, Sheldon Robinson @sheldonrobinson, @Shreyas-fuj, Viktoriia Gvozdeva @vgvozdeva, Xiang1 Guo, Yejing Lai @Yejing-Lai, Yonghao Gu, Yusuf Butt @UseTheForce007, Zhibo Li @zhili03, @almayne, @co63oc, @focusunsink, @gassan-arm, @jstachowintel, @pmanczak, @puneetmatharu, @raistefintel, @vishwascm, @vyevtyus, @zhangfeiv0, @zhangjian29, and @xiazhuozhao.