v3.5-rc
This is a release candidate for oneDNN v3.5. Please provide feedback and submit defect reports via GitHub issues.
Performance Optimizations
- Intel Architecture Processors:
- Improved performance for 4th generation Intel Xeon Scalable processors (formerly Sapphire Rapids).
- Improved performance for the future Intel Xeon Scalable processors (code-named Sierra Forest and Granite Rapids).
- Improved performance of group normalization primitive.
- Improved performance of matmul primitive with sum post-op for batched cases on processors with Intel AMX instruction set support.
- Improved performance of the following subgraphs with Graph API (a hedged sketch of expressing an SDPA-style subgraph follows at the end of this section):
  - Multi-Query Attention (MQA).
  - Scaled Dot Product Attention (SDPA), including the variant with `select` operation.
  - `LayerNorm` + `Multiply` + `Quantize` produced by the SmoothQuant algorithm.
  - `Convolution` + `Sigmoid` + `Multiply` with mixed precisions.
- Intel Graphics Products:
- Improved performance for Processor Graphics based on Xe2 architecture.
- Improved performance for the Intel Data Center GPU Max Series (formerly Ponte Vecchio).
- Improved performance for Intel Arc graphics (formerly Alchemist and DG2) and the Intel Data Center GPU Flex Series (formerly Arctic Sound).
- Improved RNN primitive performance for the LSTM cell case.
- Improved performance of f8_e4m3 data type emulation on Intel Data Center GPU Max Series (formerly Ponte Vecchio).
- AArch64-based Processors:
- Improved convolution forward propagation, matmul, and softmax performance for processors with SVE support.
- Improved bf16 matmul performance with Arm Compute Library (ACL).
- Improved eltwise primitive performance for the `gelu_erf` algorithm with ACL.
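The Graph API fusions listed above are driven by pattern matching on a user-built graph. Below is a minimal, hedged sketch of how an SDPA-style subgraph might be expressed so the library can match and fuse it. The tensor shapes, ids, data types, op names, and the explicit scale-by-divide step are illustrative assumptions, not the exact pattern definition; see the graph examples shipped with oneDNN for the authoritative version.

```cpp
#include <vector>
#include "oneapi/dnnl/dnnl_graph.hpp"

using namespace dnnl::graph;

int main() {
    using dt = logical_tensor::data_type;
    using lt = logical_tensor::layout_type;

    // Illustrative shapes: {batch, heads, sequence, head_size} and the score tensor.
    const logical_tensor::dims qkv_shape = {1, 16, 384, 64};
    const logical_tensor::dims score_shape = {1, 16, 384, 384};

    size_t id = 0;
    logical_tensor query {id++, dt::bf16, qkv_shape, lt::strided};
    logical_tensor key {id++, dt::bf16, qkv_shape, lt::strided};
    logical_tensor value {id++, dt::bf16, qkv_shape, lt::strided};
    logical_tensor scale {id++, dt::bf16, logical_tensor::dims {1}, lt::strided};
    logical_tensor scores {id++, dt::bf16, score_shape, lt::strided};
    logical_tensor scaled {id++, dt::bf16, score_shape, lt::strided};
    logical_tensor probs {id++, dt::bf16, score_shape, lt::strided};
    logical_tensor context {id++, dt::bf16, qkv_shape, lt::strided};

    // Q x K^T producing the attention scores.
    op qk {id++, op::kind::MatMul, {query, key}, {scores}, "qk_matmul"};
    qk.set_attr<bool>(op::attr::transpose_b, true);

    // Scale the scores (assumed to be expressed as a Divide by sqrt(head_size)).
    op scale_div {id++, op::kind::Divide, {scores, scale}, {scaled}, "scale_div"};

    // Softmax over the last dimension of the 4-D score tensor.
    op softmax_op {id++, op::kind::SoftMax, {scaled}, {probs}, "softmax"};
    softmax_op.set_attr<int64_t>(op::attr::axis, 3);

    // Probabilities x V producing the attention output.
    op pv {id++, op::kind::MatMul, {probs, value}, {context}, "pv_matmul"};

    graph g {dnnl::engine::kind::cpu};
    for (auto *o : {&qk, &scale_div, &softmax_op, &pv}) g.add_op(*o);
    g.finalize();

    // A single returned partition indicates the whole subgraph was matched for fusion;
    // each partition is then compiled and executed through the usual Graph API flow.
    auto partitions = g.get_partitions();
    return partitions.size() == 1 ? 0 : 1;
}
```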
Functionality
- Introduced support for sum and binary post-ops in the layer normalization primitive (a hedged sketch follows this list). This functionality is currently implemented on CPUs only.
- Introduced support for the int4 data type and extended the quantization model with support for grouped scales and zero points.
- Introduced fp64 matmul support. This functionality is currently implemented on Intel GPUs only.
- Extended the floating-point math mode API to support weight decompression scenarios. See the matmul weights decompression example to get started (a second hedged sketch also follows this list). The new mode is supported in the following configurations:
  - bfloat16 matmul with int8 weights on Intel CPUs.
  - float16 and bfloat16 matmul with int8 or int4 weights on Intel GPUs.
- [experimental] Introduced microkernel API for Intel Architecture Processors. This API exposes internal mechanisms used in matmul and convolution implementations to expert users.
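The first sketch below shows the new layer normalization post-ops in use: a `sum` post-op accumulating into the destination followed by a `binary` multiply. The shapes, data types, and the choice of a broadcasted multiplicand are illustrative assumptions; whether a particular post-op chain is supported is decided at primitive descriptor creation time.

```cpp
#include "oneapi/dnnl/dnnl.hpp"
using namespace dnnl;

int main() {
    engine eng(engine::kind::cpu, 0);

    // 3-D activations {batch, tokens, channels}; normalization is over the last dim.
    const memory::dims shape = {8, 128, 768};
    memory::desc src_md(shape, memory::data_type::f32, memory::format_tag::abc);
    memory::desc dst_md(shape, memory::data_type::f32, memory::format_tag::abc);
    // Second input of the binary post-op, broadcast over batch and tokens (assumption).
    memory::desc mul_md({1, 1, 768}, memory::data_type::f32, memory::format_tag::abc);

    post_ops po;
    po.append_sum(1.0f); // accumulate into the existing dst contents
    po.append_binary(algorithm::binary_mul, mul_md); // then multiply elementwise
    primitive_attr attr;
    attr.set_post_ops(po);

    auto lnorm_pd = layer_normalization_forward::primitive_desc(eng,
            prop_kind::forward_inference, src_md, dst_md, 1.e-5f,
            normalization_flags::use_scale | normalization_flags::use_shift, attr);
    auto lnorm = layer_normalization_forward(lnorm_pd);
    // Execution takes DNNL_ARG_SRC/DST/SCALE/SHIFT plus
    // DNNL_ARG_ATTR_MULTIPLE_POST_OP(1) | DNNL_ARG_SRC_1 for the binary input.
    return 0;
}
```

The second sketch ties together the int4/grouped-quantization and weight-decompression items: a matmul that keeps weights in int4 with grouped scales and zero points while computing in f16 via the extended fpmath-mode attribute. The shapes, group size, GPU engine choice, and the scale/zero-point data types are assumptions; the matmul weights decompression example shipped with oneDNN is the authoritative reference.

```cpp
#include "oneapi/dnnl/dnnl.hpp"
using namespace dnnl;

int main() {
    // Assumption: an Intel GPU is present; per the notes, the f16 + int4 configuration
    // targets GPUs, while bf16 + int8 weights is the CPU configuration.
    engine eng(engine::kind::gpu, 0);

    const memory::dim M = 32, K = 4096, N = 4096;
    const memory::dim G = 128; // quantization group size along K (assumption)

    memory::desc src_md({M, K}, memory::data_type::f16, memory::format_tag::ab);
    memory::desc wei_md({K, N}, memory::data_type::s4, memory::format_tag::any);
    memory::desc dst_md({M, N}, memory::data_type::f16, memory::format_tag::ab);

    primitive_attr attr;
    // Let the matmul up-convert the integer weights and compute in f16.
    attr.set_fpmath_mode(fpmath_mode::f16, /*apply_to_int=*/true);
    // Grouped weight scales/zero points: one value per G elements along K for each
    // output channel; the mask covers both weight dimensions. Data types are assumptions.
    attr.set_scales(DNNL_ARG_WEIGHTS, (1 << 0) | (1 << 1), {G, 1},
            memory::data_type::f16);
    attr.set_zero_points(DNNL_ARG_WEIGHTS, (1 << 0) | (1 << 1), {G, 1},
            memory::data_type::s8);

    auto pd = matmul::primitive_desc(eng, src_md, wei_md, dst_md, attr);
    auto prim = matmul(pd);
    // Query pd.weights_desc() for the chosen int4 layout, create memories for src, dst,
    // weights, scales (DNNL_ARG_ATTR_SCALES | DNNL_ARG_WEIGHTS) and zero points,
    // then prim.execute(...) on a stream.
    return 0;
}
```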
Usability
- Extended error messages for engine and memory object creation errors.
- Extended verbose mode diagnostics with information on dispatching decisions for all primitives.
- Introduced support for clang++ host compiler in SYCL builds.
- Introduced API for tensor serialization and deserialization.
- Extended verbose mode diagnostics for Graph API with information on pattern matcher decisions.
- Introduced OpenCL runtime support for Graph API.
- Added support for building oneDNN against a pre-installed Arm Compute Library (ACL).
Validation
- Extended benchdnn with support for tensor tags in RNN primitive validation.
Thanks to these Contributors
This release contains contributions from the project core team as well as @AngryLoki, Crefeda Rodrigues @cfRod, Daniel Richard G. @iskunk, @deepeshfujitsu, Dylan Angus @dylan-angus-codeplay, Emanuele Rocca @ema, Hernan Martinez @hmartinez82, John Osorio @kala855, Jonathan Deakin @jondea, @kasturedeeksha, Kentaro Kawakami @kawakami-k, Nikita Shulga @malfet, Radu Salavat @Radu2k, Renato Barros Arantes @renato-arantes, Roman Zhukov @rozhukov, Shreyas-fuj @Shreyas-fuj, Sunita Nadampalli @snadampal, Tadej Ciglarič @t4c1, Vineel Abhinav @vineelabhinav, @vishwascm. We would also like to thank everyone who asked questions and reported issues.