v3.3
Performance Optimizations
- Intel Architecture Processors:
- Improved performance for 4th generation Intel Xeon Scalable processors (formerly Sapphire Rapids).
- Improved int8 convolution performance with zero points on processors with Intel AMX instruction set support.
- Improved performance for the future Intel Xeon Scalable processors (code-named Sierra Forest and Granite Rapids). This functionality is disabled by default and can be enabled via CPU dispatcher control.
- Improved fp32 and int8 convolution performance for cases with small numbers of input channels for processors with Intel AVX-512 and/or Intel AMX instruction set support.
- Improved s32 binary primitive performance.
- Improved fp16, fp32, and int8 convolution performance for processors with Intel AVX2 instruction set support.
- Improved performance of subgraphs with convolution, matmul, avgpool, maxpool, and softmax operations followed by unary or binary operations with Graph API.
- Improved performance of convolution for depthwise cases with Graph API.
- [experimental] Improved performance of LLAMA2 MLP block with Graph Compiler.
- Intel Graphics Products:
- Improved performance for the Intel Data Center GPU Max Series (formerly Ponte Vecchio).
- Improved performance for Intel Arc graphics (formerly Alchemist and DG2) and the Intel Data Center GPU Flex Series (formerly Arctic Sound-M).
- Reduced RNN primitive initialization time on Intel GPUs.
- AArch64-based Processors:
- Improved fp32 to bf16 reorder performance.
- Improved max pooling performance with Arm Compute Library (ACL).
- Improved dilated convolution performance for depthwise cases with ACL.
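The optimizations for future Intel Xeon Scalable processors are opt-in via CPU dispatcher control. A minimal sketch using the documented `ONEDNN_MAX_CPU_ISA` environment variable; the specific ISA token and application path below are placeholders for illustration, so check the dispatcher control documentation for the value matching your build:

```shell
# Raise the ISA ceiling oneDNN is allowed to dispatch to. The token here
# (AVX2_VNNI_2) is one example value; it is an assumption for illustration,
# not a recommendation for every target.
export ONEDNN_MAX_CPU_ISA=AVX2_VNNI_2

# Run the application with the new dispatch limit (placeholder path):
# ./your_onednn_app

echo "ONEDNN_MAX_CPU_ISA=$ONEDNN_MAX_CPU_ISA"
```

The variable must be set before the first oneDNN primitive is created; changing it afterwards has no effect.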
Functionality
- Introduced group normalization primitive support. The functionality is currently available on CPUs.
- Intel CPUs:
- Introduced support for zero points in int8 convolution with groups and 3D spatial.
Usability
- Extended verbose mode output:
- Improved diagnostics on engine creation errors.
- Added information on Graph API calls.
- Added information on strides for non-dense memory objects.
- Added values of runtime dimensions.
- Added indication that primitive descriptor was created with the `any` memory format tag.
- Introduced examples for Graph API.
- Graph API constant tensor cache is now disabled by default and requires opt-in with a `dnnl::graph::set_constant_tensor_cache()` call.
- Reduced oneDNN Graph API memory consumption in certain scenarios.
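The extended verbose output described above is available through the documented `ONEDNN_VERBOSE` control. A small sketch; the application path and log name are placeholders:

```shell
# Level 2 adds primitive creation information on top of execution traces,
# which surfaces the new diagnostics: engine creation errors, Graph API
# calls, strides of non-dense memory objects, and runtime dimension values.
export ONEDNN_VERBOSE=2

# Run the workload and capture the trace (placeholder path and file name):
# ./your_onednn_app 2> onednn_verbose.log

echo "ONEDNN_VERBOSE=$ONEDNN_VERBOSE"
```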
Validation
- Extended benchdnn performance reporting with primitive creation time.
- Introduced cold cache mode in benchdnn.
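The two benchdnn additions can be combined in one run: performance mode now reports primitive creation time, and cold cache mode evicts data from caches between measurements. A hedged sketch; the binary path, problem descriptor, and the `--cold-cache` value below are assumptions for illustration, so consult the benchdnn documentation for the supported option set:

```shell
# Placeholder path to a benchdnn binary from a local build:
BENCHDNN=./tests/benchdnn/benchdnn

# Performance mode (-mode=P) with cold cache enabled for a sample
# convolution problem. Printed rather than executed so the sketch does
# not depend on a local build being present:
echo "$BENCHDNN --conv --mode=P --cold-cache=all ic16ih7oc16oh7kh3ph1"

# Uncomment to run against a real build:
# $BENCHDNN --conv --mode=P --cold-cache=all ic16ih7oc16oh7kh3ph1
```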
Known Limitations
- Current GPU OpenCL runtime for Linux has an issue resulting in convolution producing incorrect results on integrated GPUs based on Xe architecture. SYCL configuration is not affected.
- Pooling, resampling, prelu, batch normalization, layer normalization, and eltwise primitives may sporadically produce incorrect results on Intel Arc GPUs on Windows.
- Current GPU driver for Linux has an issue resulting in program hangs or crashes when oneDNN primitives are executed concurrently on the Intel Data Center GPU Max Series.
- Extensive use of RNN primitive on Intel GPUs with default primitive cache setting may lead to a device reboot. Workaround: consider reducing primitive cache size to 100.
- Int8 deconvolution with signed weights and activations may produce incorrect results on processors with Intel AMX support.
- Int8 softmax may crash on Windows in SYCL debug configuration.
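The RNN workaround above (reducing the primitive cache size to 100) can be applied without code changes through the documented `ONEDNN_PRIMITIVE_CACHE_CAPACITY` environment variable; the application path is a placeholder:

```shell
# Cap the primitive cache at 100 entries before the application starts,
# per the workaround for extensive RNN use on Intel GPUs. The same limit
# can be set programmatically via dnnl::set_primitive_cache_capacity().
export ONEDNN_PRIMITIVE_CACHE_CAPACITY=100

# Run the affected workload (placeholder path):
# ./your_rnn_app

echo "primitive cache capacity capped at $ONEDNN_PRIMITIVE_CACHE_CAPACITY"
```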
Thanks to these Contributors
This release contains contributions from the project core team as well as Amy Wignall @AmyWignall-arm, @baibeta, Benjamin Taylor @bentaylorhk-arm, Ilya Lavrenov @ilya-lavrenov, Kentaro Kawakami @kawakami-k, Milos Puzovic @milpuz01, Renato Barros Arantes @renato-arantes, @snadampal, @sparkyrider, and Thomas Köppe @tkoeppe. We would also like to thank everyone who asked questions and reported issues.