Releases: uxlfoundation/oneDNN
v2.1.2
This is a patch release containing the following changes to v2.1.1:
v2.2-rc
This is a release candidate for oneDNN v2.2. Please provide feedback and submit defect reports via Github issues.
v2.1.1
This is a patch release containing the following changes to v2.1:
- Improved performance of fp32 depthwise convolution with plain activations on CPU (762a9c7)
- Worked around internal compiler error in GCC 7.3.1 when building with --std=c++14 (f637501)
- Fixed memory leaks in batchnorm and gemm implementations (2ea5385, 4f3a7cf)
- Addressed several issues in benchdnn and gtests (bb7bdb4, 0e04cc2, d7df8d2, a59354f)
v2.1
Performance optimizations
- Reduced overheads associated with the primitive cache.
- Intel Processor Graphics and Xe architecture-based Graphics:
  - Improved performance of Winograd convolution.
  - Improved performance for padded memory formats.
  - Improved performance of reorder and shuffle primitives for multiple formats and all dimensions.
  - Improved performance of pooling primitive for the float16 data type.
  - Improved performance of lnorm primitive for plain formats.
  - Improved performance of resampling primitive for blocked formats.
- Intel Architecture processors:
  - Introduced initial optimizations for bfloat16 functionality for future Intel Xeon Scalable processors with Intel AMX support (code name Sapphire Rapids).
  - Improved performance of int8 and bfloat16 RNN and inner product primitives.
  - Improved performance of shuffle primitive for the bfloat16 data type.
  - Introduced a CPU ISA hints environment variable and API. The new API is intended to dispatch function implementations using YMM registers to improve performance on processors with a single Intel AVX-512 compute unit (see the sketch after this list).
  - Improved forward convolution performance on Intel AVX-512 systems.
  - Introduced initial performance optimizations for future Intel Core processors with Intel AVX2 and Intel DL Boost instructions support (code name Alder Lake).
  - Improved performance of int8 primitives for processors with Intel SSE4.1 instruction set support.
  - Improved convolution and batch normalization performance with threadpool.
- AArch64-based processors:
  - Improved performance of Winograd convolution with ArmCL.
  - Improved performance of int8 convolution with ArmCL.
  - Added JIT support for AArch64 and JIT implementations for reorder, eltwise, pooling, and batch normalization primitives.
- NVIDIA GPUs:
  - (preview) Introduced support for NVIDIA GPUs. The implementation relies on the DPC++ Compiler, cuDNN, and cuBLAS libraries.
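The CPU ISA hints mentioned above can be set programmatically before the first primitive is created. Below is a minimal sketch using the C++ API; it assumes the dnnl::set_cpu_isa_hints entry point and the cpu_isa_hints::prefer_ymm value, and the corresponding environment variable spelling may differ between builds:

```cpp
#include <iostream>
#include "dnnl.hpp"

int main() {
    // The hint must be set before any JIT kernel is generated (i.e. before
    // the first primitive is created), otherwise the call fails.
    try {
        dnnl::set_cpu_isa_hints(dnnl::cpu_isa_hints::prefer_ymm);
    } catch (const dnnl::error &e) {
        std::cerr << "could not set CPU ISA hints: " << e.what() << "\n";
    }

    dnnl::engine eng(dnnl::engine::kind::cpu, 0);
    // ... create and execute primitives; eligible kernels now prefer 256-bit
    // (YMM) registers on processors with a single AVX-512 FMA unit.
    return 0;
}
```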
New Functionality
- Introduced int8 support for LSTM primitive with projection for CPU.
- Introduced binary post-op for (de-)convolution, pooling, eltwise, binary, inner product, matmul and reduction (GPU only), along with performance optimizations for CPUs and GPUs (see the sketch after this list).
- Extended the number of supported post-ops for primitives to 20.
- Extended eltwise primitive with support for logsigmoid and clip_v2 algorithms.
- Introduced support for PReLU primitive.
- Extended matmul implementation with support for per-output channel zero-points for quantization.
- Extended support for broadcasting in binary primitive to both inputs for CPU.
- Introduced float16 support in reduction primitive for GPU.
- Introduced support for mixed input and output types in binary primitive for GPU.
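A minimal sketch of attaching the new binary post-op to a convolution with the C++ API; the shapes, the use of convolution_direct, and the per-channel addend are illustrative assumptions, not code from the release:

```cpp
#include "dnnl.hpp"
using namespace dnnl;

int main() {
    engine eng(engine::kind::cpu, 0);

    memory::desc src_md({1, 32, 14, 14}, memory::data_type::f32, memory::format_tag::any);
    memory::desc wei_md({64, 32, 3, 3}, memory::data_type::f32, memory::format_tag::any);
    memory::desc dst_md({1, 64, 14, 14}, memory::data_type::f32, memory::format_tag::any);
    // Per-channel tensor that will be added element-wise to the convolution result.
    memory::desc addend_md({1, 64, 1, 1}, memory::data_type::f32, memory::format_tag::nchw);

    post_ops ops;
    ops.append_binary(algorithm::binary_add, addend_md);
    primitive_attr attr;
    attr.set_post_ops(ops);

    convolution_forward::desc cd(prop_kind::forward_inference,
            algorithm::convolution_direct, src_md, wei_md, dst_md,
            {1, 1}, {1, 1}, {1, 1});  // strides, padding_l, padding_r
    convolution_forward::primitive_desc pd(cd, attr, eng);
    // At execution time the addend is passed with the
    // DNNL_ARG_ATTR_MULTIPLE_POST_OP(0) | DNNL_ARG_SRC_1 argument key.
    return 0;
}
```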
Usability
- Added an API to enable displaying timestamps in oneDNN verbose mode. Timestamps make it possible to correlate oneDNN verbose output with data from profiling tools.
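For example, a process can opt into timestamped verbose output before the library is first used. The environment variable names below (ONEDNN_VERBOSE, ONEDNN_VERBOSE_TIMESTAMP) follow the v2.x documentation and should be treated as assumptions of this sketch:

```cpp
#include <cstdlib>
#include "dnnl.hpp"

int main() {
    // Must be set before the library reads its configuration, i.e. before the
    // first engine or primitive is created (POSIX setenv; use _putenv_s on Windows).
    setenv("ONEDNN_VERBOSE", "1", /*overwrite=*/1);
    setenv("ONEDNN_VERBOSE_TIMESTAMP", "1", /*overwrite=*/1);

    dnnl::engine eng(dnnl::engine::kind::cpu, 0);
    // ... execute primitives; each verbose line is now prefixed with a timestamp
    // that can be matched against profiler traces.
    return 0;
}
```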
Validation
- Extended benchdnn to report operation bandwidth.
- Added ability to choose target GPU in benchdnn.
Thanks to the contributors
This release contains contributions from the project core team as well as Alejandro Alvarez, Aleksandr Nikolaev @alenik01, araki.kenichi @qnet-araki, Arthur Mitrano @aaraujom, Benjamin Fitch, Ben Tracy @CodeplayBen, Daniel Soutar @danielsoutar, @dylan-angus-codeplay, Diana Bite @diaena, higuchi.motoko @higuchi-motoko, Jacob Kahn @jacobkahn, Kentaro Kawakami @kawakami-k, Kumudha KN @KumudhaN, kurihara @Koji-Kurihara, Mehdi Goli @mehdi-goli, Nathan John Sircombe @nSircombe, Peter Caday @petercad, Rafik Saliev @rfsaliev, Xinyu Chen @xinyu-intel, yuri@FreeBSD @yurivict. We would also like to thank everyone who asked questions and reported issues.
v2.1-rc
This is a release candidate for oneDNN v2.1. Please provide feedback and report bugs in Github issues.
v1.8.1
This is a patch release containing the following changes to v1.8:
- Fixed performance regression for fp32 convolutions forward propagation on Intel Processor Graphics and Xe architecture-based Graphics (2c8d206, d8d6807)
- Fixed segmentation fault for fp32 and bfloat16 convolutions with huge spatial dimensions on processors with Intel AVX2 and Intel AVX512 support (fe8487d, cb8ef4e)
- Fixed correctness issue in depthwise convolution (groups = channels) weight gradient with non-trivial padding and strides on Intel64 processors (b7ffe48)
- Fixed correctness issue in int8 convolution with 1x1 filter and non-trivial padding on Intel Processor Graphics and Xe architecture-based Graphics (5b4201c)
- Fixed performance regression for dnnl_sgemm, fp32 matmul and inner product on Intel64 processors and improved performance of this functionality with threadpool threading (32c1110)
v1.8
Performance optimizations
- Intel Processor Graphics and Xe architecture-based Graphics:
  - Improved performance of Winograd convolution.
- Intel Architecture processors:
  - Introduced initial performance optimizations for future Intel Core processors with Intel AVX2 and Intel DL Boost instructions support (code name Alder Lake).
  - Improved performance of int8 primitives for processors with Intel SSE4.1 instruction set support.
  - Improved performance of int8 and bfloat16 RNN and inner product primitives.
- AArch64-based processors:
  - Improved performance of Winograd convolution with ArmCL.
  - Improved performance of int8 convolution with ArmCL.
  - Added JIT support for AArch64 and a JIT reorder implementation.
New Functionality
- Introduced int8 support for LSTM primitive with projection for CPU.
Thanks to the contributors
This release contains contributions from the project core team as well as Alejandro Alvarez, Aleksandr Nikolaev @alenik01, Arthur Mitrano @aaraujom, Benjamin Fitch, Diana Bite @diaena, Kentaro Kawakami @kawakami-k, Nathan John Sircombe @nSircombe, Peter Caday @petercad, Rafik Saliev @rfsaliev, yuri@FreeBSD @yurivict. We would also like to thank everyone who asked questions and reported issues.
v1.8-rc
This is a release candidate for oneDNN v1.8. Please provide feedback and report bugs in Github issues.
v2.0
This is a major oneDNN release based on oneDNN v1.7.
Binary distribution of this software is available as Intel(R) oneAPI Deep Neural Network Library in Intel(R) oneAPI.
Breaking API changes
- OpenCL API:
  - OpenCL interoperability API moved to dnnl_ocl.hpp.
  - Engine, stream, and memory objects are created from the corresponding CL objects using free functions (see the sketch after this list).
- Threadpool:
  - Threadpool API moved to dnnl_threadpool.hpp.
  - Stream objects for threadpool are created using the free function dnnl::threadpool_interop::make_stream.
  - Removed stream attributes.
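A sketch of the new OpenCL interop style, assuming already existing OpenCL device, context, and queue objects; the free-function names follow the dnnl::ocl_interop namespace declared in dnnl_ocl.hpp:

```cpp
#include <CL/cl.h>
#include "dnnl.hpp"
#include "dnnl_ocl.hpp"

// Build an engine and a stream from existing OpenCL objects using the
// free functions instead of the removed interop constructors.
dnnl::engine make_engine_and_stream(cl_device_id dev, cl_context ctx,
        cl_command_queue queue, dnnl::stream &strm) {
    dnnl::engine eng = dnnl::ocl_interop::make_engine(dev, ctx);
    strm = dnnl::ocl_interop::make_stream(eng, queue);
    return eng;
}
```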
New Functionality
- Introduced SYCL API extensions compliant with oneAPI specification v1.0.
- Introduced support for Intel(R) DPC++ Compiler and Level Zero runtime.
- Introduced Unified Shared Memory (USM) support for Intel Processor Graphics and Xe architecture-based graphics.
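As a sketch of the USM support, the snippet below binds a shared USM allocation to a oneDNN memory object; it assumes the dnnl::sycl_interop helpers from dnnl_sycl.hpp (make_memory, get_device, get_context) and a DPC++ compiler:

```cpp
#include <CL/sycl.hpp>
#include "dnnl.hpp"
#include "dnnl_sycl.hpp"
using namespace dnnl;

int main() {
    engine eng(engine::kind::gpu, 0);
    stream strm(eng);

    memory::desc md({1, 3, 224, 224}, memory::data_type::f32, memory::format_tag::nchw);

    // Allocate shared USM for the tensor and hand it to the library.
    auto dev = sycl_interop::get_device(eng);
    auto ctx = sycl_interop::get_context(eng);
    float *ptr = static_cast<float *>(
            cl::sycl::malloc_shared(md.get_size(), dev, ctx));
    memory mem = sycl_interop::make_memory(md, eng, sycl_interop::memory_kind::usm, ptr);
    // ... use mem in primitives; release with cl::sycl::free(ptr, ctx) when done.
    return 0;
}
```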
Known Issues and Limitations
- Pooling, batch normalization, and binary primitives may segfault when executed on Xe architecture-based graphics. No workaround available.
- Non-Intel GPUs are not supported. The library API allows creating a DNNL engine by index (the order of devices is determined by the SYCL runtime), and there is no check that the selected GPU device is an Intel device. For more control, users can create a DNNL engine by passing a SYCL device and context explicitly (see the sketch after this list).
- When running GPU kernels that take longer than a certain time (which depends on OS and system settings), the application may appear to hang. The driver or system settings can be configured to disable this timeout and avoid hangs of DPC++ or OpenCL programs, including oneDNN examples:
  - On Linux* (see more details at OpenCL™ Driver for Intel® HD, Iris™, and Iris™ Pro Graphics for Linux):
    $ sudo bash -c 'echo N > /sys/module/i915/parameters/enable_hangcheck'
  - On Windows* (see more details at Timeout Detection and Recovery (TDR) Registry Keys):
    Increase the TdrDelay and TdrDdiDelay values in the registry.
- See DPC++ limitations that impact the library as well.
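For the device-selection point above, a minimal sketch of creating the engine from an explicit SYCL device and context, assuming the dnnl::sycl_interop::make_engine free function from dnnl_sycl.hpp:

```cpp
#include <CL/sycl.hpp>
#include "dnnl.hpp"
#include "dnnl_sycl.hpp"

// Create a oneDNN engine for a specific GPU instead of relying on the
// index-based constructor and the SYCL runtime's device ordering.
dnnl::engine make_engine_for(const cl::sycl::device &dev) {
    cl::sycl::context ctx(dev);
    return dnnl::sycl_interop::make_engine(dev, ctx);
}
```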
v1.7
Performance optimizations
- Intel Processor Graphics and Xe architecture-based Graphics:
  - Improved performance of convolutions and matmul primitives.
  - Improved performance of int8 convolutions for the NHWC activations format.
- Intel Architecture processors:
  - Improved performance of primitives for the NHWC activations format.
  - Improved fp32 GEMM performance for small N.
  - Improved performance of int8 primitives for processors with Intel SSE4.1 instruction set support.
- AArch64-based processors:
  - Added support for Arm Performance Library (ArmPL). ArmPL provides an optimized GEMM implementation for AArch64.
  - Added support for Arm Compute Library (ArmCL). ArmCL provides an optimized convolution implementation for AArch64.
New Functionality
- Added support for IBMz (s390x) and IBM POWER (powerpc64) architectures.
- Introduced RNN GRU for GPU.
- Introduced int8 RNN GRU for CPU.
- Introduced asymmetric quantization support for convolutions and matmul.
- Introduced dilated pooling support.
- Extended matmul primitive to support multiple batch dimensions and broadcast on CPU (see the sketch after this list).
- (preview) Introduced binary post-op for (de-)convolution, pooling, eltwise, binary, inner product, and matmul.
- (preview) Extended the number of supported post-ops for primitives to 20.
- (preview) Introduced reduction primitive for CPU. Together with post-ops, this functionality makes it possible to implement normalization.
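A minimal sketch of the extended matmul on CPU, with a 3D source and weights broadcast across the batch dimension; the shapes are illustrative and the code uses the v1.x C++ API with op descriptors:

```cpp
#include "dnnl.hpp"
using namespace dnnl;

int main() {
    engine eng(engine::kind::cpu, 0);

    const memory::dim B = 8, M = 128, K = 256, N = 64;
    memory::desc src_md({B, M, K}, memory::data_type::f32, memory::format_tag::abc);
    // Batch dimension of 1 in the weights is broadcast over the source batch.
    memory::desc wei_md({1, K, N}, memory::data_type::f32, memory::format_tag::abc);
    memory::desc dst_md({B, M, N}, memory::data_type::f32, memory::format_tag::abc);

    matmul::desc md(src_md, wei_md, dst_md);
    matmul::primitive_desc pd(md, eng);
    matmul prim(pd);
    // prim.execute(strm, {{DNNL_ARG_SRC, src_mem}, {DNNL_ARG_WEIGHTS, wei_mem},
    //                     {DNNL_ARG_DST, dst_mem}});
    return 0;
}
```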
Thanks to the contributors
This release contains contributions from the project core team as well as Ben Fitch, Brian Shi, David Edelsohn @edelsohn, Diana Bite @diaena, Moaz Reyad @moazreyad, Nathan John Sircombe @nSircombe, Niels Dekker @N-Dekker, Peter Caday @petercad, Pinzhen Xu @pinzhenx, pkubaj @pkubaj, Tsao Zhong @CaoZhongZ. We would also like to thank everyone who asked questions and reported issues.