Releases: uxlfoundation/oneDNN
v2.1.2
This is a patch release containing the following changes to v2.1.1:
v2.2-rc
This is a release candidate for oneDNN v2.2. Please provide feedback and submit defect reports via Github issues.
v2.1.1
This is a patch release containing the following changes to v2.1:
- Improved performance of fp32 depthwise convolution with plain activations on CPU (762a9c7)
- Worked around internal compiler error in GCC 7.3.1 when building with --std=c++14 (f637501)
- Fixed memory leaks in batchnorm and gemm implementations (2ea5385, 4f3a7cf)
- Addressed several issues in benchdnn and gtests (bb7bdb4, 0e04cc2, d7df8d2, a59354f)
v2.1
Performance optimizations
- Reduced overheads associated with the primitive cache.
- Intel Processor Graphics and Xe architecture-based Graphics:
  - Improved performance of Winograd convolution.
  - Improved performance for padded memory formats.
  - Improved performance of reorder and shuffle primitives for multiple formats and all dimensions.
  - Improved performance of pooling primitive for the float16 data type.
  - Improved performance of lnorm primitive for plain formats.
  - Improved performance of resampling primitive for blocked formats.
- Intel Architecture processors:
  - Introduced initial optimizations for bfloat16 functionality for future Intel Xeon Scalable processors with Intel AMX support (code name Sapphire Rapids).
  - Improved performance of int8 and bfloat16 RNN and inner product primitives.
  - Improved performance of shuffle primitive for the bfloat16 data type.
  - Introduced a CPU ISA hints environment variable and API. The new API is intended to dispatch function implementations using YMM registers to improve performance on processors with a single Intel AVX-512 compute unit (see the sketch after this list).
  - Improved forward convolution performance on Intel AVX-512 systems.
  - Introduced initial performance optimizations for future Intel Core processors with Intel AVX2 and Intel DL Boost instructions support (code name Alder Lake).
  - Improved performance of int8 primitives for processors with Intel SSE4.1 instruction set support.
  - Improved convolution and batch normalization performance with threadpool.
- AArch64-based processors:
  - Improved performance of Winograd convolution with ArmCL.
  - Improved performance of int8 convolution with ArmCL.
  - Added JIT support for AArch64 and JIT implementations for reorder, eltwise, pooling, and batch normalization primitives.
- NVIDIA GPUs:
  - (preview) Introduced support for NVIDIA GPUs. The implementation relies on the DPC++ Compiler, cuDNN, and cuBLAS libraries.
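The CPU ISA hints mentioned above can be set programmatically before the first primitive is created. Below is a minimal sketch using the C++ API; it assumes the dnnl::set_cpu_isa_hints entry point and the cpu_isa_hints::prefer_ymm value, and the corresponding environment variable spelling may differ between builds:

```cpp
#include <iostream>
#include "dnnl.hpp"

int main() {
    // The hint must be set before any JIT kernel is generated (i.e. before
    // the first primitive is created), otherwise the call fails.
    try {
        dnnl::set_cpu_isa_hints(dnnl::cpu_isa_hints::prefer_ymm);
    } catch (const dnnl::error &e) {
        std::cerr << "could not set CPU ISA hints: " << e.what() << "\n";
    }

    dnnl::engine eng(dnnl::engine::kind::cpu, 0);
    // ... create and execute primitives; eligible kernels now prefer 256-bit
    // (YMM) registers on processors with a single AVX-512 FMA unit.
    return 0;
}
```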
New Functionality
- Introduced int8 support for LSTM primitive with projection for CPU.
- Introduced binary post-op for (de-)convolution, pooling, eltwise, binary, inner product, matmul and reduction (GPU only), along with performance optimizations for CPUs and GPUs (see the sketch after this list).
- Extended the number of supported post-ops for primitives to 20.
- Extended eltwise primitive with support for logsigmoid and clip_v2 algorithms.
- Introduced support for PReLU primitive.
- Extended matmul implementation with support for per-output channel zero-points for quantization.
- Extended support for broadcasting in binary primitive to both inputs for CPU.
- Introduced float16 support in reduction primitive for GPU.
- Introduced support for mixed input and output types in binary primitive for GPU.
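A minimal sketch of attaching the new binary post-op to a convolution with the C++ API; the shapes, the use of convolution_direct, and the per-channel addend are illustrative assumptions, not code from the release:

```cpp
#include "dnnl.hpp"
using namespace dnnl;

int main() {
    engine eng(engine::kind::cpu, 0);

    memory::desc src_md({1, 32, 14, 14}, memory::data_type::f32, memory::format_tag::any);
    memory::desc wei_md({64, 32, 3, 3}, memory::data_type::f32, memory::format_tag::any);
    memory::desc dst_md({1, 64, 14, 14}, memory::data_type::f32, memory::format_tag::any);
    // Per-channel tensor that will be added element-wise to the convolution result.
    memory::desc addend_md({1, 64, 1, 1}, memory::data_type::f32, memory::format_tag::nchw);

    post_ops ops;
    ops.append_binary(algorithm::binary_add, addend_md);
    primitive_attr attr;
    attr.set_post_ops(ops);

    convolution_forward::desc cd(prop_kind::forward_inference,
            algorithm::convolution_direct, src_md, wei_md, dst_md,
            {1, 1}, {1, 1}, {1, 1});  // strides, padding_l, padding_r
    convolution_forward::primitive_desc pd(cd, attr, eng);
    // At execution time the addend is passed with the
    // DNNL_ARG_ATTR_MULTIPLE_POST_OP(0) | DNNL_ARG_SRC_1 argument key.
    return 0;
}
```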
Usability
- Added an API to enable displaying timestamps in oneDNN verbose mode. Timestamps make it possible to correlate oneDNN verbose output with data from profiling tools.
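For example, a process can opt into timestamped verbose output before the library is first used. The environment variable names below (ONEDNN_VERBOSE, ONEDNN_VERBOSE_TIMESTAMP) follow the v2.x documentation and should be treated as assumptions of this sketch:

```cpp
#include <cstdlib>
#include "dnnl.hpp"

int main() {
    // Must be set before the library reads its configuration, i.e. before the
    // first engine or primitive is created (POSIX setenv; use _putenv_s on Windows).
    setenv("ONEDNN_VERBOSE", "1", /*overwrite=*/1);
    setenv("ONEDNN_VERBOSE_TIMESTAMP", "1", /*overwrite=*/1);

    dnnl::engine eng(dnnl::engine::kind::cpu, 0);
    // ... execute primitives; each verbose line is now prefixed with a timestamp
    // that can be matched against profiler traces.
    return 0;
}
```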
Validation
- Extended benchdnn to report operation bandwidth.
- Added ability to choose target GPU in benchdnn.
Thanks to the contributors
This release contains contributions from the project core team as well as Alejandro Alvarez, Aleksandr Nikolaev @alenik01, araki.kenichi @qnet-araki, Arthur Mitrano @aaraujom, Benjamin Fitch, Ben Tracy @CodeplayBen, Daniel Soutar @danielsoutar, @dylan-angus-codeplay, Diana Bite @diaena, higuchi.motoko @higuchi-motoko, Jacob Kahn @jacobkahn, Kentaro Kawakami @kawakami-k, Kumudha KN @KumudhaN, kurihara @Koji-Kurihara, Mehdi Goli @mehdi-goli, Nathan John Sircombe @nSircombe, Peter Caday @petercad, Rafik Saliev @rfsaliev, Xinyu Chen @xinyu-intel, yuri@FreeBSD @yurivict. We would also like to thank everyone who asked questions and reported issues.
v2.1-rc
This is a release candidate for oneDNN v2.1. Please provide feedback and report bugs in Github issues.
v1.8.1
This is a patch release containing the following changes to v1.8:
- Fixed performance regression for fp32 convolutions forward propagation on Intel Processor Graphics and Xe architecture-based Graphics (2c8d206, d8d6807)
- Fixed segmentation fault for fp32 and bfloat16 convolutions with huge spatial dimensions on processors with Intel AVX2 and Intel AVX512 support (fe8487d, cb8ef4e)
- Fixed correctness issue in depthwise convolution (groups = channels) weight gradient with non-trivial padding and strides on Intel64 processors (b7ffe48)
- Fixed correctness issue in int8 convolution with 1x1 filter and non-trivial padding on Intel Processor Graphics and Xe architecture-based Graphics (5b4201c)
- Fixed performance regression for dnnl_sgemm, fp32 matmul and inner product on Intel64 processors and improved performance of this functionality with threadpool threading (32c1110)
v1.8
Performance optimizations
- Intel Processor Graphics and Xe architecture-based Graphics:
  - Improved performance of Winograd convolution.
- Intel Architecture processors:
  - Introduced initial performance optimizations for future Intel Core processors with Intel AVX2 and Intel DL Boost instructions support (code name Alder Lake).
  - Improved performance of int8 primitives for processors with Intel SSE4.1 instruction set support.
  - Improved performance of int8 and bfloat16 RNN and inner product primitives.
- AArch64-based processors:
  - Improved performance of Winograd convolution with ArmCL.
  - Improved performance of int8 convolution with ArmCL.
  - Added JIT support for AArch64 and a JIT reorder implementation.
New Functionality
- Introduced int8 support for LSTM primitive with projection for CPU.
Thanks to the contributors
This release contains contributions from the project core team as well as Alejandro Alvarez, Aleksandr Nikolaev @alenik01, Arthur Mitrano @aaraujom, Benjamin Fitch, Diana Bite @diaena, Kentaro Kawakami @kawakami-k, Nathan John Sircombe @nSircombe, Peter Caday @petercad, Rafik Saliev @rfsaliev, yuri@FreeBSD @yurivict. We would also like to thank everyone who asked questions and reported issues.
v1.8-rc
This is a release candidate for oneDNN v1.8. Please provide feedback and report bugs in Github issues.
v2.0
This is a major oneDNN release based on oneDNN v1.7.
Binary distribution of this software is available as Intel(R) oneAPI Deep Neural Network Library in Intel(R) oneAPI.
Breaking API changes
- OpenCL API:
  - OpenCL interoperability API moved to dnnl_ocl.hpp.
  - Engine, stream, and memory objects are created from the corresponding CL objects using free functions (see the sketch after this list).
- Threadpool:
  - Threadpool API moved to dnnl_threadpool.hpp.
  - Stream objects for threadpool are created using the free function dnnl::threadpool_interop::make_stream.
  - Removed stream attributes.
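A sketch of the new OpenCL interop style, assuming already existing OpenCL device, context, and queue objects; the free-function names follow the dnnl::ocl_interop namespace declared in dnnl_ocl.hpp:

```cpp
#include <CL/cl.h>
#include "dnnl.hpp"
#include "dnnl_ocl.hpp"

// Build an engine and a stream from existing OpenCL objects using the
// free functions instead of the removed interop constructors.
dnnl::engine make_engine_and_stream(cl_device_id dev, cl_context ctx,
        cl_command_queue queue, dnnl::stream &strm) {
    dnnl::engine eng = dnnl::ocl_interop::make_engine(dev, ctx);
    strm = dnnl::ocl_interop::make_stream(eng, queue);
    return eng;
}
```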
New Functionality
- Introduced SYCL API extensions compliant with oneAPI specification v1.0.
- Introduced support for Intel(R) DPC++ Compiler and Level Zero runtime.
- Introduced Unified Shared Memory (USM) support for Intel Processor Graphics and Xe architecture-based graphics.
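As a sketch of the USM support, the snippet below binds a shared USM allocation to a oneDNN memory object; it assumes the dnnl::sycl_interop helpers from dnnl_sycl.hpp (make_memory, get_device, get_context) and a DPC++ compiler:

```cpp
#include <CL/sycl.hpp>
#include "dnnl.hpp"
#include "dnnl_sycl.hpp"
using namespace dnnl;

int main() {
    engine eng(engine::kind::gpu, 0);
    stream strm(eng);

    memory::desc md({1, 3, 224, 224}, memory::data_type::f32, memory::format_tag::nchw);

    // Allocate shared USM for the tensor and hand it to the library.
    auto dev = sycl_interop::get_device(eng);
    auto ctx = sycl_interop::get_context(eng);
    float *ptr = static_cast<float *>(
            cl::sycl::malloc_shared(md.get_size(), dev, ctx));
    memory mem = sycl_interop::make_memory(md, eng, sycl_interop::memory_kind::usm, ptr);
    // ... use mem in primitives; release with cl::sycl::free(ptr, ctx) when done.
    return 0;
}
```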
Known Issues and Limitations
- Pooling, batch normalization, and binary primitives may segfault when executed on Xe architecture-based graphics. No workaround available.
- Non-Intel GPUs are not supported. The library API allows creating a DNNL engine by index (the order of devices is determined by the SYCL runtime), and there is no check that the selected GPU device is an Intel device. For more control, users can create a DNNL engine by passing a SYCL device and context explicitly (see the sketch after this list).
- When running GPU kernels that take longer than a certain time (which depends on OS and system settings), the application may appear to hang. The driver or system settings can be configured to disable this timeout and avoid hangs of DPC++ or OpenCL programs, including oneDNN examples:
  - On Linux* (see more details at OpenCL™ Driver for Intel® HD, Iris™, and Iris™ Pro Graphics for Linux):
    $ sudo bash -c 'echo N > /sys/module/i915/parameters/enable_hangcheck'
  - On Windows* (see more details at Timeout Detection and Recovery (TDR) Registry Keys):
    Increase the TdrDelay and TdrDdiDelay values in the registry.
- See DPC++ limitations that impact the library as well.
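For the device-selection point above, a minimal sketch of creating the engine from an explicit SYCL device and context, assuming the dnnl::sycl_interop::make_engine free function from dnnl_sycl.hpp:

```cpp
#include <CL/sycl.hpp>
#include "dnnl.hpp"
#include "dnnl_sycl.hpp"

// Create a oneDNN engine for a specific GPU instead of relying on the
// index-based constructor and the SYCL runtime's device ordering.
dnnl::engine make_engine_for(const cl::sycl::device &dev) {
    cl::sycl::context ctx(dev);
    return dnnl::sycl_interop::make_engine(dev, ctx);
}
```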
v1.7
Performance optimizations
- Intel Processor Graphics and Xe architecture-based Graphics:
  - Improved performance of convolutions and matmul primitives.
  - Improved performance of int8 convolutions for the NHWC activations format.
- Intel Architecture processors:
  - Improved performance of primitives for the NHWC activations format.
  - Improved fp32 GEMM performance for small N.
  - Improved performance of int8 primitives for processors with Intel SSE4.1 instruction set support.
- AArch64-based processors:
  - Added support for Arm Performance Library (ArmPL). ArmPL provides an optimized GEMM implementation for AArch64.
  - Added support for Arm Compute Library (ArmCL). ArmCL provides an optimized convolution implementation for AArch64.
New Functionality
- Added support for IBMz (s390x) and IBM POWER (powerpc64) architectures.
- Introduced RNN GRU for GPU.
- Introduced int8 RNN GRU for CPU.
- Introduced asymmetric quantization support for convolutions and matmul.
- Introduced dilated pooling support.
- Extended matmul primitive to support multiple batch dimensions and broadcast on CPU (see the sketch after this list).
- (preview) Introduced binary post-op for (de-)convolution, pooling, eltwise, binary, inner product, and matmul.
- (preview) Extended the number of supported post-ops for primitives to 20.
- (preview) Introduced reduction primitive for CPU. Together with post-ops, this functionality makes it possible to implement normalization.
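A minimal sketch of the extended matmul on CPU, with a 3D source and weights broadcast across the batch dimension; the shapes are illustrative and the code uses the v1.x C++ API with op descriptors:

```cpp
#include "dnnl.hpp"
using namespace dnnl;

int main() {
    engine eng(engine::kind::cpu, 0);

    const memory::dim B = 8, M = 128, K = 256, N = 64;
    memory::desc src_md({B, M, K}, memory::data_type::f32, memory::format_tag::abc);
    // Batch dimension of 1 in the weights is broadcast over the source batch.
    memory::desc wei_md({1, K, N}, memory::data_type::f32, memory::format_tag::abc);
    memory::desc dst_md({B, M, N}, memory::data_type::f32, memory::format_tag::abc);

    matmul::desc md(src_md, wei_md, dst_md);
    matmul::primitive_desc pd(md, eng);
    matmul prim(pd);
    // prim.execute(strm, {{DNNL_ARG_SRC, src_mem}, {DNNL_ARG_WEIGHTS, wei_mem},
    //                     {DNNL_ARG_DST, dst_mem}});
    return 0;
}
```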
Thanks to the contributors
This release contains contributions from the project core team as well as Ben Fitch, Brian Shi, David Edelsohn @edelsohn, Diana Bite @diaena, Moaz Reyad @moazreyad, Nathan John Sircombe @nSircombe, Niels Dekker @N-Dekker, Peter Caday @petercad, Pinzhen Xu @pinzhenx, pkubaj @pkubaj, Tsao Zhong @CaoZhongZ. We would also like to thank everyone who asked questions and reported issues.