Releases: uxlfoundation/oneDNN

v2.7.3

13 Jan 01:03

This is a patch release containing the following changes to v2.7.2:

  • Fixed segfault in int8 convolution with binary post-ops on Intel CPUs (c8d40c0)
  • Applied workaround for tanh post-op on some Xe architecture based GPUs (3eb3267)
  • Disabled fp16 post-ops with Compute Library for Arm Architecture (ACL) (f7b7dc0)
  • Fixed incorrect results for a sequence of eltwise post-ops with the same algorithm but different parameters (02c2678, 1c36e27, 81ba0fe)
  • Fixed issue in convolution with groups and plain activation layout on Intel GPUs (df6f2e3, d0c14c2)
  • Fixed reorder failures on Xe HPC architecture based GPUs (c3cb1d5)
  • Fixed thread safety issue in convolution primitive (2955c9d)
  • Fixed scratchpad allocation issue in matmul (989acd3)
  • Disabled concat batching with scales on Intel GPUs because the implementation does not support it yet (8aab73f, 1eac450, 82838de)
  • Fixed segfault and correctness issues in convolution primitive with sum and relu post-ops on Intel CPUs (fc335be, 0f4697a, 60f1727, d28f2c1, 4761ee9, f674fbf)

v3.0

20 Dec 00:15

Performance Optimizations

  • Intel Architecture Processors:
    • Improved performance for 4th generation Intel Xeon Scalable processor (formerly Sapphire Rapids).
    • Introduced FP16 support and initial optimizations for future Intel Xeon Scalable processor (code name Granite Rapids). The functionality is disabled by default and should be enabled via CPU dispatcher control.
  • Intel Graphics Products:
    • Improved performance for Intel Data Center GPU Max Series (formerly Ponte Vecchio).
    • Improved performance for Intel Arc graphics (formerly Alchemist and DG2) and Intel Data Center GPU Flex Series (formerly Arctic Sound-M).
  • AArch64-based Processors:
    • Improved reorder performance for processors with Scalable Vector Extensions (SVE) support.
    • Improved pooling performance with post-ops for processors with SVE 512 support.
    • Improved batch normalization performance with non-default flags for processors with SVE 512 support.
    • Improved performance of FP16 functionality with Compute Library for Arm Architecture (ACL).
    • Improved deconvolution performance with ACL.
  • PowerPC64-based Processors:
    • Improved int8 GEMM performance.

Functionality

  • Introduced new quantization scheme. Major changes include support for per-argument runtime scales in all primitives and unquantized bias.
  • [experimental] Introduced Graph API support that simplifies oneDNN integration into applications. The functionality is disabled by default and can be enabled at build time with ONEDNN_BUILD_GRAPH=ON flag.
  • Introduced support for Intel DPC++/C++ Compiler 2023.0, including new features from the SYCL 2020 standard.
  • Extended persistent cache to cover GPU engine object. This improvement allows applications to further reduce oneDNN initialization time.
  • Extended threadpool API with a function to indicate maximum available concurrency.
  • Extended binary primitive implementation on GPU with bfloat16 source and int8 destination support.
  • Introduced pooling and reduction primitives support on AMD GPUs.
  • Introduced reduction primitive support on NVIDIA GPUs.
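
The quantization change above can be illustrated with plain arithmetic. The following is a minimal sketch in Python, not oneDNN API code: the helper names are hypothetical, and it assumes the convention that a scale maps a quantized value to a real one (x_f32 = scale * x_int8), with one scale per argument (source, weights, destination) supplied at execution time and a bias kept in f32 and added after dequantization.

```python
def quantize(values, scale):
    """Map real values to int8: q = round(x / scale), saturated to [-128, 127].

    Hypothetical helper for illustration only; assumes x_f32 = scale * x_int8.
    """
    return [max(-128, min(127, round(v / scale))) for v in values]


def int8_dot(src_q, wei_q, bias_f32, src_scale, wei_scale, dst_scale):
    """Dot product on int8 data with one runtime scale per argument.

    The bias is not quantized: it is added in f32 after the integer
    accumulator has been scaled back to real values (assumed reading of
    "unquantized bias").
    """
    # Accumulate in integer arithmetic, as an int8 kernel would.
    acc = sum(s * w for s, w in zip(src_q, wei_q))
    # Dequantize with the source and weights scales, then add the f32 bias.
    real = src_scale * wei_scale * acc + bias_f32
    # Requantize with the destination scale and saturate to int8.
    return max(-128, min(127, round(real / dst_scale)))


src_q = quantize([1.0, -2.0, 0.5], 0.5)    # [2, -4, 1]
wei_q = quantize([0.5, 0.25, -1.0], 0.25)  # [2, 1, -4]
result = int8_dot(src_q, wei_q, bias_f32=1.5,
                  src_scale=0.5, wei_scale=0.25, dst_scale=0.125)
print(result)  # 8, i.e. 8 * 0.125 = 1.0 in real values
```

Because the scales are ordinary arguments of `int8_dot` rather than values fixed when the function is defined, the same code path can serve inputs with different scales, which is the property the per-argument runtime scales are meant to provide.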

Usability

  • Extended the set of supported format tags to cover formats used in applications.

Validation

  • Extended the GoogleTest (gtest) suite with support for Parametric Rectified Linear Unit (PReLU) primitive.

Breaking Changes

  • Removed deprecated APIs.
  • Removed operation descriptor object and made memory descriptor object opaque. See details in operation and memory descriptors RFC.
  • Removed creation time primitive scales support and primitive output scales support. See details in quantization scaling RFC.
  • Removed support for Intel DPC++/C++ Compiler 2022 and for the SYCL 1.2.1 (aka SYCL 2017) standard. Use Intel DPC++/C++ Compiler 2023.0 and the SYCL 2020 standard instead.
  • Removed Winograd convolution implementation for int8 data type.
  • Updated minimal supported ACL version to 22.08 (was 22.05).

Thanks to the Contributors

This release contains contributions from the project core team as well as @akshatasangelkar, Aryan Karumuri @AryanKarumuri, Crefeda Rodrigues @cfRod, Divakar Mariyanna @bmdivakar, Gordon Fossum @austinpagan, Jonathan Deakin @jondea, Kentaro Kawakami @kawakami-k, lilianhuang @lilh9598, Milos Puzovic @milpuz01, Mona Minakshi @monaminakshi, Nathan John Sircombe @nSircombe, Peter Caday @petercad, and Sreekanth Yalachigere @sreekanth-yalachigere. We would also like to thank everyone who asked questions and reported issues.

graph-v0.8

22 Dec 17:19

Pre-release

This is the Beta Update 2 release of oneDNN Graph API based on oneDNN v2.7.2.

Functionality

  • Added HardSigmoid operation.
  • Added block tensor layout support to improve performance on Xe architecture-based GPUs.
  • Added support for IOX and XOI weight formats for ConvTranspose operation.
  • Added query_dynamic_outputs API to support dynamic shapes in the graph. This functionality allows Graph API to infer output tensors shapes based on input tensors.
  • Experimental: Introduced dynamic shapes support for MHA via oneDNN Graph Compiler.

Known Issues and Limitations

  • The weight’s opaque layout can be queried only from a compiled partition, which requires input tensor shapes to be known at compilation time.
  • MHA and MLP fusion are not activated on machines without Intel AVX-512 support.

Thanks to the Contributors

This release contains contributions from the project core teams as well as Jiong Gong, Chunyuan Wu, Sanchit Jain, Yiqiang Li, Yunfei Mao, Kiefer Kuah and others.

v3.0-rc

02 Dec 22:54

Pre-release

This is a release candidate for oneDNN v3.0. Please provide feedback and submit defect reports via GitHub issues.

Performance Optimizations

  • Intel Architecture Processors:
    • Improved performance for 4th generation Intel Xeon Scalable processor (formerly Sapphire Rapids).
    • Introduced FP16 support and initial optimizations for future Intel Xeon Scalable processor (code name Granite Rapids).
  • Intel Graphics Products:
    • Improved performance for Intel Data Center GPU Max Series (formerly Ponte Vecchio).
    • Improved performance for Intel Arc graphics (formerly Alchemist and DG2) and Intel Data Center GPU Flex Series (formerly Arctic Sound-M).
  • AArch64-based Processors:
    • Improved reorder performance for processors with Scalable Vector Extensions (SVE) support.
    • Improved pooling performance with post-ops for processors with SVE 512 support.
    • Improved batch normalization performance with non-default flags for processors with SVE 512 support.
    • Improved performance of FP16 functionality with Compute Library for Arm Architecture (ACL).
    • Improved deconvolution performance with ACL.
  • PowerPC64-based Processors:
    • Improved int8 GEMM performance.

Functionality

  • Introduced new quantization scheme. Major changes include support for per-argument runtime scales in all primitives and unquantized bias.
  • [experimental] Introduced Graph API support that simplifies oneDNN integration into applications. The functionality is disabled by default and can be enabled at build time with ONEDNN_BUILD_GRAPH=ON flag.
  • Introduced support for Intel DPC++/C++ Compiler 2023.0, including new features from the SYCL 2020 standard.
  • Extended persistent cache to cover GPU engine object. This improvement allows applications to further reduce oneDNN initialization time.
  • Extended threadpool API with a function to indicate maximum available concurrency.
  • Extended binary primitive implementation on GPU with bfloat16 source and int8 destination support.
  • Introduced pooling and reduction primitives support on AMD GPUs.
  • Introduced reduction primitive support on NVIDIA GPUs.

Usability

  • Extended the set of supported format tags to cover formats used in applications.

Validation

  • Extended the GoogleTest (gtest) suite with support for Parametric Rectified Linear Unit (PReLU) primitive.

Breaking Changes

  • Removed deprecated APIs.
  • Removed operation descriptor object and made memory descriptor object opaque. See details in operation and memory descriptors RFC.
  • Removed creation time primitive scales support and primitive output scales support. See details in quantization scaling RFC.
  • Removed support for Intel DPC++/C++ Compiler with SYCL 1.2.1 (aka SYCL 2017) standard.
  • Removed Winograd convolution implementation for int8 data type.
  • Updated minimal supported ACL version to 22.08 (was 22.05).

Thanks to the Contributors

This release contains contributions from the project core team as well as @akshatasangelkar, Aryan Karumuri @AryanKarumuri, Crefeda Rodrigues @cfRod, Divakar Mariyanna @bmdivakar, Gordon Fossum @austinpagan, Jonathan Deakin @jondea, Kentaro Kawakami @kawakami-k, lilianhuang @lilh9598, Milos Puzovic @milpuz01, Mona Minakshi @monaminakshi, Nathan John Sircombe @nSircombe, Peter Caday @petercad, and Sreekanth Yalachigere @sreekanth-yalachigere. We would also like to thank everyone who asked questions and reported issues.

graph-v0.7.2

01 Dec 20:28

Pre-release

This is a patch release containing the following changes to graph-v0.7.1:

v2.7.2

19 Nov 00:47

This is a patch release containing the following changes to v2.7.1:

  • Fixed segfaults in deconvolution backpropagation with ACL on AArch64-based processors (f02e6f3)
  • Fixed code generation issues in Intel AVX2 convolution implementation (2ba2523, b60633f, 844326b, 2009164)
  • Fixed correctness issues and runtime errors in deconvolution with binary post-ops on Intel GPUs (dd54d39)
  • Improved performance of convolutions with small number of channels and large spatial sizes on systems with Intel AMX (26f97dc, 4cb648d)
  • Fixed runtime error in int8 convolutions with groups on Xe architecture based GPUs (e5a70f4)
  • Improved inner product weight gradient performance on Xe architecture based GPUs (9e9b859, 12ec4e3)
  • Improved batch normalization performance with threadpool threading (4fd5ab2)
  • Improved inner product performance with binary post-ops in broadcast mode on Intel CPUs (d43c70d, 49ca4e1)
  • Fixed segfaults and correctness issues in sum primitive with threadpool threading (ee7a321)
  • Extended persistent cache API to cover engine objects (58481d6, 5f69dad, 16c0a95, 068071b)
  • Added support for newer versions of Intel GPU drivers (7144393)
  • Updated ITT API version to 3.23.0 (d23cc95)
  • Fixed convolution correctness issue on Intel Data Center GPU Flex Series (365ac20)
  • Fixed fp64 convolution correctness issue on Intel Data Center GPU Max Series (9d4bf94, 6705403)
  • Fixed correctness issues in reduction primitive with binary post-op on Intel GPUs (ae9d075, e3b80c5)
  • Improved convolution performance on Intel Data Center GPU Max Series (90be8d5, caf4863)
  • Fixed build errors with ONEDNN_ENABLE_PRIMITIVE_GPU_ISA build option (de2db04)
  • Fixed correctness issues in convolution with per-tensor binary post-ops on Intel CPUs (9cf9c18)
  • Improved convolution performance on Intel Data Center GPU Flex Series (8b08a07)

graph-v0.7.1

09 Nov 20:34

Pre-release

This is a patch release containing the following changes to graph-v0.7:

  • Fixed a build issue in compiler backend (70258d3)
  • Optimized zero-point folding (d6f12b5)
  • Fixed a primitive descriptor cache issue in reorder fusion (0887652)

v2.7.1

21 Oct 22:48

This is a patch release containing the following changes to v2.7:

  • Fixed performance regression for batch normalization primitive in TBB and threadpool configurations (cd953e4)
  • Improved grouped convolution performance on Xe Architecture GPUs (d7a781e, cb1f3fe, 4e84474, 7ba3c40)
  • Fixed runtime error in int8 reorder on Intel GPUs (53532a9)
  • Reverted MEMFD allocator in Xbyak to avoid segfaults in high load scenarios (3e29ae2)
  • Fixed a defect with incorrect caching of BRGEMM-based matmul primitive implementations with trivial dimensions (87cd979)
  • Improved depthwise convolution performance with per-tensor binary post-ops for Intel CPUs (f430a5a)
  • Extended threadpool API to manage maximum concurrency (8a1e959, 64e5594)
  • Fixed potential integer overflow in BRGEMM-based convolution implementation (25ccee3)
  • Fixed performance regression in concat primitive with any format on Intel CPUs (2a60ade, feb614d)
  • Fixed compile-time warnings in matmul_perf example (b5faa77)
  • Fixed 'insufficient registers in requested bundle' runtime error in convolution primitive on Xe Architecture GPUs (4c9d46a)
  • Addressed performance regression for certain convolution cases on Xe Architecture GPUs (f28b58a, 18764fb)
  • Added support for Intel DPC++/C++ Compiler 2023 (c3781c6, a1a8952, 9bc87e6, e3b1987)
  • Fixed int8 matmul and inner product performance regression on Xe Architecture GPUs (3693fbf, c8adc17)
  • Fixed accuracy issue for convolution, inner product and matmul primitives with tanh post-op on Xe Architecture GPUs (88b4e57, 83ce6d2, 6224dc6, 10f0d0a)
  • Suppressed spurious build warnings with GCC 11 (44255a8)

v2.6.3

21 Oct 19:37

This is a patch release containing the following changes to v2.6.2:

  • Fixed potential integer overflow in BRGEMM-based convolution implementation (deb5595)
  • Fixed a defect with incorrect caching of BRGEMM-based matmul primitive implementations with trivial dimensions (305bed5)
  • Extended benchdnn performance benchmarking capabilities on GPU with device-side performance measurement mode (ba86325)
  • Fixed segfault in pooling primitive on CPUs (689d874)

graph-v0.7

14 Oct 20:21

Pre-release

This is the Beta Update release for oneDNN Graph API based on oneDNN v2.7 release.

Functionality

  • Added operations Select, LogicalAnd, LogicalOr, LogicalXor, LogicalNot, Greater, GreaterEqual, Equal, NotEqual, Less, and LessEqual.
  • Added boolean data type to support logical operations.
  • Added support for passing compilation context to the compile API. This feature allows passing additional information, like tensor shape context, for the backend to generate better kernel code.
  • Introduced convolution block fusion via oneDNN Graph Compiler.
  • Experimental: Introduced dynamic shapes support for multi-level perceptron (MLP) block via oneDNN Graph Compiler.

Known Issues and Limitations

  • The weight’s opaque layout can be queried only from a compiled partition, which requires input tensor shapes to be known at compilation time.
  • MHA and MLP fusion are not activated on machines without Intel AVX-512 support.

Thanks to the Contributors

This release contains contributions from the project core teams as well as Jiong Gong, Chunyuan Wu, Sanchit Jain, Yiqiang Li, Yunfei Mao, Kiefer Kuah and others.