v1.2
Performance optimizations
- Improved 1D backward convolution performance on CPU.
- Improved int8 inference performance on systems without Intel AVX-512 support.
- Improved int8 inference performance for 3D spatial data on CPU.
- Improved performance of convolution and other primitives on GPU.
New functionality
- Introduced a general-purpose matrix-matrix multiplication (matmul) primitive. The functionality supports fp32, bfloat16, and int8 data types with asymmetric quantization.
- Introduced logsoftmax and resampling primitives.
- Introduced support for the clip and log algorithms in the elementwise primitive.
- Introduced int8 and bfloat16 data type support for the binary primitive (CPU only).
- Introduced fully functional support for the int8 (inference) and bfloat16 (inference and training) data types on GPU. This functionality is not intended to deliver performance gains over f32 on current Intel integrated graphics, but rather to enable conformance experiments.
Usability improvements
- Added JIT code annotations for the Linux perf profiler.
- Added a mechanism to control CPU dispatcher behavior at runtime via the DNNL_MAX_CPU_ISA environment variable or a function call.
- Extended DNNL_VERBOSE output with more information about runtimes and devices.
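The two runtime controls above can be combined from the command line; `./app` below stands in for any DNNL-based application.

```shell
# Cap the CPU dispatcher at AVX2 code paths for this run
# (the same effect is available programmatically via dnnl::set_max_cpu_isa):
DNNL_MAX_CPU_ISA=AVX2 ./app

# Print primitive creation and execution traces, including
# runtime and device information:
DNNL_VERBOSE=1 ./app
```

Restricting the ISA is useful for debugging dispatch issues or comparing code paths; the verbose log is the usual first step when investigating which implementation a primitive picked.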
Thanks to the contributors
This release contains contributions from the project core team as well as Aaron Johnson @aaronjohnson, Attila T. Áfra @atafra, Ben Fitch, Ilya Taraban @itaraban, Michał Gallus @Sand3r-, Peter Caday @petercad, Qiyou Chen @chenqy4933, and Jun Luan @junluan. We would also like to thank everyone who asked questions and reported issues.