These programming examples are a starting point for building commonly used compute kernels, covering both single-core and multi-core data processing pipelines. They show how designs can be described in Python and lowered through the mlir-aie tool flow to an executable that runs on the NPU. Passthrough Kernel and Vector Scalar Mul are good designs to start with. See section 3 of the programming guide for a more detailed guide to developing designs.
- Passthrough DMAs - This design demonstrates data movement that implements a memcpy operation with object FIFOs, using only DMAs and without involving the AIE core.
- Passthrough Kernel - This design demonstrates a simple AIE implementation of a vectorized memcpy on a vector of integers, involving AIE core kernel programming.
- DMA Transpose - Transposes a matrix with the Shim DMA using `npu_dma_memcpy_nd`.
- Vector Scalar Add - Single tile performs a very simple `+` operation where the kernel loads data from local memory, increments the value by `1` and stores it back.
- Vector Scalar Mul - Single tile performs `vector * scalar` of size `4096`. The kernel does a `1024` vector multiply and is invoked multiple times to complete the full `vector * scalar` compute.
- Vector Vector Add - Single tile performs `vector + vector` of size `1024`.
- Vector Vector Modulo - Single tile performs `vector % vector` of size `1024`.
- Vector Vector Multiply - Single tile performs `vector * vector` of size `1024`.
- Vector Reduce Add - Single tile performs a reduction of a vector to return the `sum` of the elements.
- Vector Reduce Max - Single tile performs a reduction of a vector to return the `max` of the elements.
- Vector Reduce Min - Single tile performs a reduction of a vector to return the `min` of the elements.
- Vector Exp - A simple element-wise exponent function, using the look up table capabilities of the AI Engine.
- Matrix Scalar Add - Single tile performs `matrix + scalar` with matrix size of `16x8`.
- Matrix Multiplication - This directory contains multiple designs spanning single-core and multi-core (whole array) matrix-matrix multiplication, as well as matrix-vector multiplication. It also contains sweep infrastructure for benchmarking.
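
The tiling described for Vector Scalar Mul (a `4096`-element vector processed by a kernel that handles `1024` elements per invocation) can be modeled on the host as a plain Python sketch. This is only a reference model for checking results, not mlir-aie code: the names `scale_tile` and `vector_scalar_mul` are hypothetical, and the sketch assumes the kernel is invoked once per consecutive `1024`-element chunk.

```python
# Host-side reference model only. It does not use mlir-aie and does not run
# on the NPU. Function names here are hypothetical, chosen for illustration.

VECTOR_SIZE = 4096  # full vector size, from the Vector Scalar Mul description
TILE_SIZE = 1024    # elements per kernel invocation, from the same description


def scale_tile(tile, factor):
    """Stand-in for the per-tile kernel: multiply every element by factor."""
    return [x * factor for x in tile]


def vector_scalar_mul(vec, factor, tile_size=TILE_SIZE):
    """Apply scale_tile to consecutive chunks of vec.

    Returns the scaled vector and the number of kernel invocations used.
    Assumes (as a simplification) that len(vec) is a multiple of tile_size.
    """
    if len(vec) % tile_size != 0:
        raise ValueError("vector length must be a multiple of the tile size")
    out = []
    invocations = 0
    for start in range(0, len(vec), tile_size):
        out.extend(scale_tile(vec[start:start + tile_size], factor))
        invocations += 1
    return out, invocations


if __name__ == "__main__":
    data = list(range(VECTOR_SIZE))
    result, calls = vector_scalar_mul(data, 3)
    # The tiled result must match a direct element-wise multiply.
    assert result == [3 * x for x in data]
    print(f"kernel invocations: {calls}")
```

With the sizes listed above, the model invokes the stand-in kernel `4096 / 1024 = 4` times. A model like this can serve as a golden reference when comparing output buffers from a design, though the actual examples ship their own test harnesses.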