67 changes: 66 additions & 1 deletion programming_examples/README.md
@@ -1,10 +1,75 @@
<!-- This file is auto-generated by generate_readme.py. Do not edit manually. -->

# MLIR-AIR Programming Examples

These programming examples demonstrate how to leverage the AIR design flow with mlir-air Python bindings and the mlir-air intermediate representation (IR) to build applications targeting AI Engines on AMD NPUs.

## Operator Dashboard

See the **[Operator Dashboard](https://xilinx.github.io/mlir-air/programming_examples/)** for the full table of supported operators with NPU1/NPU2 status indicators. The dashboard is auto-generated from LIT test files and published to GitHub Pages on every push to `main`.

| Category | Operation | Datatype(s) | NPU1 | NPU2 | Design Example |
|:---------|:----------|:------------|:----:|:----:|:---------------|
| Linear Algebra | [Matrix Multiplication](matrix_multiplication/) | bf16, i16, i8 | 🟢 | 🟢 | [matrix_multiplication/](matrix_multiplication/) |
| Linear Algebra | [Vector-Matrix Multiplication](vector_matrix_multiplication/) | bf16 | 🟢 | 🟢 | [vector_matrix_multiplication/](vector_matrix_multiplication/) |
| Linear Algebra | [Matrix-Vector Multiplication](matrix_vector_multiplication/bf16/) | bf16 | ⚪ | 🟢 | [matrix_vector_multiplication/bf16/](matrix_vector_multiplication/bf16/) |
| Linear Algebra | [AXPY](axpy/) | bf16 | 🟢 | 🟢 | [axpy/](axpy/) |
| Element-wise | [Element-wise Add](eltwise_add/) | f32 | 🟢 | 🟢 | [eltwise_add/](eltwise_add/) |
| Element-wise | [Element-wise Add (with L2)](eltwise_add_with_l2/) | f32 | 🟢 | 🟢 | [eltwise_add_with_l2/](eltwise_add_with_l2/) |
| Element-wise | [Element-wise Add (bf16)](primitives/vector_examples/vector_add/) | bf16 | 🟢 | 🟢 | [primitives/vector_examples/vector_add/](primitives/vector_examples/vector_add/) |
| Element-wise | [Element-wise Mul](primitives/vector_examples/vector_mul/) | bf16 | 🟢 | 🟢 | [primitives/vector_examples/vector_mul/](primitives/vector_examples/vector_mul/) |
| Activation/Math | [SiLU](silu/) | bf16 | ⚪ | 🟢 | [silu/](silu/) |
| Activation/Math | [GELU](gelu/) | bf16 | ⚪ | 🟢 | [gelu/](gelu/) |
| Activation/Math | [Softmax](softmax/) | bf16 | 🟢 | 🟢 | [softmax/](softmax/) |
| Activation/Math | [Sine / Cosine](sine_cosine/) | bf16 | 🟢 | ⚪ | [sine_cosine/](sine_cosine/) |
| Activation/Math | [RELU](relu/) | bf16 | 🟢 | 🟢 | [relu/](relu/) |
| Activation/Math | [Leaky RELU](leaky_relu/) | bf16 | 🟢 | 🟢 | [leaky_relu/](leaky_relu/) |
| Activation/Math | [Sigmoid](sigmoid/) | bf16 | ⚪ | 🟢 | [sigmoid/](sigmoid/) |
| Activation/Math | [Tanh](primitives/vector_examples/vector_tanh/) | bf16 | ⚪ | 🟢 | [primitives/vector_examples/vector_tanh/](primitives/vector_examples/vector_tanh/) |
| Normalization | [Layer Normalization](layer_norm/) | bf16 | ⚪ | 🟢 | [layer_norm/](layer_norm/) |
| Normalization | [RMS Normalization](rms_norm/) | bf16 | ⚪ | 🟢 | [rms_norm/](rms_norm/) |
| Normalization | [Weighted RMS Normalization](weighted_rms_norm/) | bf16 | ⚪ | 🟢 | [weighted_rms_norm/](weighted_rms_norm/) |
| Aggregation | [Reduction (Add)](primitives/vector_examples/vector_reduce_add/) | bf16 | 🟢 | 🟢 | [primitives/vector_examples/vector_reduce_add/](primitives/vector_examples/vector_reduce_add/) |
| Pooling | [MaxPool](primitives/vector_examples/vector_reduce_max/) | bf16 | 🟢 | 🟢 | [primitives/vector_examples/vector_reduce_max/](primitives/vector_examples/vector_reduce_max/) |
| Pooling | [AveragePool](average_pool/) | bf16 | 🟢 | 🟢 | [average_pool/](average_pool/) |
| LLM Kernels | [Multi-Head Attention (LLaMA2)](llama2_mha/) | bf16 | 🟢 | ⚪ | [llama2_mha/](llama2_mha/) |
| LLM Kernels | [SwiGLU](swiglu/) | bf16 | ⚪ | 🟢 | [swiglu/](swiglu/) |
| LLM Kernels | [FFN SwiGLU (Decode)](ffn_swiglu/decode/) | bf16 | ⚪ | 🟢 | [ffn_swiglu/decode/](ffn_swiglu/decode/) |
| LLM Kernels | [FFN SwiGLU (Prefill)](ffn_swiglu/prefill/) | bf16 | ⚪ | 🟢 | [ffn_swiglu/prefill/](ffn_swiglu/prefill/) |
| LLM Kernels | [RoPE (LUT-based)](rope_lut/) | bf16 | ⚪ | 🟢 | [rope_lut/](rope_lut/) |
| LLM Kernels | [RoPE (On-chip Sin/Cos)](rope_sincos/) | bf16 | 🟢 | 🟢 | [rope_sincos/](rope_sincos/) |
| Attention | [Flash Attention (Dataflow)](flash_attention/dataflow_based/) | bf16 | 🟢 | 🟢 | [flash_attention/dataflow_based/](flash_attention/dataflow_based/) |
| Attention | [Flash Attention (Kernel Fusion)](flash_attention/kernel_fusion_based/) | bf16 | ⚪ | 🟢 | [flash_attention/kernel_fusion_based/](flash_attention/kernel_fusion_based/) |
| Attention | [Grouped Query Attention (GQA)](flash_attention/kernel_fusion_based/) | bf16 | ⚪ | 🟢 | [flash_attention/kernel_fusion_based/](flash_attention/kernel_fusion_based/) |
| Data Movement | [Passthrough (DMA)](passthrough/passthrough_dma/) | u8, i8, i16, u16, f32, bf16 | 🟢 | 🟢 | [passthrough/passthrough_dma/](passthrough/passthrough_dma/) |
| Data Movement | [Passthrough (Channel)](passthrough/passthrough_channel/) | u8 | 🟢 | 🟢 | [passthrough/passthrough_channel/](passthrough/passthrough_channel/) |
| Data Movement | [Passthrough (Kernel)](passthrough/passthrough_kernel/) | u8 | 🟢 | 🟢 | [passthrough/passthrough_kernel/](passthrough/passthrough_kernel/) |
| Data Movement | [Shim DMA 2D](shim_dma_2d/) | i32 | 🟢 | 🟢 | [shim_dma_2d/](shim_dma_2d/) |
| Data Movement | [Data Transfer Transpose](data_transfer_transpose/) | u32 | 🟢 | 🟢 | [data_transfer_transpose/](data_transfer_transpose/) |
| Data Movement | [Transpose (bf16)](data_transfer_transpose/dma_bf16/) | bf16 | ⚪ | 🟢 | [data_transfer_transpose/dma_bf16/](data_transfer_transpose/dma_bf16/) |
| Data Movement | [Matrix Scalar Add](matrix_scalar_add/) | i32 | 🟢 | 🟢 | [matrix_scalar_add/](matrix_scalar_add/) |
| Communication | [Channel Examples](channel_examples/) | i32 | 🟢 | 🟢 | [channel_examples/](channel_examples/) |
| Communication | [Multi-Segment Examples](multi_segment/) | i32 | 🟡 | 🟡 | [multi_segment/](multi_segment/) |
| Communication | [Cascade Reduction](cascade_reduction/) | i32 | 🟢 | 🟢 | [cascade_reduction/](cascade_reduction/) |
| Memory | [Segment Alloc](segment_alloc/) | i32 | 🟢 | 🟢 | [segment_alloc/](segment_alloc/) |
| Spatial | [Segment Unroll](segment_unroll/) | i32 | 🟢 | 🟢 | [segment_unroll/](segment_unroll/) |
| Dataflow | [Herd Dataflow](herd_dataflow/) | bf16 | 🟢 | 🟢 | [herd_dataflow/](herd_dataflow/) |
| Control Flow | [Conditional Branching](conditional_branching/) | i32 | 🟢 | 🟢 | [conditional_branching/](conditional_branching/) |
| CNN | [2D Convolution](conv2d/) | i32 | 🟢 | 🟢 | [conv2d/](conv2d/) |
| CNN | [Bottleneck](bottleneck/) | bf16 | 🟢 | 🟢 | [bottleneck/](bottleneck/) |
| ML Pipeline | [MNIST-FC (Broadcast Bias Add)](mnist_fc/broadcast_bias_add/) | f32 | ⚪ | 🟢 | [mnist_fc/broadcast_bias_add/](mnist_fc/broadcast_bias_add/) |
| ML Pipeline | [MNIST-FC (ReLU 2D)](mnist_fc/relu/) | f32/bf16 | ⚪ | 🟢 | [mnist_fc/relu/](mnist_fc/relu/) |
| ML Pipeline | [MNIST-FC (Argmax)](mnist_fc/argmax/) | f32→i32 | ⚪ | 🟢 | [mnist_fc/argmax/](mnist_fc/argmax/) |
| ML Pipeline | [MNIST-FC (Integration)](mnist_fc/integration/) | f32 | ⚪ | 🟢 | [mnist_fc/integration/](mnist_fc/integration/) |
| Memory | [Shared L1 Buffer](shared_l1/) | bf16 | 🟢 | ⚪ | [shared_l1/](shared_l1/) |
| Quantization | [Dequant (AWQ int4→bf16)](dequant_awq/) | int4/bf16 | ⚪ | 🟢 | [dequant_awq/](dequant_awq/) |
| Primitives | [Scalar/Vector Operations](primitives/) | various | 🟢 | 🟢 | [primitives/](primitives/) |

### Status Legend

- 🟢 Supported and tested
- 🟡 Work in progress
- ⚪ Not yet supported

**NPU1** = AMD Ryzen AI (Phoenix, AIE2) &nbsp;&nbsp; **NPU2** = AMD Ryzen AI (Strix, AIE2P)

## Getting Started

24 changes: 24 additions & 0 deletions programming_examples/generate_readme.py
@@ -312,6 +312,30 @@
"path": "bottleneck",
"datatypes": "bf16",
},
{
"category": "ML Pipeline",
"name": "MNIST-FC (Broadcast Bias Add)",
"path": "mnist_fc/broadcast_bias_add",
"datatypes": "f32",
},
{
"category": "ML Pipeline",
"name": "MNIST-FC (ReLU 2D)",
"path": "mnist_fc/relu",
"datatypes": "f32/bf16",
},
{
"category": "ML Pipeline",
"name": "MNIST-FC (Argmax)",
"path": "mnist_fc/argmax",
"datatypes": "f32\u2192i32",
},
{
"category": "ML Pipeline",
"name": "MNIST-FC (Integration)",
"path": "mnist_fc/integration",
"datatypes": "f32",
},
{
"category": "Memory",
"name": "Shared L1 Buffer",
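The new dashboard entries are plain dicts consumed by `generate_readme.py`. As a minimal sketch of how such an entry could be turned into a table row (the `render_row` name and the status arguments are assumptions here; the real script derives NPU1/NPU2 status from LIT test files):

```python
def render_row(entry, status_npu1="🟢", status_npu2="🟢"):
    """Render one operator-entry dict as a markdown dashboard row.

    Illustrative only: the actual generator also resolves status
    indicators from LIT tests rather than taking them as arguments.
    """
    # Normalize the path so the link always ends with a trailing slash.
    path = entry["path"].rstrip("/") + "/"
    link = f"[{entry['name']}]({path})"
    return (
        f"| {entry['category']} | {link} | {entry['datatypes']} "
        f"| {status_npu1} | {status_npu2} | [{path}]({path}) |"
    )

entry = {
    "category": "ML Pipeline",
    "name": "MNIST-FC (Integration)",
    "path": "mnist_fc/integration",
    "datatypes": "f32",
}
print(render_row(entry, status_npu1="⚪"))
```

This mirrors the six-column row format used in the README table above (Category, Operation, Datatypes, NPU1, NPU2, Design Example).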
24 changes: 24 additions & 0 deletions programming_examples/mnist_fc/argmax/Makefile
@@ -0,0 +1,24 @@
# Copyright (C) 2026, Advanced Micro Devices, Inc.
# SPDX-License-Identifier: MIT
srcdir := $(shell dirname $(realpath $(firstword $(MAKEFILE_LIST))))

ifdef PEANO_INSTALL_DIR
BUILD_DIR := build_peano
else
BUILD_DIR := build_chess
endif

NE0 ?= 10
NE1 ?= 500

all: run

print:
	${powershell} python3 ${srcdir}/run.py --ne0 $(NE0) --ne1 $(NE1) -p

run:
	mkdir -p $(BUILD_DIR)
	cd $(BUILD_DIR) && PEANO_INSTALL_DIR=$(PEANO_INSTALL_DIR) ${powershell} python3 ${srcdir}/run.py --ne0 $(NE0) --ne1 $(NE1) -v

clean:
	rm -rf $(BUILD_DIR) __pycache__
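For reference, the argmax stage this Makefile drives maps f32 logits to an i32 class index (the README lists its datatypes as f32→i32). A hypothetical host-side reference, with shapes assumed from the `NE0`/`NE1` defaults rather than taken from the design source:

```python
import numpy as np

def argmax_ref(logits: np.ndarray) -> np.ndarray:
    """Reference argmax: index of the largest f32 logit per row, as i32.

    Assumption: logits arrive as a (batch, classes) array; the on-NPU
    kernel's exact tiling and layout are not shown here.
    """
    return np.argmax(logits, axis=-1).astype(np.int32)

logits = np.array([[0.1, 2.5, -1.0, 0.3]], dtype=np.float32)
print(argmax_ref(logits))  # → [1]
```

A reference like this is what a `run.py -v` verification step typically compares the device output against.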