Objective
Implement MLIR-native ukernels for RISC-V RVV targets in IREE. This follows the GPU backend's precedent of using declarative, tensor-based ukernels. This design enables high-performance, target-specific code generation while preserving IREE's ability to perform automatic producer/consumer fusion (e.g., bias addition and activation) via standard compiler passes.
I plan to implement the ukernels with the generic `vector` dialect rather than introduce a target-specific dialect, delegating instruction selection to the LLVM backend.
MVP Scope
The Minimum Viable Product will be a single, pure matmul ukernel for the f16 * f16 -> f32 mixed-precision case, targeting a vlen=128 RVV configuration with the zfh extension. This single ukernel will be used to demonstrate automatic fusion for a matmul+bias pattern.
Proposed Engineering Plan
1. Ensure Robust Lowering from vector Dialect to RVV in LLVM
We should rely on the vector dialect. The responsibility for generating optimal RVV instructions lies with the LLVM backend.
- Action: Verify that `vector` dialect operations are correctly lowered to efficient RVV instructions by the MLIR-to-LLVM-IR conversion and LLVM's instruction selector. This includes:
  - `vector.fma %a, %b, %c` -> `vfmacc.vf` (for vector-scalar)
  - `vector.load`, `vector.store` -> `vle16.v`, `vse32.v`
  - `vector.broadcast` -> `vmv.v.x`
- Upstream Contribution: If gaps are found, the work involves improving the lowering patterns in upstream MLIR/LLVM, not creating a new dialect in IREE.
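To make the verification concrete, a minimal probe function can be run through `mlir-translate` and `llc` and the resulting assembly inspected for the expected instructions. This is a sketch; the function name and fixed shapes are illustrative only:

```mlir
// Probe: a vector-scalar FMA in the generic vector dialect. When lowered and
// compiled with llc -mtriple=riscv64 -mattr=+v,+zfh, the fma should select to
// a vfmacc-style instruction and the broadcast to a vector splat (vfmv.v.f).
func.func @fma_probe(%s: f32, %v: vector<8xf32>, %acc: vector<8xf32>) -> vector<8xf32> {
  %b = vector.broadcast %s : f32 to vector<8xf32>
  %r = vector.fma %b, %v, %acc : vector<8xf32>
  return %r : vector<8xf32>
}
```

A small battery of such probes, one per op-to-instruction mapping above, gives a checkable definition of "robust lowering" before any IREE-side work starts.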
2. Implement IREE Infrastructure for CPU Ukernel Declaration
To match the GPU backend's capabilities, I need to add declarative attributes for CPU targets:
- Define a CPU Layout Attribute: Create an `#iree_cpu.data_tiled_layout` MLIR attribute to describe the optimal data tiling for a CPU ukernel. This tells IREE's `MaterializeDeviceEncoding` pass how to prepare the data.

```mlir
// Example for our MVP ukernel
#rvv_mvp_layout = #iree_cpu.data_tiled_layout<
  tile_sizes = [8, 8, 1],  // M0=8, N0=8, K0=1
  inner_blocking_order = [0, 1, 2]
>
```
- Define a CPU Feature Matcher: Implement a `cpu_feature` constraint for the ukernel matching system. This allows a ukernel to be selected only when the target CPU has the required features (e.g., `+v`, `+zfh`).
3. Define and Implement the RVV Matmul Ukernel in MLIR
The ukernel is a self-contained `.mlir` file that serves as both the specification and the implementation of the kernel.

- Create the `iree_uk_riscv_dt_matmul_f16f16f32.mlir` file with the following structure:

```mlir
// Ukernel for f16*f16->f32 on RVV. Bias is handled by compiler fusion.

// --- Declarative Metadata ---
#rvv_f16_layout = #iree_cpu.data_tiled_layout<
    tile_sizes = [8, 8, 1],
    inner_blocking_order = [0, 1, 2]>

iree_codegen.ukernel.generic @riscv_matmul_8x8x1_f16f16f32
    match = "linalg.matmul"
    with operand_types<tensor<...xf16>, tensor<...xf16>, tensor<...xf32>>
    and cpu_feature<"+v", "+zfh">
    provides layout<#rvv_f16_layout> {

  // --- Ukernel Implementation (using generic vector ops) ---
  util.func public @main(
      %lhs: tensor<8x1xf16>, %rhs: tensor<1x8xf16>,
      %init: tensor<8x8xf32>) -> tensor<8x8xf32> {
    %c0 = arith.constant 0 : index
    %c1 = arith.constant 1 : index
    // ... (%c2 through %c6) ...
    %c7 = arith.constant 7 : index
    %pad_f16 = arith.constant 0.0 : f16
    %pad_f32 = arith.constant 0.0 : f32

    // Load the RHS vector (reused for all 8 rows of LHS) and widen it to f32
    // so the FMA type-checks against the f32 accumulators. LLVM can still
    // select widening FMAs (vfwmacc-style) where profitable.
    %rhs_f16 = vector.transfer_read %rhs[%c0, %c0], %pad_f16
        : tensor<1x8xf16>, vector<8xf16>
    %rhs_vec = arith.extf %rhs_f16 : vector<8xf16> to vector<8xf32>

    // FUSION POINT: Load initial accumulators. The compiler will feed the
    // bias tensor into the '%init' argument.
    %acc0 = vector.transfer_read %init[%c0, %c0], %pad_f32 : tensor<8x8xf32>, vector<8xf32>
    %acc1 = vector.transfer_read %init[%c1, %c0], %pad_f32 : tensor<8x8xf32>, vector<8xf32>
    // ... (acc2 through acc6) ...
    %acc7 = vector.transfer_read %init[%c7, %c0], %pad_f32 : tensor<8x8xf32>, vector<8xf32>

    // Core matmul logic using generic vector FMA: extract one LHS scalar per
    // row, widen and broadcast it, then accumulate against the shared RHS.
    %lhs_s0 = tensor.extract %lhs[%c0, %c0] : tensor<8x1xf16>
    %lhs_w0 = arith.extf %lhs_s0 : f16 to f32
    %b_lhs_s0 = vector.broadcast %lhs_w0 : f32 to vector<8xf32>
    %res0 = vector.fma %b_lhs_s0, %rhs_vec, %acc0 : vector<8xf32>

    %lhs_s1 = tensor.extract %lhs[%c1, %c0] : tensor<8x1xf16>
    %lhs_w1 = arith.extf %lhs_s1 : f16 to f32
    %b_lhs_s1 = vector.broadcast %lhs_w1 : f32 to vector<8xf32>
    %res1 = vector.fma %b_lhs_s1, %rhs_vec, %acc1 : vector<8xf32>

    // ... (repeated for res2 through res6) ...

    %lhs_s7 = tensor.extract %lhs[%c7, %c0] : tensor<8x1xf16>
    %lhs_w7 = arith.extf %lhs_s7 : f16 to f32
    %b_lhs_s7 = vector.broadcast %lhs_w7 : f32 to vector<8xf32>
    %res7 = vector.fma %b_lhs_s7, %rhs_vec, %acc7 : vector<8xf32>

    // Assemble the final result tensor row by row.
    %out0 = vector.transfer_write %res0, %init[%c0, %c0] : vector<8xf32>, tensor<8x8xf32>
    %out1 = vector.transfer_write %res1, %out0[%c1, %c0] : vector<8xf32>, tensor<8x8xf32>
    // ... (repeated for out2 through out6) ...
    %out7 = vector.transfer_write %res7, %out6[%c7, %c0] : vector<8xf32>, tensor<8x8xf32>
    util.return %out7 : tensor<8x8xf32>
  }
}
```

The body sticks to upstream tensor-level ops (`tensor.extract`, `vector.transfer_read`/`transfer_write`) so that standard bufferization turns them into plain `vector.load`/`vector.store` on memrefs, which is what the LLVM backend then maps to `vle16.v`/`vse32.v`.
4. Integrate Ukernel into IREE's Build and Codegen Pipeline
- Place the ukernel `.mlir` file in a directory like `iree/compiler/Codegen/LLVMCPU/builtins/mlir_ukernel/riscv/`.
- Add a PDL pattern in `iree/compiler/Codegen/LLVMCPU/builtins/pdl/` to match the `linalg.matmul` and rewrite it using the `iree_codegen.match.ukernel` rewriter.
- Ensure the `LowerTensorUKernelsPass` is part of the LLVMCPU codegen pipeline and correctly finds and inlines the ukernel's `@main` function.
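The matching step could look roughly like the following PDL pattern. This is a sketch: the `iree_codegen.match.ukernel` rewriter name comes from the plan above, everything else (pattern name, benefit) is illustrative:

```mlir
// Match any linalg.matmul and hand it to the external ukernel rewriter, which
// is expected to check operand types, cpu_features, and layout before
// committing to the replacement.
pdl.pattern @match_rvv_matmul_ukernel : benefit(1) {
  %lhs = pdl.operand
  %rhs = pdl.operand
  %init = pdl.operand
  %result_types = pdl.types
  %matmul = pdl.operation "linalg.matmul"(%lhs, %rhs, %init : !pdl.value, !pdl.value, !pdl.value)
      -> (%result_types : !pdl.range<type>)
  pdl.rewrite %matmul with "iree_codegen.match.ukernel"(%matmul : !pdl.operation)
}
```

Keeping the type and feature checks inside the native rewriter (rather than encoding them as PDL constraints) matches how the ukernel metadata in step 3 is declared.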
5. Define a Comprehensive Test Suite
- Integration Test: Add an IREE test case that compiles a `linalg.matmul` whose accumulator is initialized from a bias (a `linalg.broadcast` of the 1-D bias into the init tensor; `linalg.fill` only accepts scalars). The test should target RVV and check the IR after the `LowerTensorUKernelsPass` to verify that the ukernel was inlined and that the bias producer correctly feeds its output into the ukernel's `init` argument.

```mlir
// test/iree/compiler/Codegen/LLVMCPU/riscv_matmul_bias_fusion.mlir
// RUN: iree-compile ... --iree-llvmcpu-target-triple=riscv64... --iree-llvmcpu-target-cpu-features="+v,+zfh" ...
// CHECK: func.func private @riscv_matmul_8x8x1_f16f16f32_dispatch
// CHECK: vector.load %{{.*}} : memref<...>, vector<8xf32>
func.func @test_matmul_bias(%lhs: tensor<512x256xf16>, %rhs: tensor<256x1024xf16>,
                            %bias: tensor<1024xf32>) -> tensor<512x1024xf32> {
  %empty = tensor.empty() : tensor<512x1024xf32>
  %init = linalg.broadcast ins(%bias : tensor<1024xf32>)
      outs(%empty : tensor<512x1024xf32>) dimensions = [0]
  %res = linalg.matmul ins(%lhs, %rhs : tensor<512x256xf16>, tensor<256x1024xf16>)
      outs(%init : tensor<512x1024xf32>) -> tensor<512x1024xf32>
  return %res : tensor<512x1024xf32>
}
```
- End-to-End Test: Create an e2e test that runs the compiled module on an emulator (e.g., QEMU) or hardware and verifies the numerical correctness of the fused `matmul+bias` operation against a reference implementation.
What component(s) does this issue relate to?
MLIR
Additional context
No response