
[MLIR] [uKernel] Implement MLIR-based Ukernels for RISC-V Vector (RVV) Extension #22720

@copparihollmann

Description

Objective

Implement MLIR-native ukernels for RISC-V RVV targets in IREE. This follows the GPU backend's precedent of using declarative, tensor-based ukernels. This design enables high-performance, target-specific code generation while preserving IREE's ability to perform automatic producer/consumer fusion (e.g., bias addition and activation) via standard compiler passes.

I plan to rely on the generic vector dialect for the implementation, rather than introducing a target-specific dialect, delegating instruction selection to the LLVM backend.

MVP Scope

The Minimum Viable Product will be a single, pure matmul ukernel for the f16 * f16 -> f32 mixed-precision case, targeting a vlen=128 RVV configuration with the zfh extension. This single ukernel will be used to demonstrate automatic fusion for a matmul+bias pattern.

Proposed Engineering Plan

1. Ensure Robust Lowering from vector Dialect to RVV in LLVM

We should rely on the vector dialect. The responsibility for generating optimal RVV instructions lies with the LLVM backend.

  • Action: Verify that vector dialect operations are correctly lowered to efficient RVV instructions by the MLIR-to-LLVM-IR conversion and LLVM's instruction selector. This includes:
    • vector.fma %a, %b, %c -> vfmacc.vv (or vfmacc.vf when one operand is a scalar broadcast)
    • vector.load, vector.store -> vle16.v, vse32.v (element width chosen per operand type)
    • vector.broadcast -> vmv.v.x (integer) / vfmv.v.f (floating point)
  • Upstream Contribution: If gaps are found, the work involves improving the lowering patterns in upstream MLIR/LLVM, not creating a new dialect in IREE.
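
  A minimal smoke test for this verification step could look like the following. The pass-pipeline flags in the comments are illustrative and should be checked against the current mlir-opt/llc options; the expectation is that the broadcast + FMA selects vfmacc (ideally the .vf form) rather than scalarized code.

    // Sketch of a lowering smoke test (pipeline flags illustrative):
    //   mlir-opt --convert-vector-to-llvm --convert-func-to-llvm input.mlir \
    //     | mlir-translate --mlir-to-llvmir \
    //     | llc -mtriple=riscv64 -mattr=+v,+zfh -o -
    func.func @fma_vf(%s: f32, %v: vector<8xf32>, %acc: vector<8xf32>) -> vector<8xf32> {
      %b = vector.broadcast %s : f32 to vector<8xf32>
      %r = vector.fma %b, %v, %acc : vector<8xf32>
      return %r : vector<8xf32>
    }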

2. Implement IREE Infrastructure for CPU Ukernel Declaration

To match the GPU backend's capabilities, I need to add declarative attributes for CPU targets:

  1. Define a CPU Layout Attribute: Create an #iree_cpu.data_tiled_layout MLIR attribute to describe the optimal data tiling for a CPU ukernel. This tells IREE's MaterializeDeviceEncoding pass how to prepare the data.
    // Example for our MVP ukernel
    #rvv_mvp_layout = #iree_cpu.data_tiled_layout<
      tile_sizes = [8, 8, 1], // M0=8, N0=8, K0=1
      inner_blocking_order = [0, 1, 2]
    >
  2. Define a CPU Feature Matcher: Implement a cpu_feature constraint for the ukernel matching system. This allows a ukernel to be selected only when the target CPU has the required features (e.g., +v, +zfh).
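
  To make the layout attribute concrete: for the MVP's tile_sizes = [8, 8, 1], materialization would pack the LHS of a 512x256 matmul roughly as below. linalg.pack is the existing upstream op; its connection to #iree_cpu.data_tiled_layout here is part of this proposal, not existing API.

    // Illustrative: packing a 512x256 LHS into 8x1 inner tiles (M0=8, K0=1).
    %packed = linalg.pack %lhs
        inner_dims_pos = [0, 1] inner_tiles = [8, 1]
        into %dest : tensor<512x256xf16> -> tensor<64x256x8x1xf16>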

3. Define and Implement the RVV Matmul Ukernel in MLIR

The ukernel is a self-contained .mlir file that serves as both the specification and the implementation of the kernel.

  • Create the iree_uk_riscv_dt_matmul_f16f16f32.mlir file with the following structure:

    // Ukernel for f16*f16->f32 on RVV. Bias is handled by compiler fusion.
    
    // --- Declarative Metadata ---
    #rvv_f16_layout = #iree_cpu.data_tiled_layout<
      tile_sizes = [8, 8, 1], inner_blocking_order = [0, 1, 2]>
    
    iree_codegen.ukernel.generic @riscv_matmul_8x8x1_f16f16f32
        match="linalg.matmul"
        with
          operand_types<tensor<...xf16>, tensor<...xf16>, tensor<...xf32>> and
          cpu_feature<"+v", "+zfh">
        provides layout<#rvv_f16_layout>
    {
      // --- Ukernel Implementation (using generic vector ops) ---
      util.func public @main(
          %lhs: tensor<8x1xf16>, %rhs: tensor<1x8xf16>, %init: tensor<8x8xf32>) -> tensor<8x8xf32>
      {
        %c0 = arith.constant 0 : index
        %c1 = arith.constant 1 : index
        // ... (%c2 through %c6) ...
        %c7 = arith.constant 7 : index
        %pad_f16 = arith.constant 0.0 : f16
        %pad_f32 = arith.constant 0.0 : f32

        // Load RHS vector (reused for all 8 rows of LHS) and widen it to f32,
        // since vector.fma requires matching operand and accumulator types.
        // LLVM is free to select widening FMAs (vfwmacc) for the extf+fma pair.
        %rhs_f16 = vector.transfer_read %rhs[%c0, %c0], %pad_f16 : tensor<1x8xf16>, vector<8xf16>
        %rhs_vec = arith.extf %rhs_f16 : vector<8xf16> to vector<8xf32>

        // FUSION POINT: Load initial accumulators. The compiler will feed the
        // bias tensor into the '%init' argument.
        %acc0 = vector.transfer_read %init[%c0, %c0], %pad_f32 : tensor<8x8xf32>, vector<8xf32>
        %acc1 = vector.transfer_read %init[%c1, %c0], %pad_f32 : tensor<8x8xf32>, vector<8xf32>
        // ... (repeated for acc2 through acc6) ...
        %acc7 = vector.transfer_read %init[%c7, %c0], %pad_f32 : tensor<8x8xf32>, vector<8xf32>

        // Core matmul logic: extract one LHS scalar, widen it, broadcast it,
        // and issue a generic vector FMA.
        %lhs_s0 = tensor.extract %lhs[%c0, %c0] : tensor<8x1xf16>
        %lhs_w0 = arith.extf %lhs_s0 : f16 to f32
        %b_lhs_s0 = vector.broadcast %lhs_w0 : f32 to vector<8xf32>
        %res0 = vector.fma %b_lhs_s0, %rhs_vec, %acc0 : vector<8xf32>

        %lhs_s1 = tensor.extract %lhs[%c1, %c0] : tensor<8x1xf16>
        %lhs_w1 = arith.extf %lhs_s1 : f16 to f32
        %b_lhs_s1 = vector.broadcast %lhs_w1 : f32 to vector<8xf32>
        %res1 = vector.fma %b_lhs_s1, %rhs_vec, %acc1 : vector<8xf32>

        // ... (repeated for res2 through res6) ...

        %lhs_s7 = tensor.extract %lhs[%c7, %c0] : tensor<8x1xf16>
        %lhs_w7 = arith.extf %lhs_s7 : f16 to f32
        %b_lhs_s7 = vector.broadcast %lhs_w7 : f32 to vector<8xf32>
        %res7 = vector.fma %b_lhs_s7, %rhs_vec, %acc7 : vector<8xf32>

        // Assemble the final result tensor row by row.
        %out0 = vector.transfer_write %res0, %init[%c0, %c0] : vector<8xf32>, tensor<8x8xf32>
        %out1 = vector.transfer_write %res1, %out0[%c1, %c0] : vector<8xf32>, tensor<8x8xf32>
        // ... (repeated for out2 through out6) ...
        %out7 = vector.transfer_write %res7, %out6[%c7, %c0] : vector<8xf32>, tensor<8x8xf32>

        util.return %out7 : tensor<8x8xf32>
      }
    }

4. Integrate Ukernel into IREE's Build and Codegen Pipeline

  1. Place the ukernel .mlir file in a directory like iree/compiler/Codegen/LLVMCPU/builtins/mlir_ukernel/riscv/.
  2. Add a PDL pattern in iree/compiler/Codegen/LLVMCPU/builtins/pdl/ to match the linalg.matmul and rewrite it using the iree_codegen.match.ukernel rewriter.
  3. Ensure the LowerTensorUKernelsPass is part of the LLVMCPU codegen pipeline and correctly finds and inlines the ukernel's @main function.
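
  The PDL pattern in step 2 above could be sketched as follows. The rewriter name iree_codegen.match.ukernel is taken from this plan; the operand and constraint details are illustrative and would need to be refined to check element types and CPU features.

    pdl.pattern @match_rvv_matmul : benefit(1) {
      %lhs = pdl.operand
      %rhs = pdl.operand
      %init = pdl.operand
      %type = pdl.type
      %matmul = pdl.operation "linalg.matmul" (%lhs, %rhs, %init : !pdl.value, !pdl.value, !pdl.value) -> (%type : !pdl.type)
      pdl.rewrite %matmul with "iree_codegen.match.ukernel"(%matmul : !pdl.operation)
    }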

5. Define a Comprehensive Test Suite

  1. Integration Test: Add an IREE test case that compiles a linalg.matmul whose accumulator is initialized from the bias (e.g. via linalg.broadcast; note that linalg.fill only fills with a scalar, so a per-column bias requires a broadcast). The test should target RVV and check the IR after the LowerTensorUKernelsPass to verify that the ukernel was inlined and that the bias tensor correctly feeds the ukernel's init argument.
    // test/iree/compiler/Codegen/LLVMCPU/riscv_matmul_bias_fusion.mlir
    // RUN: iree-compile ... --iree-llvmcpu-target-triple=riscv64... --iree-llvmcpu-target-cpu-features="+v,+zfh" ...
    // CHECK: func.func private @riscv_matmul_8x8x1_f16f16f32_dispatch
    // CHECK: vector.load %{{.*}} : memref<...>, vector<8xf32>
    
    func.func @test_matmul_bias(%lhs: tensor<512x256xf16>, %rhs: tensor<256x1024xf16>, %bias: tensor<1024xf32>) -> tensor<512x1024xf32> {
      %empty = tensor.empty() : tensor<512x1024xf32>
      %init = linalg.broadcast ins(%bias : tensor<1024xf32>)
                outs(%empty : tensor<512x1024xf32>) dimensions = [0]
      %res = linalg.matmul ins(%lhs, %rhs : tensor<512x256xf16>, tensor<256x1024xf16>)
                outs(%init : tensor<512x1024xf32>) -> tensor<512x1024xf32>
      return %res : tensor<512x1024xf32>
    }
  2. End-to-End Test: Create an e2e test that runs the compiled module on an emulator (e.g., QEMU) or hardware and verifies the numerical correctness of the fused matmul+bias operation against a reference implementation.
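
  The e2e test's RUN lines could follow the same lit conventions as the integration test. The QEMU invocation and elided flags below are illustrative and need to be checked against the emulator's and IREE's current option names; only flags already used elsewhere in this plan are spelled out.

    // RUN: iree-compile %s --iree-hal-target-backends=llvm-cpu \
    // RUN:   --iree-llvmcpu-target-triple=riscv64... \
    // RUN:   --iree-llvmcpu-target-cpu-features="+v,+zfh" -o %t.vmfb
    // RUN: qemu-riscv64 ... iree-run-module --module=%t.vmfb \
    // RUN:   --function=test_matmul_bias --input=... | FileCheck %s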

What component(s) does this issue relate to?

MLIR

Additional context

No response
