Objective
Implement MLIR-native ukernels for RISC-V RVV targets in IREE. This follows the GPU backend's precedent of using declarative, tensor-based ukernels. This design enables high-performance, target-specific code generation while preserving IREE's ability to perform automatic producer/consumer fusion (e.g., bias addition and activation) via standard compiler passes.
I plan to implement the ukernels with the generic `vector` dialect rather than introduce a target-specific dialect, delegating instruction selection to the LLVM backend.
MVP Scope
The Minimum Viable Product will be a single, pure matmul ukernel for the f16 * f16 -> f32 mixed-precision case, targeting a vlen=128 RVV configuration with the zfh extension. This single ukernel will be used to demonstrate automatic fusion for a matmul+bias pattern.
Proposed Engineering Plan
1. Ensure Robust Lowering from vector Dialect to RVV in LLVM
We should rely on the vector dialect. The responsibility for generating optimal RVV instructions lies with the LLVM backend.
- Action: Verify that `vector` dialect operations are correctly lowered to efficient RVV instructions by the MLIR-to-LLVM-IR conversion and LLVM's instruction selector. This includes:
  - `vector.fma %a, %b, %c` -> `vfmacc.vf` (for vector-scalar)
  - `vector.load`, `vector.store` -> `vle16.v`, `vse32.v`
  - `vector.broadcast` -> `vmv.v.x`
- Upstream Contribution: If gaps are found, the work involves improving the lowering patterns in upstream MLIR/LLVM, not creating a new dialect in IREE.
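To make the verification concrete, a minimal probe function can be run through `mlir-translate` and `llc` and the resulting assembly inspected for the expected instructions. This is a sketch; the function name and fixed shapes are illustrative only:

```mlir
// Probe: a vector-scalar FMA in the generic vector dialect. When lowered and
// compiled with llc -mtriple=riscv64 -mattr=+v,+zfh, the fma should select to
// a vfmacc-style instruction and the broadcast to a vector splat (vfmv.v.f).
func.func @fma_probe(%s: f32, %v: vector<8xf32>, %acc: vector<8xf32>) -> vector<8xf32> {
  %b = vector.broadcast %s : f32 to vector<8xf32>
  %r = vector.fma %b, %v, %acc : vector<8xf32>
  return %r : vector<8xf32>
}
```

A small battery of such probes, one per op-to-instruction mapping above, gives a checkable definition of "robust lowering" before any IREE-side work starts.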
2. Implement IREE Infrastructure for CPU Ukernel Declaration
To match the GPU backend's capabilities, I need to add declarative attributes for CPU targets:
- Define a CPU Layout Attribute: Create an `#iree_cpu.data_tiled_layout` MLIR attribute to describe the optimal data tiling for a CPU ukernel. This tells IREE's `MaterializeDeviceEncoding` pass how to prepare the data.

```mlir
// Example for our MVP ukernel
#rvv_mvp_layout = #iree_cpu.data_tiled_layout<
  tile_sizes = [8, 8, 1],  // M0=8, N0=8, K0=1
  inner_blocking_order = [0, 1, 2]
>
```
- Define a CPU Feature Matcher: Implement a `cpu_feature` constraint for the ukernel matching system. This allows a ukernel to be selected only when the target CPU has the required features (e.g., `+v`, `+zfh`).
3. Define and Implement the RVV Matmul Ukernel in MLIR
The ukernel is a self-contained `.mlir` file that serves as both the specification and the implementation of the kernel.

- Create the `iree_uk_riscv_dt_matmul_f16f16f32.mlir` file with the following structure:

```mlir
// Ukernel for f16*f16->f32 on RVV. Bias is handled by compiler fusion.

// --- Declarative Metadata ---
#rvv_f16_layout = #iree_cpu.data_tiled_layout<
    tile_sizes = [8, 8, 1],
    inner_blocking_order = [0, 1, 2]>

iree_codegen.ukernel.generic @riscv_matmul_8x8x1_f16f16f32
    match = "linalg.matmul"
    with operand_types<tensor<...xf16>, tensor<...xf16>, tensor<...xf32>>
    and cpu_feature<"+v", "+zfh">
    provides layout<#rvv_f16_layout> {

  // --- Ukernel Implementation (using generic vector ops) ---
  util.func public @main(
      %lhs: tensor<8x1xf16>, %rhs: tensor<1x8xf16>,
      %init: tensor<8x8xf32>) -> tensor<8x8xf32> {
    %c0 = arith.constant 0 : index
    %c1 = arith.constant 1 : index
    // ... (%c2 through %c6) ...
    %c7 = arith.constant 7 : index
    %pad_f16 = arith.constant 0.0 : f16
    %pad_f32 = arith.constant 0.0 : f32

    // Load the RHS vector (reused for all 8 rows of LHS) and widen it to f32
    // so the FMA type-checks against the f32 accumulators. LLVM can still
    // select widening FMAs (vfwmacc-style) where profitable.
    %rhs_f16 = vector.transfer_read %rhs[%c0, %c0], %pad_f16
        : tensor<1x8xf16>, vector<8xf16>
    %rhs_vec = arith.extf %rhs_f16 : vector<8xf16> to vector<8xf32>

    // FUSION POINT: Load initial accumulators. The compiler will feed the
    // bias tensor into the '%init' argument.
    %acc0 = vector.transfer_read %init[%c0, %c0], %pad_f32 : tensor<8x8xf32>, vector<8xf32>
    %acc1 = vector.transfer_read %init[%c1, %c0], %pad_f32 : tensor<8x8xf32>, vector<8xf32>
    // ... (acc2 through acc6) ...
    %acc7 = vector.transfer_read %init[%c7, %c0], %pad_f32 : tensor<8x8xf32>, vector<8xf32>

    // Core matmul logic using generic vector FMA: extract one LHS scalar per
    // row, widen and broadcast it, then accumulate against the shared RHS.
    %lhs_s0 = tensor.extract %lhs[%c0, %c0] : tensor<8x1xf16>
    %lhs_w0 = arith.extf %lhs_s0 : f16 to f32
    %b_lhs_s0 = vector.broadcast %lhs_w0 : f32 to vector<8xf32>
    %res0 = vector.fma %b_lhs_s0, %rhs_vec, %acc0 : vector<8xf32>

    %lhs_s1 = tensor.extract %lhs[%c1, %c0] : tensor<8x1xf16>
    %lhs_w1 = arith.extf %lhs_s1 : f16 to f32
    %b_lhs_s1 = vector.broadcast %lhs_w1 : f32 to vector<8xf32>
    %res1 = vector.fma %b_lhs_s1, %rhs_vec, %acc1 : vector<8xf32>

    // ... (repeated for res2 through res6) ...

    %lhs_s7 = tensor.extract %lhs[%c7, %c0] : tensor<8x1xf16>
    %lhs_w7 = arith.extf %lhs_s7 : f16 to f32
    %b_lhs_s7 = vector.broadcast %lhs_w7 : f32 to vector<8xf32>
    %res7 = vector.fma %b_lhs_s7, %rhs_vec, %acc7 : vector<8xf32>

    // Assemble the final result tensor row by row.
    %out0 = vector.transfer_write %res0, %init[%c0, %c0] : vector<8xf32>, tensor<8x8xf32>
    %out1 = vector.transfer_write %res1, %out0[%c1, %c0] : vector<8xf32>, tensor<8x8xf32>
    // ... (repeated for out2 through out6) ...
    %out7 = vector.transfer_write %res7, %out6[%c7, %c0] : vector<8xf32>, tensor<8x8xf32>
    util.return %out7 : tensor<8x8xf32>
  }
}
```

The body sticks to upstream tensor-level ops (`tensor.extract`, `vector.transfer_read`/`transfer_write`) so that standard bufferization turns them into plain `vector.load`/`vector.store` on memrefs, which is what the LLVM backend then maps to `vle16.v`/`vse32.v`.
4. Integrate Ukernel into IREE's Build and Codegen Pipeline
- Place the ukernel `.mlir` file in a directory like `iree/compiler/Codegen/LLVMCPU/builtins/mlir_ukernel/riscv/`.
- Add a PDL pattern in `iree/compiler/Codegen/LLVMCPU/builtins/pdl/` to match the `linalg.matmul` and rewrite it using the `iree_codegen.match.ukernel` rewriter.
- Ensure the `LowerTensorUKernelsPass` is part of the LLVMCPU codegen pipeline and correctly finds and inlines the ukernel's `@main` function.
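The matching step could look roughly like the following PDL pattern. This is a sketch: the `iree_codegen.match.ukernel` rewriter name comes from the plan above, everything else (pattern name, benefit) is illustrative:

```mlir
// Match any linalg.matmul and hand it to the external ukernel rewriter, which
// is expected to check operand types, cpu_features, and layout before
// committing to the replacement.
pdl.pattern @match_rvv_matmul_ukernel : benefit(1) {
  %lhs = pdl.operand
  %rhs = pdl.operand
  %init = pdl.operand
  %result_types = pdl.types
  %matmul = pdl.operation "linalg.matmul"(%lhs, %rhs, %init : !pdl.value, !pdl.value, !pdl.value)
      -> (%result_types : !pdl.range<type>)
  pdl.rewrite %matmul with "iree_codegen.match.ukernel"(%matmul : !pdl.operation)
}
```

Keeping the type and feature checks inside the native rewriter (rather than encoding them as PDL constraints) matches how the ukernel metadata in step 3 is declared.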
5. Define a Comprehensive Test Suite
- Integration Test: Add an IREE test case that compiles a `linalg.matmul` whose accumulator is initialized from a bias (a `linalg.broadcast` of the 1-D bias into the init tensor; `linalg.fill` only accepts scalars). The test should target RVV and check the IR after the `LowerTensorUKernelsPass` to verify that the ukernel was inlined and that the bias producer correctly feeds its output into the ukernel's `init` argument.

```mlir
// test/iree/compiler/Codegen/LLVMCPU/riscv_matmul_bias_fusion.mlir
// RUN: iree-compile ... --iree-llvmcpu-target-triple=riscv64... --iree-llvmcpu-target-cpu-features="+v,+zfh" ...
// CHECK: func.func private @riscv_matmul_8x8x1_f16f16f32_dispatch
// CHECK: vector.load %{{.*}} : memref<...>, vector<8xf32>
func.func @test_matmul_bias(%lhs: tensor<512x256xf16>, %rhs: tensor<256x1024xf16>,
                            %bias: tensor<1024xf32>) -> tensor<512x1024xf32> {
  %empty = tensor.empty() : tensor<512x1024xf32>
  %init = linalg.broadcast ins(%bias : tensor<1024xf32>)
      outs(%empty : tensor<512x1024xf32>) dimensions = [0]
  %res = linalg.matmul ins(%lhs, %rhs : tensor<512x256xf16>, tensor<256x1024xf16>)
      outs(%init : tensor<512x1024xf32>) -> tensor<512x1024xf32>
  return %res : tensor<512x1024xf32>
}
```
- End-to-End Test: Create an e2e test that runs the compiled module on an emulator (e.g., QEMU) or hardware and verifies the numerical correctness of the fused `matmul+bias` operation against a reference implementation.
What component(s) does this issue relate to?
MLIR
Additional context
No response