
[RVV] Optimize Generic RVV Matmul codegen #18986

Open
bhbruce wants to merge 15 commits into base: main

Conversation

bhbruce (Contributor) commented Nov 1, 2024

Background

RVV matmul always uses register tile size [m, n] = [8, 32]. This works for VLEN=512: it generates an outer-product GEMM with LMUL=2, though the vector register utilization rate is only 56% (18/32). For VLEN=1024 it uses LMUL=1 and the utilization rate drops to 28%. For VLEN=128 the codegen uses LMUL=8, for which there are not enough vector registers; this introduces register spills and causes a performance drop.

Optimization

Two optimizations are included in this PR to improve RVV matmul code generation.

  1. Register tiling optimization: We introduce VLEN-aware register tiling, changing the register tile size from [m, n, k] = [8, 32, 1] to [7, vlmax_lmul_4, 1]. This not only fixes the inefficient code generated for VLEN != 512 but also raises the vector register utilization rate to 100%. Experiments on SiFive platforms show a performance improvement of around 25%~50%.

  2. Cache tiling optimization:

  • After introducing VLEN-aware register tiling, the original cache tiling code may break, since maxTileSize defaults to 64. To solve this, we guarantee maxTileSize >= minTileSize, which implies that cache tile size >= register tile size.
  • The new option iree-llvmcpu-aggressive-distribution makes the cache tile size twice the register tile size when maxTileSize < minTileSize. This reduces the cache miss rate; the performance improvement is 3~5% in general.

In summary, my experiments on a SiFive VLEN=1024 platform show a 5.4x speedup for BERT-Large after applying these patches.

bhbruce and others added 8 commits October 30, 2024 01:01
For the RISC-V path, the matmul output tile size is set to [m, n, k] = [8, 32, 1] by default.
This is efficient for VLEN=512: it generates code using LMUL=2 and 18 vector registers.
For other VLEN values it generates inefficient code:
for VLEN=1024, it generates LMUL=1 code that uses only 8+1 vector registers;
for VLEN=256, it generates LMUL=4 code that introduces vector register spills.

This commit detects whether RISC-V Vector is supported and, if so, calls `getMatmulRISCVVectorSizes`.
This function aims to fully utilize the vector registers, so it generates tile size [7, vlmax_lmul4, 1];
using LMUL=4 makes full use of all 32 vector registers.

For now, it only applies to `nonWideningLinalgElementType`.
It uses `native_vector_size` and the data type to determine the vl for LMUL=4.

Comparing tile settings [8, m2] vs. [7, m4] on XNNPACK on SiFive platforms shows a performance
improvement of around 25%~50%.

Example:
`native_vector_size` is 256 for VLEN=1024 and the default LMUL=2.
  The output tile size:
  m = 7
  n = 256 * 8 * 2 / 32 = 128, which equals vlmax (SEW=32, LMUL=4)
  k = 1

Signed-off-by: Bruce Lai <[email protected]>
Use the default tile size for matmul_transpose_b & batch_matmul_transpose_b
to avoid a performance drop when using getMatmulVectorSizesUsingFullVectorHeuristics.

If the tile size is set to [m, n] = [8, 32], the output asm uses vector slide and fmacc instructions
in the innermost loop. This results in acceptable, though not optimal, performance.

However, if we increase the tile size along the N dimension, the output asm includes
numerous scalar load and store instructions within the innermost loop, which degrades performance.

Therefore, we use the default tile size for matmul_transpose_b.

Signed-off-by: Bruce Lai <[email protected]>
The default maximum cache_tile_size is clDefaultDistTileSize (64).
If RVV is enabled and VLEN=1024, the register_tile_size is [m, n, k] = [7, 128, 1].

In this case, minTileSizes[n-dim] equals 128 (the register_tile_size in the n dimension),
which makes maxTileSizes < minTileSizes in the n dimension.

This commit guarantees maxTileSizes >= minTileSizes, which implies that
cache_tile_size >= register_tile_size in the n dimension.

Note: the matmul cache_tile_size modification is applied only for RVV for now,
to work around a failure in the `select_x86_64_lowering_strategy.mlir.test` test.

Signed-off-by: Bruce Lai <[email protected]>
This option enlarges maxTileSizes to twice minTileSizes when maxTileSizes < minTileSizes,
so that the cache tile size becomes twice the register tile size.

Assume VLEN=1024, so the register tile size is [m, n] = [7, 128].
By default, maxTileSizes[n-dim] = 64.
This option changes it to 256, i.e. twice the n-dim register tile size.
In other words, n-dim cache tiling is guaranteed, which reduces the cache miss rate.

Performance improves by 3~5% in general cases.

Signed-off-by: Bruce Lai <[email protected]>
bhbruce (Contributor Author) commented Nov 1, 2024

FYI. @rednoah91

bhbruce changed the title from "[RVV] Optimize Generic Matmul RVV codegen performance" to "[RVV] Optimize Generic RVV Matmul codegen" on Nov 1, 2024
hanhanW (Contributor) left a comment

Thanks for the contribution! I left some comments, please take a look.

```cpp
auto lhsShape = lhsType.getShape();
bool has_batch = isa<linalg::BatchMatmulOp, linalg::BatchMatmulTransposeBOp>(
    op.getOperation());
char idx = (has_batch) ? 1 : 0;
```
hanhanW (Contributor):

nit: use int. Should we use it in all those sizes accesses (e.g., should the below all become sizes[baseIdx+0/1/2/3/4])? If so, I think we can rename it to baseIdx.

bhbruce (Contributor Author):

This idx aims to point to the m-dimension index in lhsShape.
How about changing it to mDimIdx?

hanhanW (Contributor) commented Nov 7, 2024:

I see. In this case, I think what you're looking for is inferContractionDims. It provides a more flexible way that is not limited to named ops; you can get the batch/m/n/k dimensions from it.

Furthermore, you can bail out if one of the dimensions is greater than one.

EDIT: This is an optional review comment. I don't know your end goal, but I'd like to share that there is a utility for inferring these dimensions.

bhbruce (Contributor Author):

Thank you for providing this approach. However, I am confused about why each dimension has size two.

hanhanW (Contributor):

What is the purpose of adding this test? IMO, it really looks like a test with a transform spec, although you're using C++ passes. It is not really maintainable in the IREE main repo because it tests a combination of passes. The lowering_config is preset, and the TileAndFuse tests are covered in the other file.

If we want a pipeline test, are you able to add a test in pipeline_tests.mlir? Or create a pipeline_aggressive_distribution_riscv_tests.mlir and add the test to the new file? (Note: you'll need to use builtin.module(iree-llvmcpu-select-lowering-strategy, func.func(iree-llvmcpu-lower-executable-target)), not these arbitrary passes.)

The main difference is that the pipeline tests run the full pipeline, which includes strategy selection and lowering. This gives us a better picture of what we're expecting in the test.

bhbruce (Contributor Author):

> What is the purpose of adding this test?

I would like to ensure that the compiler generates the desired for-loop structure when using register tile size [7, 64] and cache tile size [64, 128].

I agree with you. I should create the other file to run the whole pipeline test. Thank you for providing the details.

hanhanW (Contributor):

E.g., here is a contraction op where the number of M dimensions is more than 1.

```mlir
util.func public @multi_m_dim_generic(%arg0 : tensor<64x4x128xf32>, %arg1 : tensor<128x512xf32>,
    %arg2 : tensor<64x4x512xf32>) -> tensor<64x4x512xf32> {
  %4 = linalg.generic {
      indexing_maps = [affine_map<(d0, d1, d2, d3) -> (d0, d3, d2)>,
                       affine_map<(d0, d1, d2, d3) -> (d2, d1)>,
                       affine_map<(d0, d1, d2, d3) -> (d0, d3, d1)>],
      iterator_types = ["parallel", "parallel", "parallel", "reduction"]}
      ins(%arg0, %arg1 : tensor<64x4x128xf32>, tensor<128x512xf32>) outs(%arg2 : tensor<64x4x512xf32>) {
  ^bb0(%in: f32, %in_0: f32, %out: f32):
    %5 = arith.mulf %in, %in_0 : f32
    %6 = arith.addf %5, %out : f32
    linalg.yield %6 : f32
  } -> tensor<64x4x512xf32>
  util.return %4 : tensor<64x4x512xf32>
}
```

bhbruce (Contributor Author):

Do we face this kind of MLIR in simple pass routing?
I supposed that multi_m & multi_n only happen when mmt4d is enabled.

Change from `iree-llvmcpu-aggressive-distribution` to `iree-llvmcpu-riscv-aggressive-distribution`.
Also, fix some coding style problems.

Signed-off-by: Bruce Lai <[email protected]>
bhbruce (Contributor Author) commented Nov 7, 2024

Hi @hanhanW,
Thank you for your suggestions. I've modified the code following your review comments and appended several commits after the original ones.

hanhanW (Contributor) left a comment

Overall looks good to me, just a few nits.

hanhanW (Contributor):

I think we can collapse this file into the other one. What you're looking for is adding another `// RUN` line with the flag. E.g.,

```
// RUN: iree-opt --pass-pipeline="builtin.module(func.func(iree-codegen-decompose-pack-unpack-ops))" --split-input-file %s | FileCheck %s -check-prefixes=CHECK-ALL,CHECK
// RUN: iree-opt --pass-pipeline="builtin.module(func.func(iree-codegen-decompose-pack-unpack-ops{use-only-reshapes=true}))" --split-input-file %s | FileCheck %s -check-prefixes=CHECK-ALL,CHECK-RESHAPE
```

This reduces duplicated tests across files.

bhbruce (Contributor Author):

OK. I'll collapse it into compiler/src/iree/compiler/Codegen/LLVMCPU/test/pipeline_tests.mlir.

hanhanW (Contributor):

Oops, sorry for not being clear enough. What I meant was collapsing the select_riscv_lowering_strategy_*.mlir files into select_riscv_lowering_strategy.mlir, not the pipeline_tests one.

The original pipeline_riscv_aggressive_distribution_tests.mlir looks good to me; can you restore it?

bhbruce (Contributor Author):

Oh, my bad, I misunderstood. Let me revert that and merge the select_riscv_lowering_strategy_*.mlir files.

bhbruce (Contributor Author):

Solved!

bhbruce (Contributor Author) commented Nov 13, 2024

Hi @hanhanW,
I've updated the PR based on your review comments. Please take another look. Thanks!

hanhanW (Contributor) commented Nov 14, 2024

> Hi @hanhanW Based on your review comments, I've updated the PR. Please help to review. Thanks!

I left a comment about how we structure the tests. The other parts look okay to me.

hanhanW (Contributor) left a comment

Wh

EDIT: I did not mean to submit this comment, please ignore this one.


```mlir
#executable_target_embedded_elf_riscv_64_ = #hal.executable.target<"llvm-cpu", "embedded-elf-riscv_64", {cpu_features = "+m,+a,+f,+d,+zvl1024b,+v", data_layout = "e-m:e-p:64:64-i64:64-i128:128-n32:64-S128", native_vector_size = 256 : index, target_triple = "riscv64-unknown-unknown-eabi-elf"}>
builtin.module {
func.func @matmul_riscv_vl1024() attributes {hal.executable.target = #executable_target_embedded_elf_riscv_64_} {
```
hanhanW (Contributor):

What is the difference between this matmul_riscv_vl1024 test and the matmul_riscv_vl1024 test below? Can we collapse them into one? (In that case, you can move all your CHECK-AGGRESSIVE lines right after the regular CHECKs.) E.g.,

```
module {....}
// CHECK: ...
// CHECK: ...
// CHECK: ...
...

// CHECK-AGGRESSIVE: ...
// CHECK-AGGRESSIVE: ...
// CHECK-AGGRESSIVE: ...
...
```


```mlir
#executable_target_embedded_elf_riscv_64_ = #hal.executable.target<"llvm-cpu", "embedded-elf-riscv_64", {cpu_features = "+m,+a,+f,+d,+zvl512b,+v", data_layout = "e-m:e-p:64:64-i64:64-i128:128-n32:64-S128", native_vector_size = 128 : index, target_triple = "riscv64-unknown-unknown-eabi-elf"}>
builtin.module {
func.func @gemv_riscv_vl512() attributes {hal.executable.target = #executable_target_embedded_elf_riscv_64_} {
```
hanhanW (Contributor):

Same question: based on the name, I can't distinguish this test from the one above. What is the difference?
