
[RVV] Optimize Generic RVV Matmul codegen #18986

Open
bhbruce wants to merge 15 commits into base: main

Conversation

bhbruce (Contributor) commented Nov 1, 2024

Background

RVV matmul always uses register tile size [m, n] = [8, 32]. This works for VLEN=512: it generates an outer-product GEMM with LMUL=2, though the vector register utilization rate is only 56% (18/32). For VLEN=1024 it uses LMUL=1 and the utilization rate drops to 28%. For VLEN=128 the codegen uses LMUL=8, for which there are not enough vector registers; this introduces register spills and causes a performance drop.

Optimization

Two optimizations are included in this PR to improve RVV matmul code generation.

  1. Register tiling optimization: We introduce VLEN-aware register tiling, changing the register tile size from [m, n, k] = [8, 32, 1] to [7, vlmax_lmul_4, 1]. This not only fixes the inefficient code generated for VLEN != 512 but also raises the vector register utilization rate to 100%. Experiments on SiFive platforms show a performance improvement of around 25%~50%.

  2. Cache tiling optimization:

  • After introducing VLEN-aware register tiling, the original cache tiling code may break, since maxTileSize defaults to 64. To solve this, we guarantee maxTileSize >= minTileSize, which implies that cache tile size >= register tile size.
  • The new option iree-llvmcpu-aggressive-distribution makes the cache tile size twice the register tile size when maxTileSize < minTileSize. This reduces the cache miss rate; the performance improvement is 3~5% in general.

In summary, my experiments on a SiFive VLEN=1024 platform show a 5.4x speedup for BERT-Large after applying these patches.

bhbruce and others added 8 commits October 30, 2024 01:01
For the RISC-V path, the matmul output tile size is set to [m, n, k] = [8, 32, 1] by default.
This is efficient for VLEN=512: it generates code using LMUL=2 and 18 vector registers.
For other VLEN values it generates inefficient code:
for VLEN=1024, it generates LMUL=1 code that uses only 8+1 vector registers;
for VLEN=256, it generates LMUL=4 code that introduces vector register spills.

This commit detects whether RISC-V Vector is supported and, if so, calls `getMatmulRISCVVectorSizes`.
This function aims to fully utilize the vector registers, so it generates tile size [7, vlmax_lmul4, 1];
using LMUL=4 makes full use of all 32 vector registers.

For now, it only applies to `nonWideningLinalgElementType`.
It uses `native_vector_size` and the data type to determine the vl for LMUL=4.

Comparing tile settings [8, m2] vs. [7, m4] on XNNPACK on SiFive platforms shows a performance
improvement of around 25%~50%.

Example:
`native_vector_size` is 256 for VLEN=1024 and the default LMUL=2.
  The output tile size:
  m = 7
  n = 256 * 8 * 2 / 32 = 128, which equals vlmax (SEW=32, LMUL=4)
  k = 1

Signed-off-by: Bruce Lai <[email protected]>
Use the default tile size for matmul_transpose_b & batch_matmul_transpose_b
to avoid a performance drop when using getMatmulVectorSizesUsingFullVectorHeuristics.

If the tile size is set to [m, n] = [8, 32], the output asm uses vector slide and fmacc instructions
in the innermost loop. This results in acceptable, though not optimal, performance.

However, if we increase the tile size along the N dimension, the output asm includes
numerous scalar load and store instructions within the innermost loop, which degrades performance.

Therefore, we use the default tile size for matmul_transpose_b.

Signed-off-by: Bruce Lai <[email protected]>
The default maximum cache_tile_size is clDefaultDistTileSize (64).
If RVV is enabled and VLEN=1024, the register_tile_size is [m, n, k] = [7, 128, 1].

In this case, minTileSizes[n-dim] equals 128 (the register_tile_size in the n dimension),
which makes maxTileSizes < minTileSizes in the n dimension.

This commit guarantees maxTileSizes >= minTileSizes, which implies that
cache_tile_size >= register_tile_size in the n dimension.

Note: the matmul cache_tile_size modification is applied only for RVV for now,
to work around a failure in the `select_x86_64_lowering_strategy.mlir.test` test.

Signed-off-by: Bruce Lai <[email protected]>
This option enlarges maxTileSizes to twice minTileSizes when maxTileSizes < minTileSizes,
so that the cache tile size becomes twice the register tile size.

Assume VLEN=1024, so the register tile size is [m, n] = [7, 128].
By default, maxTileSizes[n-dim] = 64.
This option changes it to 256, i.e. twice the n-dim register tile size.
In other words, n-dim cache tiling is guaranteed, which reduces the cache miss rate.

Performance improves by 3~5% in general cases.

Signed-off-by: Bruce Lai <[email protected]>
bhbruce (Contributor Author) commented Nov 1, 2024

FYI. @rednoah91

bhbruce changed the title from "[RVV] Optimize Generic Matmul RVV codegen performance" to "[RVV] Optimize Generic RVV Matmul codegen" on Nov 1, 2024
hanhanW (Contributor) left a comment

Thanks for the contribution! I left some comments, please take a look.

```cpp
auto lhsShape = lhsType.getShape();
bool has_batch = isa<linalg::BatchMatmulOp, linalg::BatchMatmulTransposeBOp>(
    op.getOperation());
char idx = (has_batch) ? 1 : 0;
```
hanhanW (Contributor):

nit: use int. Should we use it in all those sizes accesses (e.g., should the below all become sizes[baseIdx+0/1/2/3/4])? If so, I think we can rename it to baseIdx.

bhbruce (Contributor Author):

This idx aims to point to the m-dimension index in lhsShape.
How about changing it to mDimIdx?

hanhanW (Contributor) commented Nov 7, 2024:

I see. In this case, I think what you're looking for is inferContractionDims. It provides a more flexible way that is not limited to named ops; you can get the batch/m/n/k dimensions from it.

Furthermore, you can bail out if one of the dimensions is greater than one.

EDIT: This is an optional review comment. I don't know your end goal, but I'd like to share that there is a utility for inferring these dimensions.

bhbruce (Contributor Author):

Thank you for providing this approach. However, I am confused about why each dimension has size two.

hanhanW (Contributor):

What is the purpose of adding this test? IMO, it really looks like a test with a transform spec, although you're using C++ passes. It is not really maintainable in the IREE main repo because it tests a combination of passes. The lowering_config is preset, and the TileAndFuse tests are covered in the other file.

If we want a pipeline test, are you able to add a test in pipeline_tests.mlir? Or create a pipeline_aggressive_distribution_riscv_tests.mlir and add the test to the new file? (Note: you'll need to use builtin.module(iree-llvmcpu-select-lowering-strategy, func.func(iree-llvmcpu-lower-executable-target)), not these arbitrary passes.)

The main difference is that the pipeline tests run the full pipeline, which includes strategy selection and lowering. This gives us a better picture of what we're expecting in the test.

bhbruce (Contributor Author):

> What is the purpose of adding this test?

I would like to ensure that the compiler generates the desired for-loop structure when using register tile size [7, 64] and cache tile size [64, 128].

I agree with you. I should create the other file to run the whole pipeline test. Thank you for providing the details.

hanhanW (Contributor):

E.g., here is a contraction op where the number of M dimensions is more than 1.

```mlir
util.func public @multi_m_dim_generic(%arg0 : tensor<64x4x128xf32>, %arg1 : tensor<128x512xf32>,
    %arg2 : tensor<64x4x512xf32>) -> tensor<64x4x512xf32> {
  %4 = linalg.generic {
      indexing_maps = [affine_map<(d0, d1, d2, d3) -> (d0, d3, d2)>,
                       affine_map<(d0, d1, d2, d3) -> (d2, d1)>,
                       affine_map<(d0, d1, d2, d3) -> (d0, d3, d1)>],
      iterator_types = ["parallel", "parallel", "parallel", "reduction"]}
      ins(%arg0, %arg1 : tensor<64x4x128xf32>, tensor<128x512xf32>) outs(%arg2 : tensor<64x4x512xf32>) {
  ^bb0(%in: f32, %in_0: f32, %out: f32):
    %5 = arith.mulf %in, %in_0 : f32
    %6 = arith.addf %5, %out : f32
    linalg.yield %6 : f32
  } -> tensor<64x4x512xf32>
  util.return %4 : tensor<64x4x512xf32>
}
```

bhbruce (Contributor Author):

Do we face this kind of MLIR in simple pass routing?
I supposed that multi_m & multi_n only happen when mmt4d is enabled.

Change from `iree-llvmcpu-aggressive-distribution` to `iree-llvmcpu-riscv-aggressive-distribution`.
Also, fix some coding style problems.

Signed-off-by: Bruce Lai <[email protected]>
bhbruce (Contributor Author) commented Nov 7, 2024

Hi @hanhanW,
Thank you for your suggestions. I've modified the code following your review comments and appended several commits after the original ones.

hanhanW (Contributor) left a comment

Overall looks good to me, just a few nits.

hanhanW (Contributor):

I think we can collapse this file into the other one. What you're looking for is adding another `// RUN` line with the flag. E.g.,

```
// RUN: iree-opt --pass-pipeline="builtin.module(func.func(iree-codegen-decompose-pack-unpack-ops))" --split-input-file %s | FileCheck %s -check-prefixes=CHECK-ALL,CHECK
// RUN: iree-opt --pass-pipeline="builtin.module(func.func(iree-codegen-decompose-pack-unpack-ops{use-only-reshapes=true}))" --split-input-file %s | FileCheck %s -check-prefixes=CHECK-ALL,CHECK-RESHAPE
```

This reduces duplicated tests across files.

bhbruce (Contributor Author):

OK. I'll collapse it into compiler/src/iree/compiler/Codegen/LLVMCPU/test/pipeline_tests.mlir.

hanhanW (Contributor):

Oops, sorry for not being clear enough. What I meant was collapsing the select_riscv_lowering_strategy_*.mlir files into select_riscv_lowering_strategy.mlir, not the pipeline_tests one.

The original pipeline_riscv_aggressive_distribution_tests.mlir looks good to me; can you restore it?

bhbruce (Contributor Author):

Oh, my bad, I misunderstood. Let me revert that and merge the select_riscv_lowering_strategy_*.mlir files.

bhbruce (Contributor Author):

Solved!

bhbruce (Contributor Author) commented Nov 13, 2024

Hi @hanhanW,
I've updated the PR based on your review comments. Please take another look. Thanks!

hanhanW (Contributor) commented Nov 14, 2024

> Hi @hanhanW Based on your review comments, I've updated the PR. Please help to review. Thanks!

I left a comment about how we structure the tests. The other parts look okay to me.

hanhanW (Contributor) left a comment

Wh

EDIT: I did not mean to submit this comment, please ignore this one.


```mlir
#executable_target_embedded_elf_riscv_64_ = #hal.executable.target<"llvm-cpu", "embedded-elf-riscv_64", {cpu_features = "+m,+a,+f,+d,+zvl1024b,+v", data_layout = "e-m:e-p:64:64-i64:64-i128:128-n32:64-S128", native_vector_size = 256 : index, target_triple = "riscv64-unknown-unknown-eabi-elf"}>
builtin.module {
func.func @matmul_riscv_vl1024() attributes {hal.executable.target = #executable_target_embedded_elf_riscv_64_} {
```
hanhanW (Contributor):

What is the difference between this matmul_riscv_vl1024 test and the matmul_riscv_vl1024 test below? Can we collapse them into one? (In that case, you can move all your CHECK-AGGRESSIVE lines right after the regular CHECKs.) E.g.,

```
module {....}
// CHECK: ...
// CHECK: ...
// CHECK: ...
...

// CHECK-AGGRESSIVE: ...
// CHECK-AGGRESSIVE: ...
// CHECK-AGGRESSIVE: ...
...
```


```mlir
#executable_target_embedded_elf_riscv_64_ = #hal.executable.target<"llvm-cpu", "embedded-elf-riscv_64", {cpu_features = "+m,+a,+f,+d,+zvl512b,+v", data_layout = "e-m:e-p:64:64-i64:64-i128:128-n32:64-S128", native_vector_size = 128 : index, target_triple = "riscv64-unknown-unknown-eabi-elf"}>
builtin.module {
func.func @gemv_riscv_vl512() attributes {hal.executable.target = #executable_target_embedded_elf_riscv_64_} {
```
hanhanW (Contributor):

Same question: based on the name, I can't distinguish this test from the one above. What is the difference?
