[RVV] Optimize Generic RVV Matmul codegen #18986
base: main
Conversation
Signed-off-by: Bruce Lai <[email protected]>
For the RISC-V path, the matmul output tile size defaults to [m,n,k]=[8,32,1]. This is efficient for VLEN=512, where it generates code with LMUL=2 that uses 18 vector registers, but it produces inefficient code for other vector lengths: VLEN=1024 generates LMUL=1 code that uses only 8+1 vector registers, and VLEN=256 generates LMUL=4 code that introduces vector register spills. This commit detects whether RISC-V Vector is supported and, if so, calls `getMatmulRISCVVectorSizes`. This function aims to fully utilize the vector registers, so it generates the tile size [7, vlmax_lmul4, 1]; using LMUL=4 can utilize all 32 vector registers. For now, it only applies to `nonWideningLinalgElementType`. It uses `native_vector_size` and the data type to determine the vl for LMUL=4. Comparing the tile settings [8, m2] vs. [7, m4] on XNNPACK on SiFive platforms shows a performance improvement of around 25%~50%.
Example: `native_vector_size` is 256 for VLEN=1024 and the default LMUL=2. The output tile size is:
m = 7
n = 256 * 8 * 2 / 32 = 128 => equals vlmax (SEW=32 & LMUL=4)
k = 1
Signed-off-by: Bruce Lai <[email protected]>
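For illustration, here is a hedged sketch of the n-dim tile arithmetic above (the function name and parameters are illustrative, not the actual `getMatmulRISCVVectorSizes` implementation):

```cpp
#include <cstdint>

// Derive the n-dim register tile as vlmax for SEW = element bit width and
// LMUL=4, given native_vector_size in bytes (which already corresponds to the
// default LMUL=2). Illustrative only; assumes a non-widening element type.
int64_t riscvMatmulNTile(int64_t nativeVectorSizeInBytes,
                         int64_t elementBitWidth) {
  // Bytes -> bits, times 2 to go from LMUL=2 to LMUL=4, divided by the
  // element width to get the number of elements (vlmax).
  return nativeVectorSizeInBytes * 8 * 2 / elementBitWidth;
}
// Example: native_vector_size = 256 (VLEN=1024), f32: 256 * 8 * 2 / 32 = 128.
```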
Use the default tile size for matmul_transpose_b & batch_matmul_transpose_b to avoid a performance drop when using getMatmulVectorSizesUsingFullVectorHeuristics. If the tile size is set to [m, n] = [8, 32], the output asm uses vector slide and fmacc instructions in the innermost loop. This results in acceptable, though not optimal, performance. However, if we increase the tile size along the N dimension, the output asm includes numerous scalar load and store instructions within the innermost loop, which degrades performance. Therefore, we use the default tile size for matmul_transpose_b. Signed-off-by: Bruce Lai <[email protected]>
The default maximum cache_tile_size is clDefaultDistTileSize (64). If RVV is enabled and VLEN = 1024, the register_tile_size is [m, n, k] = [7, 128, 1]. In this case, minTileSizes[n-dim] equals 128 (the register_tile_size in the n-dim), which makes maxTileSizes < minTileSizes in the n-dim. This commit guarantees maxTileSizes >= minTileSizes, which implies that cache_tile_size >= register_tile_size in the n-dim. Note: the matmul cache_tile_size modification is only applied for RVV for now, to work around the `select_x86_64_lowering_strategy.mlir.test` failure. Signed-off-by: Bruce Lai <[email protected]>
This option enlarges maxTileSizes to twice minTileSizes when maxTileSizes < minTileSizes, so that the cache tile size becomes twice the register tile size. Assume VLEN=1024, so the register tile size is [m, n] = [7, 128]. By default, maxTileSizes[n-dim] = 64; this option changes it to 256, i.e. twice the n-dim register tile size. In other words, n-dim cache tiling is guaranteed, which reduces the cache miss rate. The performance improves by 3~5% in general cases. Signed-off-by: Bruce Lai <[email protected]>
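For illustration, a hedged sketch of the cache-tile adjustment described in the two commits above (names are illustrative, not the actual IREE code):

```cpp
#include <algorithm>
#include <cstdint>
#include <vector>

// Keep the distribution (cache) tile at least as large as the register tile,
// and, under the aggressive-distribution option, twice as large.
void adjustCacheTileSizes(std::vector<int64_t> &maxTileSizes,
                          const std::vector<int64_t> &minTileSizes,
                          bool aggressiveDistribution) {
  for (size_t i = 0; i < maxTileSizes.size(); ++i) {
    if (aggressiveDistribution && maxTileSizes[i] < minTileSizes[i]) {
      // Cache tile becomes twice the register tile, e.g. 64 -> 256 for n=128.
      maxTileSizes[i] = 2 * minTileSizes[i];
    } else {
      // Guarantee cache tile >= register tile, e.g. 64 -> 128 for n=128.
      maxTileSizes[i] = std::max(maxTileSizes[i], minTileSizes[i]);
    }
  }
}
```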
Signed-off-by: Bruce Lai <[email protected]>
Signed-off-by: Bruce Lai <[email protected]>
Signed-off-by: Bruce Lai <[email protected]>
FYI @rednoah91
Thanks for the contribution! I left some comments, please take a look.
auto lhsShape = lhsType.getShape();
bool has_batch = isa<linalg::BatchMatmulOp, linalg::BatchMatmulTransposeBOp>(
    op.getOperation());
char idx = (has_batch) ? 1 : 0;
nit: use `int`. Should we use it in all those `sizes` accesses (e.g., should the ones below all become `sizes[baseIdx+0/1/2/3/4]`)? If so, I think we can rename it to `baseIdx`.
This `idx` is meant to point to the m-dimension index in `lhsShape`. How about changing it to `mDimIdx`?
I see. In this case, I think what you're looking for is `inferContractionDims`. It provides a more flexible way that is not limited to named ops; you can get the batch/m/n/k dimensions from it. Furthermore, you can bail out if any of those dimension groups contains more than one dimension.
EDIT: This is an optional review comment. I don't know your end goal, but I'd like to share that there is a utility for inferring these dimensions.
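For illustration, a minimal sketch of that suggestion, assuming MLIR's `linalg::inferContractionDims` utility (this is illustrative, not code from the PR):

```cpp
#include "mlir/Dialect/Linalg/IR/LinalgInterfaces.h"

static bool isSimpleMatmulLike(mlir::linalg::LinalgOp op) {
  // Infer which loop dimensions play the batch/m/n/k roles, whether the op is
  // a named matmul or a generic contraction.
  mlir::FailureOr<mlir::linalg::ContractionDimensions> dims =
      mlir::linalg::inferContractionDims(op);
  if (mlir::failed(dims))
    return false;
  // Bail out when any dimension group has more than one dimension, e.g. the
  // multi-M generic shown later in this thread.
  return dims->batch.size() <= 1 && dims->m.size() == 1 &&
         dims->n.size() == 1 && dims->k.size() == 1;
}
```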
Thank you for providing this approach. However, I am confused about why each dimension has size two.
What is the purpose of adding this test? IMO, it really looks like a test with a transform spec, although you're using C++ passes. It is not really maintainable in the IREE main repo because it is testing a combination of passes. The lowering_config is preset, and the TileAndFuse tests are covered in the other file.
If we want to do the pipeline test, are you able to add a test in pipeline_tests.mlir? Or create a pipeline_aggressive_distribution_riscv_tests.mlir and add the test to the new file? (Note, you'll need to use builtin.module(iree-llvmcpu-select-lowering-strategy, func.func(iree-llvmcpu-lower-executable-target)) but not these arbitrary passes.)
The main difference is that the pipeline tests run the full pipeline, which includes strategy selection and the lowering. This gives us a better picture of what we're expecting in the test.
> What is the purpose of adding this test?

I would like to ensure that the compiler generates the desired loop structure when using register tile size [7, 64] and cache tile size [64, 128].
I agree with you; I should create the other file to run the whole pipeline test. Thank you for providing the details.
E.g., here is a contraction op, but the number of M dimensions is more than 1:
util.func public @multi_m_dim_generic(%arg0 : tensor<64x4x128xf32>, %arg1 : tensor<128x512xf32>,
%arg2 : tensor<64x4x512xf32>) -> tensor<64x4x512xf32> {
%4 = linalg.generic {
indexing_maps = [affine_map<(d0, d1, d2, d3) -> (d0, d3, d2)>,
affine_map<(d0, d1, d2, d3) -> (d2, d1)>,
affine_map<(d0, d1, d2, d3) -> (d0, d3, d1)>],
iterator_types = ["parallel", "parallel", "parallel", "reduction"]}
ins(%arg0, %arg1 : tensor<64x4x128xf32>, tensor<128x512xf32>) outs(%arg2 : tensor<64x4x512xf32>) {
^bb0(%in: f32, %in_0: f32, %out: f32):
%5 = arith.mulf %in, %in_0 : f32
%6 = arith.addf %5, %out : f32
linalg.yield %6 : f32
} -> tensor<64x4x512xf32>
util.return %4 : tensor<64x4x512xf32>
}
Do we encounter this kind of MLIR in the regular pass pipeline? I assumed that multi-m & multi-n only happen when mmt4d is enabled.
Rename the flag from `iree-llvmcpu-aggressive-distribution` to `iree-llvmcpu-riscv-aggressive-distribution`. Also, fix some coding style problems. Signed-off-by: Bruce Lai <[email protected]>
Signed-off-by: Bruce Lai <[email protected]>
Signed-off-by: Bruce Lai <[email protected]>
Hi @hanhanW
Overall looks good to me, just a few nits.
I think we can collapse this file into the other one. What you're looking for is adding another `// RUN` line with the flag. E.g., iree/compiler/src/iree/compiler/Codegen/Common/test/decompose_pack_unpack_ops.mlir, lines 1 to 2 in c651ba9:
// RUN: iree-opt --pass-pipeline="builtin.module(func.func(iree-codegen-decompose-pack-unpack-ops))" --split-input-file %s | FileCheck %s -check-prefixes=CHECK-ALL,CHECK
// RUN: iree-opt --pass-pipeline="builtin.module(func.func(iree-codegen-decompose-pack-unpack-ops{use-only-reshapes=true}))" --split-input-file %s | FileCheck %s -check-prefixes=CHECK-ALL,CHECK-RESHAPE
This avoids duplicating the same tests across different files.
OK. I'll collapse it into compiler/src/iree/compiler/Codegen/LLVMCPU/test/pipeline_tests.mlir.
Oops, sorry for not being clear enough. What I meant is collapsing the select_riscv_lowering_strategy_*.mlir files into the select_riscv_lowering_strategy.mlir file, not the pipeline_test one. The original pipeline_riscv_aggressive_distribution_tests.mlir looks good to me; can you recover it?
Oh... my bad, I misunderstood it. Let me revert and merge select_riscv_lowering_strategy_*.mlir.
Solved!
Signed-off-by: Bruce Lai <[email protected]>
Signed-off-by: Bruce Lai <[email protected]>
Hi @hanhanW
I left a comment about how we structure the tests. The other parts look okay to me.
…e_tests" This reverts commit d0e0c8c. Signed-off-by: Bruce Lai <[email protected]>
Signed-off-by: Bruce Lai <[email protected]>
Wh
EDIT: I did not mean to submit this comment, please ignore this one.
#executable_target_embedded_elf_riscv_64_ = #hal.executable.target<"llvm-cpu", "embedded-elf-riscv_64", {cpu_features = "+m,+a,+f,+d,+zvl1024b,+v", data_layout = "e-m:e-p:64:64-i64:64-i128:128-n32:64-S128", native_vector_size = 256 : index, target_triple = "riscv64-unknown-unknown-eabi-elf"}>
builtin.module {
  func.func @matmul_riscv_vl1024() attributes {hal.executable.target = #executable_target_embedded_elf_riscv_64_} {
What is the difference between this matmul_riscv_vl1024 test and the matmul_riscv_vl1024 test below? Can we collapse them into one? (In this case, you can move all your CHECK-AGGRESSIVEs right after the regular CHECKs.) E.g.,
module {....}
// CHECK: ...
// CHECK: ...
// CHECK: ...
...
// CHECK-AGGRESSIVE: ...
// CHECK-AGGRESSIVE: ...
// CHECK-AGGRESSIVE: ...
...
#executable_target_embedded_elf_riscv_64_ = #hal.executable.target<"llvm-cpu", "embedded-elf-riscv_64", {cpu_features = "+m,+a,+f,+d,+zvl512b,+v", data_layout = "e-m:e-p:64:64-i64:64-i128:128-n32:64-S128", native_vector_size = 128 : index, target_triple = "riscv64-unknown-unknown-eabi-elf"}>
builtin.module {
  func.func @gemv_riscv_vl512() attributes {hal.executable.target = #executable_target_embedded_elf_riscv_64_} {
Same question: I can't tell what distinguishes this test from the one above based on the name. What is the difference?
Background
RVV matmul always uses the register tile size [m, n] = [8, 32]. This is fine for VLEN=512: it generates an outer-product GEMM with LMUL=2, and the vector register utilization rate is 56% (18/32). However, for VLEN=1024 it uses LMUL=1 and the utilization rate drops to 28%. For VLEN=128, the codegen uses LMUL=8 and there are not enough vector registers, which introduces register spills and causes a performance drop.
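As a hedged back-of-the-envelope check of the 18/32 figure (an illustration, not code from the PR): with f32 at VLEN=512, a 32-element row spans two registers (LMUL=2), so the 8x32 accumulator plus one loaded B row occupies 8*2 + 2 = 18 registers.

```cpp
#include <cstdint>
#include <cstdio>

int main() {
  const int64_t vlenBits = 512, sewBits = 32, m = 8, n = 32;
  const int64_t lmul = n * sewBits / vlenBits;  // 32 x f32 spans LMUL=2.
  const int64_t accRegs = m * lmul;             // accumulator rows: 16 regs.
  const int64_t rowRegs = lmul;                 // one loaded B row: 2 regs.
  std::printf("%lld/32 vector registers\n",
              static_cast<long long>(accRegs + rowRegs));  // 18/32, ~56%.
  return 0;
}
```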
Optimization
Two optimizations are included in this PR to improve RVV matmul code generation.
1. Register tiling optimization: we introduce VLEN-aware register tiling. The register tile size is changed from [m, n, k] = [8, 32, 1] to [7, vlmax_lmul4, 1]. This not only fixes the problem of generating inefficient code for VLEN != 512 but also increases the vector register utilization rate to 100%. Based on experiments on SiFive platforms, the performance improvement is around 25%~50%.
2. Cache tiling optimization: maxTileSize is set to 64 by default, which can be smaller than the n-dim register tile. To solve this, we guarantee maxTileSize >= minTileSize, which implies that the cache tile size >= the register tile size. In addition, iree-llvmcpu-riscv-aggressive-distribution makes the cache tile size twice the register tile size if maxTileSize < minTileSize. This reduces the cache miss rate; the performance improvement is 3~5% in general.
In summary, my experiment on a SiFive VLEN=1024 platform shows a 5.4x speedup for Bert-Large after applying the patches.