[amdgpu] Poor transpose performance #22718

Input IR:
```mlir
module {
  func.func @main(%arg0: tensor<147456x512xbf16>) -> tensor<512x147456xbf16> {
    %0 = tensor.empty() : tensor<512x147456xbf16>
    %1 = linalg.generic {indexing_maps = [affine_map<(d0, d1) -> (d0, d1)>, affine_map<(d0, d1) -> (d1, d0)>], iterator_types = ["parallel", "parallel"]} ins(%arg0 : tensor<147456x512xbf16>) outs(%0 : tensor<512x147456xbf16>) {
    ^bb0(%in: bf16, %out: bf16):
      linalg.yield %in : bf16
    } -> tensor<512x147456xbf16>
    return %1 : tensor<512x147456xbf16>
  }
}
```
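For readers less familiar with `linalg.generic`, the two indexing maps above say that the element read at `(d0, d1)` of the input is written to `(d1, d0)` of the output. A minimal pure-Python sketch of that semantics follows; the function name `transpose_reference` is hypothetical and this is only a reference for correctness checks on small inputs, not how IREE lowers the op.

```python
def transpose_reference(src):
    """Reference semantics of the linalg.generic: out[d1][d0] = src[d0][d1]."""
    rows, cols = len(src), len(src[0])
    # Output shape is (cols, rows), matching tensor<512x147456> for a 147456x512 input.
    out = [[None] * rows for _ in range(cols)]
    for d0 in range(rows):        # first "parallel" iterator
        for d1 in range(cols):    # second "parallel" iterator
            out[d1][d0] = src[d0][d1]
    return out


small = [[1, 2, 3],
         [4, 5, 6]]
print(transpose_reference(small))  # [[1, 4], [2, 5], [3, 6]]
```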
Perf:
```
$ iree-compile --iree-opt-level=O3 --iree-hal-target-device=hip --iree-hip-target=gfx942 transpose.mlir -o transpose.vmfb
$ iree-benchmark-module --module=transpose.vmfb  --device=hip://0 --input=147456x512xbf16 --function=main
...
-----------------------------------------------------------------------------------------
Benchmark                               Time             CPU   Iterations UserCounters...
-----------------------------------------------------------------------------------------
BM_main/process_time/real_time      0.178 ms        0.217 ms         3802 items_per_second=5.60335k/s
```
Benchmark time is ~180us (pure dispatch time is ~130us); the target is ~20us.
- The dispatch currently goes down the `LLVMGPUTransposeSharedMem` pipeline.
- We should try TileAndFuse instead, and make sure the loads are vectorized.
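
To put the timings above in perspective, the sketch below converts them into effective memory bandwidth. It assumes (my assumption, not stated in the issue) that each bf16 element is read once and written once, so the traffic is twice the tensor size, and it uses 1 TB = 10^12 bytes. The helper name `effective_bandwidth_tbps` is hypothetical.

```python
ROWS, COLS = 147456, 512
BYTES_PER_ELEMENT = 2  # bf16

# Assumption: one read of the input plus one write of the output.
TOTAL_BYTES = 2 * ROWS * COLS * BYTES_PER_ELEMENT


def effective_bandwidth_tbps(time_us):
    """Effective bandwidth in TB/s for a transpose that takes `time_us` microseconds."""
    return TOTAL_BYTES / (time_us * 1e-6) / 1e12


for label, time_us in [("benchmark", 180), ("dispatch", 130), ("target", 20)]:
    print(f"{label:>9}: {time_us:>4} us -> {effective_bandwidth_tbps(time_us):.2f} TB/s")
```

Under these assumptions the transpose moves about 302 MB, which works out to roughly 1.7 TB/s at 180us, 2.3 TB/s at 130us, and 15.1 TB/s at the 20us target.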
