[amdgpu] Poor transpose performance #22718

@rkayaith

Description

Input IR:

module {
  func.func @main(%arg0: tensor<147456x512xbf16>) -> tensor<512x147456xbf16> {
    %0 = tensor.empty() : tensor<512x147456xbf16>
    %1 = linalg.generic {indexing_maps = [affine_map<(d0, d1) -> (d0, d1)>, affine_map<(d0, d1) -> (d1, d0)>], iterator_types = ["parallel", "parallel"]} ins(%arg0 : tensor<147456x512xbf16>) outs(%0 : tensor<512x147456xbf16>) {
    ^bb0(%in: bf16, %out: bf16):
      linalg.yield %in : bf16
    } -> tensor<512x147456xbf16>
    return %1 : tensor<512x147456xbf16>
  }
}

Perf:

$ iree-compile --iree-opt-level=O3 --iree-hal-target-device=hip --iree-hip-target=gfx942 transpose.mlir -o transpose.vmfb
$ iree-benchmark-module --module=transpose.vmfb --device=hip://0 --input=147456x512xbf16 --function=main
...
-----------------------------------------------------------------------------------------
Benchmark                               Time             CPU   Iterations UserCounters...
-----------------------------------------------------------------------------------------
BM_main/process_time/real_time      0.178 ms        0.217 ms         3802 items_per_second=5.60335k/s

Benchmark time is ~180 us (pure dispatch time is ~130 us); the target is ~20 us.
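To put these timings in context, here is a rough sketch of the memory traffic they imply. It assumes every bf16 element is read exactly once and written exactly once (no caching effects, no extra traffic), and the helper names are hypothetical, not part of any IREE tooling.

```python
# Back-of-the-envelope memory traffic for the transpose in this issue.
ROWS, COLS = 147456, 512   # input shape from the IR above
BYTES_PER_ELEM = 2         # bf16


def bytes_moved(rows, cols, elem_bytes):
    # Assumption: each element is read once and written once.
    return 2 * rows * cols * elem_bytes


def effective_bandwidth_tb_s(total_bytes, time_us):
    # Bytes per second, expressed in TB/s (1 TB = 1e12 bytes).
    return total_bytes / (time_us * 1e-6) / 1e12


total = bytes_moved(ROWS, COLS, BYTES_PER_ELEM)
print(f"bytes moved: {total}")
for label, t_us in [("benchmark", 180), ("dispatch", 130), ("target", 20)]:
    bw = effective_bandwidth_tb_s(total, t_us)
    print(f"{label:>9}: {t_us:4d} us -> {bw:.2f} TB/s")
```

Comparing these figures against the device's peak memory bandwidth gives a sense of how much headroom each timing leaves.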

  • currently goes down the LLVMGPUTransposeSharedMem pipeline
  • should try the TileAndFuse pipeline instead, and make sure loads are vectorized
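For readers unfamiliar with why tiling matters here, the following is a minimal pure-Python sketch of a tiled transpose. It is an illustration of the general idea only, not IREE's LLVMGPUTransposeSharedMem or TileAndFuse implementation; the function name and the `tile` parameter are hypothetical. Each tile is read row by row from the source (contiguous reads) and written row by row to the destination (contiguous writes), which is the access pattern that makes vectorized loads and stores possible.

```python
def transpose_tiled(src, rows, cols, tile=4):
    """Transpose a row-major rows x cols flat list, one tile at a time."""
    dst = [None] * (rows * cols)  # result is cols x rows, row-major
    for r0 in range(0, rows, tile):
        for c0 in range(0, cols, tile):
            r1 = min(r0 + tile, rows)
            c1 = min(c0 + tile, cols)
            # Stage 1: read the tile row by row (contiguous slices of src).
            staged = [src[r * cols + c0 : r * cols + c1] for r in range(r0, r1)]
            # Stage 2: write the transposed tile row by row
            # (contiguous slices of dst).
            for c in range(c0, c1):
                out_row = [row[c - c0] for row in staged]
                start = c * rows + r0
                dst[start : start + len(out_row)] = out_row
    return dst


# Usage on a small 6 x 10 matrix whose sizes are not multiples of the tile.
rows, cols = 6, 10
src = list(range(rows * cols))
dst = transpose_tiled(src, rows, cols, tile=4)
```

On a GPU the staging buffer would typically live in shared memory or registers; the sketch only shows the indexing.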
