Open
Description
Input IR:
module {
func.func @main(%arg0: tensor<147456x512xbf16>) -> tensor<512x147456xbf16> {
%0 = tensor.empty() : tensor<512x147456xbf16>
%1 = linalg.generic {indexing_maps = [affine_map<(d0, d1) -> (d0, d1)>, affine_map<(d0, d1) -> (d1, d0)>], iterator_types = ["parallel", "parallel"]} ins(%arg0 : tensor<147456x512xbf16>) outs(%0 : tensor<512x147456xbf16>) {
^bb0(%in: bf16, %out: bf16):
linalg.yield %in : bf16
} -> tensor<512x147456xbf16>
return %1 : tensor<512x147456xbf16>
}
}
Perf:
$ iree-compile --iree-opt-level=O3 --iree-hal-target-device=hip --iree-hip-target=gfx942 transpose.mlir -o transpose.vmfb
$ iree-benchmark-module --module=transpose.vmfb --device=hip://0 --input=147456x512xbf16 --function=main
...
-----------------------------------------------------------------------------------------
Benchmark                              Time             CPU   Iterations UserCounters...
-----------------------------------------------------------------------------------------
BM_main/process_time/real_time     0.178 ms        0.217 ms         3802 items_per_second=5.60335k/s
Benchmark time is ~180us (pure dispatch time is ~130us); the target is ~20us.
- Currently goes down the LLVMGPUTransposeSharedMem pipeline.
- Should try using TileAndFuse instead, and make sure loads are vectorized.
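
To put the timings above in bandwidth terms, here is a rough back-of-the-envelope sketch. It assumes the transpose reads each input element once and writes each output element once (so traffic is twice the tensor size) and that 1 TB = 1e12 bytes. The helper names are made up for illustration and are not part of IREE.

```python
# Shape and element size taken from the input IR: tensor<147456x512xbf16>.
ROWS, COLS = 147456, 512
BF16_BYTES = 2


def transpose_traffic_bytes(rows, cols, elem_bytes):
    """Assumed minimum traffic: read every element once, write it once."""
    return 2 * rows * cols * elem_bytes


def effective_bandwidth_tb_per_s(total_bytes, time_us):
    """Effective bandwidth in TB/s (1 TB = 1e12 bytes) for a time in microseconds."""
    return total_bytes / (time_us * 1e-6) / 1e12


traffic = transpose_traffic_bytes(ROWS, COLS, BF16_BYTES)
print(f"traffic: {traffic} bytes")

# Times quoted in this issue: full benchmark, pure dispatch, and the target.
for label, t_us in [("benchmark", 180), ("dispatch", 130), ("target", 20)]:
    bw = effective_bandwidth_tb_per_s(traffic, t_us)
    print(f"{label:>9} ({t_us} us): {bw:.2f} TB/s")
```

Under these assumptions the current dispatch time works out to roughly 2.3 TB/s of effective bandwidth, and the ~20us target to roughly 15 TB/s.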