Description
What happened?
The iree_linalg_ext.fft operation produces two output tensors: the real part and the imaginary part of the result. If the tensors are large enough and one of the output tensors is not used later in the program, compilation fails with error: 'func.func' op exceeded stack allocation limit of 32768 bytes for function. This is surprising for users because the exact same program compiles with smaller tensor sizes and only fails once the tensors are large enough, even when the size of the 1D FFT is unchanged.
For example, here is the last part of an IRFFT (which takes a complex input and returns a real output):
%5:2 = iree_linalg_ext.fft ins(%c1, %cst_8, %cst_7 : index, tensor<1xf32>, tensor<1xf32>) outs(%4#0, %4#1 : tensor<32xf32>, tensor<32xf32>) : tensor<32xf32>, tensor<32xf32>
%6:2 = iree_linalg_ext.fft ins(%c2, %cst_6, %cst_5 : index, tensor<2xf32>, tensor<2xf32>) outs(%5#0, %5#1 : tensor<32xf32>, tensor<32xf32>) : tensor<32xf32>, tensor<32xf32>
%7:2 = iree_linalg_ext.fft ins(%c3, %cst_4, %cst_3 : index, tensor<4xf32>, tensor<4xf32>) outs(%6#0, %6#1 : tensor<32xf32>, tensor<32xf32>) : tensor<32xf32>, tensor<32xf32>
%8:2 = iree_linalg_ext.fft ins(%c4, %cst_2, %cst_1 : index, tensor<8xf32>, tensor<8xf32>) outs(%7#0, %7#1 : tensor<32xf32>, tensor<32xf32>) : tensor<32xf32>, tensor<32xf32>
%9:2 = iree_linalg_ext.fft ins(%c5, %cst_0, %cst : index, tensor<16xf32>, tensor<16xf32>) outs(%8#0, %8#1 : tensor<32xf32>, tensor<32xf32>) : tensor<32xf32>, tensor<32xf32>
util.return %9#0 : tensor<32xf32>
Notice that it returns %9#0 and ignores %9#1 because the imaginary part is not needed. The problem is that, once the input sizes get large enough, compilation fails with error: 'func.func' op exceeded stack allocation limit. The reason is that the imaginary part of that last FFT gets marked as readonly. Here is what the tensors passed to that last FFT are turned into:
%0 = hal.interface.binding.subspan layout(#pipeline_layout3) binding(0) alignment(64) offset(%c0) flags(Indirect) : !iree_tensor_ext.dispatch.tensor<readwrite:tensor<32xf32>>
%1 = hal.interface.binding.subspan layout(#pipeline_layout3) binding(1) alignment(64) offset(%c256) flags("ReadOnly|Indirect") : !iree_tensor_ext.dispatch.tensor<readonly:tensor<32xf32>>
We've been working around this by using the imaginary output in some non-trivial way, essentially fooling the compiler into not marking it readonly. I also discovered that applying util.optimization_barrier to %9#1 works around the issue.
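As a rough sketch of the barrier workaround, the tail of the IRFFT above might look like the following. This is a fragment rather than a complete module: it assumes the surrounding function and the earlier FFT stages are unchanged, and the result name %unused_imag is made up for illustration.

```mlir
// Last FFT stage, unchanged from the example above.
%9:2 = iree_linalg_ext.fft ins(%c5, %cst_0, %cst : index, tensor<16xf32>, tensor<16xf32>) outs(%8#0, %8#1 : tensor<32xf32>, tensor<32xf32>) : tensor<32xf32>, tensor<32xf32>
// Workaround: route the otherwise-unused imaginary output through a barrier
// so the compiler still sees a use of it. %unused_imag is a hypothetical name.
%unused_imag = util.optimization_barrier %9#1 : tensor<32xf32>
// Only the real part is returned, as before.
util.return %9#0 : tensor<32xf32>
```

This only hides the problem. The imaginary output is still computed and kept alive even though nothing consumes it.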
@MaheshRavishankar suggested in issue #22695 that the right fix would be to vectorize the FFT operation.
@hanhanW Moving the conversation here from #22473
Repro case is attached here: irfft.zip
Steps to reproduce your issue
iree-compile --iree-hal-target-device=local --iree-hal-local-target-device-backends=llvm-cpu --iree-llvmcpu-target-cpu=host -o /dev/null irfft.mlir
What component(s) does this issue relate to?
Compiler
Version information
3.9.0 (I also tried 3.8.0 and 3.5.0)
Additional context
No response