
Getting "op exceeded stack allocation limit" from linalg_ext.fft #22776

@pstarkcdpr

What happened?

The linalg_ext.fft operation produces two tensors: the real part and the imaginary part of the result. If the tensors are large enough and one of the output tensors is not used later in the program, compilation fails with error: 'func.func' op exceeded stack allocation limit of 32768 bytes for function. This error is surprising for users because the exact same program compiles with smaller tensors and fails only once the tensors are large enough, even though the size of the 1D FFT is unchanged.

For example, here is the last part of an IRFFT (takes a complex input and returns a real output).

%5:2 = iree_linalg_ext.fft ins(%c1, %cst_8, %cst_7 : index, tensor<1xf32>, tensor<1xf32>) outs(%4#0, %4#1 : tensor<32xf32>, tensor<32xf32>) : tensor<32xf32>, tensor<32xf32>
%6:2 = iree_linalg_ext.fft ins(%c2, %cst_6, %cst_5 : index, tensor<2xf32>, tensor<2xf32>) outs(%5#0, %5#1 : tensor<32xf32>, tensor<32xf32>) : tensor<32xf32>, tensor<32xf32>
%7:2 = iree_linalg_ext.fft ins(%c3, %cst_4, %cst_3 : index, tensor<4xf32>, tensor<4xf32>) outs(%6#0, %6#1 : tensor<32xf32>, tensor<32xf32>) : tensor<32xf32>, tensor<32xf32>
%8:2 = iree_linalg_ext.fft ins(%c4, %cst_2, %cst_1 : index, tensor<8xf32>, tensor<8xf32>) outs(%7#0, %7#1 : tensor<32xf32>, tensor<32xf32>) : tensor<32xf32>, tensor<32xf32>
%9:2 = iree_linalg_ext.fft ins(%c5, %cst_0, %cst : index, tensor<16xf32>, tensor<16xf32>) outs(%8#0, %8#1 : tensor<32xf32>, tensor<32xf32>) : tensor<32xf32>, tensor<32xf32>
util.return %9#0 : tensor<32xf32>

Notice that it returns %9#0 and ignores %9#1 because the imaginary part is not needed. The problem is that once the input sizes get large enough, compilation fails with error: 'func.func' op exceeded stack allocation limit. The reason is that the imaginary output of that last FFT gets marked as readonly. Here is what the two output tensors of that last FFT get turned into:

%0 = hal.interface.binding.subspan layout(#pipeline_layout3) binding(0) alignment(64) offset(%c0) flags(Indirect) : !iree_tensor_ext.dispatch.tensor<readwrite:tensor<32xf32>>
%1 = hal.interface.binding.subspan layout(#pipeline_layout3) binding(1) alignment(64) offset(%c256) flags("ReadOnly|Indirect") : !iree_tensor_ext.dispatch.tensor<readonly:tensor<32xf32>>

We've been working around this by using the imaginary output in some non-trivial way, essentially fooling the compiler into not marking it readonly. I discovered that applying util.optimization_barrier to %9#1 also works around the issue.
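A minimal sketch of the barrier workaround, applied to the tail of the snippet above. The result name %imag is hypothetical, and the exact assembly format of util.optimization_barrier is assumed here and may differ between IREE versions:

```mlir
// Last FFT stage, unchanged from the snippet above.
%9:2 = iree_linalg_ext.fft ins(%c5, %cst_0, %cst : index, tensor<16xf32>, tensor<16xf32>) outs(%8#0, %8#1 : tensor<32xf32>, tensor<32xf32>) : tensor<32xf32>, tensor<32xf32>
// Route the unused imaginary output through a barrier so the compiler treats it as used.
// %imag is a hypothetical name; its value is never read.
%imag = util.optimization_barrier %9#1 : tensor<32xf32>
util.return %9#0 : tensor<32xf32>
```

With the barrier in place, the imaginary output still counts as used, which appears to be enough to keep its binding from being marked readonly.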

@MaheshRavishankar suggested in issue #22695 that the right fix would be to vectorize the FFT operation.

@hanhanW Moving the conversation here from #22473

Repro case is attached here: irfft.zip

Steps to reproduce your issue

iree-compile --iree-hal-target-device=local --iree-hal-local-target-device-backends=llvm-cpu --iree-llvmcpu-target-cpu=host -o /dev/null irfft.mlir

What component(s) does this issue relate to?

Compiler

Version information

3.9.0 (I also tried 3.8.0 and 3.5.0)


Labels: bug
