Currently the amount of memory needed is proportional to both the length of the audio clip and the size of the model.
Short clips run much faster on the GPU, but long clips can't be processed at all on smaller GPUs.
It's already possible to set CUDA_VISIBLE_DEVICES='' to force CPU use, but that's not very practical: it requires a second pass after the GPU pass has failed on some audio.
One idea would be to build that second pass into the operator itself: for each clip, try the GPU first, then fall back to the CPU.
An even smarter system would keep track of the maximum audio size that works on each device, so it doesn't waste time trying the GPU when a new clip is known to exceed that limit.
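A minimal sketch of that idea, assuming a hypothetical `transcribe(audio, device)` callable that raises a memory error (standing in here for a CUDA out-of-memory error) when a clip is too large for the device; all names are illustrative, not part of any existing API:

```python
class FallbackTranscriber:
    """GPU-first wrapper that falls back to CPU and remembers GPU failures."""

    def __init__(self, transcribe):
        self.transcribe = transcribe
        # Smallest clip length known to fail on the GPU; clips at least
        # this long skip the GPU attempt entirely.
        self.gpu_fail_threshold = float("inf")

    def __call__(self, audio):
        if len(audio) < self.gpu_fail_threshold:
            try:
                return self.transcribe(audio, device="cuda")
            except MemoryError:
                # Record the failure so future clips of this size or
                # larger go straight to the CPU.
                self.gpu_fail_threshold = min(self.gpu_fail_threshold, len(audio))
        return self.transcribe(audio, device="cpu")


# Hypothetical backend simulating a GPU that OOMs on clips longer than 4.
def fake_transcribe(audio, device):
    if device == "cuda" and len(audio) > 4:
        raise MemoryError("simulated CUDA OOM")
    return (device, len(audio))


runner = FallbackTranscriber(fake_transcribe)
print(runner([0] * 3))   # small clip stays on the GPU
print(runner([0] * 10))  # OOMs on GPU, falls back to CPU
print(runner([0] * 12))  # exceeds the recorded limit, skips the GPU attempt
```

This only caches a single threshold; a real operator would likely key it per device and reset it when the model or GPU changes.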