- arXiv: Stable Audio Open paper
- HuggingFace: model weights
- stable-audio-tools: code to reproduce Stable Audio
- stable-audio-metrics: code to evaluate Stable Audio
Stable Audio Open generates variable-length (up to 47s) stereo audio at 44.1kHz from text prompts. It comprises three components: an autoencoder that compresses waveforms into a manageable sequence length, a T5-based text embedding for text conditioning, and a transformer-based diffusion (DiT) model that operates in the latent space of the autoencoder.
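To give a sense of why the autoencoder is needed, the sketch below computes how many samples a maximum-length clip contains and how long the sequence seen by the diffusion model would be after compression. The sample rate and the 47 s limit come from the description above. The downsampling factor of 2048 is an assumption used only for illustration, and the function names are hypothetical.

```python
import math

SAMPLE_RATE_HZ = 44_100   # stated above: 44.1 kHz
MAX_DURATION_S = 47       # stated above: up to 47 s
NUM_CHANNELS = 2          # stated above: stereo


def waveform_samples(duration_s, sample_rate_hz=SAMPLE_RATE_HZ):
    """Number of samples per channel for a clip of the given duration."""
    return int(round(duration_s * sample_rate_hz))


def latent_length(num_samples, downsampling_factor):
    """Latent sequence length if the autoencoder shortens the time axis
    by `downsampling_factor` (partial frames are rounded up)."""
    return math.ceil(num_samples / downsampling_factor)


# ASSUMPTION: 2048 is an illustrative temporal downsampling factor,
# not a value taken from the text above.
ASSUMED_DOWNSAMPLING = 2048

samples = waveform_samples(MAX_DURATION_S)
latents = latent_length(samples, ASSUMED_DOWNSAMPLING)

print(f"samples per channel: {samples}")
print(f"total samples (stereo): {samples * NUM_CHANNELS}")
print(f"latent steps at assumed factor {ASSUMED_DOWNSAMPLING}: {latents}")
```

Under this assumed factor, the transformer attends over roughly a thousand latent steps instead of about two million waveform samples per channel, which is what makes the sequence length manageable.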
This comparison is useful for evaluating the audio fidelity of the autoencoder. On the left is the ground truth recording. On the right, the same ground truth recording has been passed through one of the autoencoders or neural audio codecs listed below.
Ground truth | Stable Audio Open | Stable Audio 2.0 | DAC |
---|---|---|---|