
Audio Intelligence

Overview

This repository contains implementations of several state-of-the-art audio intelligence research projects from NVIDIA.

Projects

Audio Understanding, Generation, and Reasoning

Audio Flamingo 3 (Audio Understanding)

Advancing Audio Intelligence with Fully Open Large Audio Language Models


Audio Flamingo 3 is a 7B audio language model that uses the LLaVA architecture for audio understanding. We trained our unified AF-Whisper audio encoder, based on Whisper, to handle understanding beyond speech recognition. We included speech-related tasks in Audio Flamingo 3 and scaled the training dataset up to about 50M audio-text pairs. As a result, Audio Flamingo 3 can handle all three audio modalities: sound, music, and speech. It outperforms prior SOTA models on a number of understanding and reasoning benchmarks. Audio Flamingo 3 accepts audio inputs up to 10 minutes long and has a streaming TTS module (AF3-Chat) for voice output.
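
As a rough illustration of the LLaVA-style design mentioned above, the sketch below shows the generic pattern of such an audio language model: encoder features are projected into the LLM's embedding space and prepended to the text tokens. All module names and sizes here are hypothetical stand-ins, not Audio Flamingo 3's actual implementation.

```python
import torch
import torch.nn as nn

# Illustrative LLaVA-style audio LM: audio encoder -> projector ->
# prepend to text embeddings -> LLM. Sizes are hypothetical placeholders.
class AudioLLMSketch(nn.Module):
    def __init__(self, audio_dim=1280, llm_dim=512, vocab_size=32000):
        super().__init__()
        # Stand-in for the AF-Whisper audio encoder (typically pretrained).
        self.audio_encoder = nn.Linear(audio_dim, audio_dim)
        # Projector mapping audio features into the LLM embedding space.
        self.projector = nn.Sequential(
            nn.Linear(audio_dim, llm_dim), nn.GELU(), nn.Linear(llm_dim, llm_dim)
        )
        self.text_embed = nn.Embedding(vocab_size, llm_dim)
        # Stand-in for the 7B LLM backbone.
        self.llm = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(llm_dim, nhead=8, batch_first=True),
            num_layers=2,
        )
        self.lm_head = nn.Linear(llm_dim, vocab_size)

    def forward(self, audio_feats, text_ids):
        # audio_feats: (B, T_audio, audio_dim); text_ids: (B, T_text)
        audio_tokens = self.projector(self.audio_encoder(audio_feats))
        text_tokens = self.text_embed(text_ids)
        # Audio tokens are prepended so the LLM attends over both modalities.
        h = self.llm(torch.cat([audio_tokens, text_tokens], dim=1))
        return self.lm_head(h)

model = AudioLLMSketch()
logits = model(torch.randn(1, 50, 1280), torch.randint(0, 32000, (1, 16)))
print(logits.shape)  # torch.Size([1, 66, 32000])
```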


UALM (Audio Understanding and Generation)


ETTA (Audio Generation)

Elucidating the Design Space of Text-to-Audio Models

Improving Text-To-Audio Models with Synthetic Captions

ETTA is a 1.4B latent diffusion model for text-to-audio generation. We trained ETTA on over 1M synthetic captions annotated by Audio Flamingo, and showed that this approach leads to high-quality audio generation as well as emergent abilities at scale.
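
To make the latent-diffusion recipe concrete, here is a generic conditional sampling loop with classifier-free guidance, the standard mechanism for steering a text-to-audio LDM with a caption. The `denoiser` interface, schedule, and latent shape are illustrative assumptions, not ETTA's actual sampler.

```python
import torch

# Generic latent-diffusion sampling loop with classifier-free guidance.
# `denoiser`, the schedule, and the latent shape are toy placeholders.
def sample_latent(denoiser, text_emb, null_emb, steps=50, guidance=3.0,
                  shape=(1, 8, 256)):
    z = torch.randn(shape)  # start from Gaussian noise in the latent space
    ts = torch.linspace(1.0, 0.0, steps + 1)
    for i in range(steps):
        t = ts[i].expand(shape[0])
        # Classifier-free guidance: mix conditional and unconditional outputs.
        eps_c = denoiser(z, t, text_emb)
        eps_u = denoiser(z, t, null_emb)
        eps = eps_u + guidance * (eps_c - eps_u)
        # Simple Euler step toward t=0 (real samplers use DDIM/EDM schedules).
        z = z - eps * (ts[i] - ts[i + 1])
    return z  # then decode with the latent autoencoder's decoder

# Toy denoiser so the sketch runs end to end.
toy = lambda z, t, c: z * 0.1
z0 = sample_latent(toy, text_emb=None, null_emb=None)
print(z0.shape)  # torch.Size([1, 8, 256])
```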


Fugatto 1 (Audio Editing and Generation)

Foundational Generative Audio Transformer Opus 1


Fugatto is a versatile audio synthesis and transformation model capable of following free-form text instructions with optional audio inputs.


DCASE 2025 Challenge Task 5 (Audio Understanding Challenge)

Multi-Domain Audio Question Answering Toward Acoustic Content Reasoning in the DCASE 2025 Challenge


TangoFlux (Audio Generation)

TangoFlux: Super Fast and Faithful Text-to-Audio Generation with Flow Matching and CLAP-Ranked Preference Optimization


TangoFlux is an efficient and high-quality text-to-audio model built on FluxTransformer and CLAP-ranked preference optimization. This project was a collaboration with SUTD and Lambda Labs.
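
For readers unfamiliar with flow matching: the model learns a velocity field, and sampling integrates the ODE dx/dt = v(x, t, cond) from noise (t=0) to data (t=1). Below is a minimal Euler-integrator sketch of that idea, not TangoFlux's actual code.

```python
import torch

# Flow-matching sampling: integrate a learned velocity field from noise
# to data. The velocity function and shape here are toy placeholders.
def flow_matching_sample(velocity_fn, cond, steps=25, shape=(1, 64, 1024)):
    x = torch.randn(shape)       # x(0) ~ N(0, I)
    dt = 1.0 / steps
    for i in range(steps):
        t = torch.full((shape[0],), i * dt)
        x = x + velocity_fn(x, t, cond) * dt   # Euler step along the flow
    return x                     # x(1): a sample in the model's latent space

toy_v = lambda x, t, c: -x       # placeholder velocity field
sample = flow_matching_sample(toy_v, cond=None)
print(sample.shape)              # torch.Size([1, 64, 1024])
```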


OMCAT (Audio-Visual Understanding)

Omni Context Aware Transformer

OMCAT is an audio-visual understanding model with ROTE (Rotary Time Embeddings).
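
As a rough intuition for rotary time embeddings, the sketch below applies standard RoPE-style rotations keyed to timestamps rather than token indices, so attention scores become a function of relative time. This is generic rotary-embedding math, not OMCAT's exact RoTE formulation.

```python
import torch

# Rotate each feature pair by an angle proportional to the token's
# timestamp (generic RoPE math applied to time, for illustration only).
def rotary_time_embed(x, times, base=10000.0):
    # x: (B, T, D) features; times: (B, T) timestamps (e.g., seconds)
    d = x.shape[-1] // 2
    freqs = base ** (-torch.arange(d, dtype=torch.float32) / d)  # (d,)
    angles = times.unsqueeze(-1) * freqs                         # (B, T, d)
    cos, sin = torch.cos(angles), torch.sin(angles)
    x1, x2 = x[..., :d], x[..., d:]
    return torch.cat([x1 * cos - x2 * sin, x1 * sin + x2 * cos], dim=-1)

feats = torch.randn(1, 10, 64)
times = torch.linspace(0, 5, 10).unsqueeze(0)    # timestamps in seconds
print(rotary_time_embed(feats, times).shape)     # torch.Size([1, 10, 64])
```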






Representation Learning

UniWav (Speech Codec)

Towards Unified Pre-training for Speech Representation Learning and Generation






Audio Enhancement

A2SB (Bandwidth Extension and Inpainting)

Audio-to-Audio Schrödinger Bridges


A2SB is an audio restoration model tailored for high-resolution music at 44.1 kHz. It is capable of both bandwidth extension (predicting high-frequency components) and inpainting (re-generating missing segments). Critically, A2SB is end-to-end, predicting waveform outputs without a vocoder, and can restore hour-long audio inputs. It achieves state-of-the-art bandwidth extension and inpainting quality on several out-of-distribution music test sets.
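
The two restoration tasks can be pictured via the degradations they invert: low-pass filtering for bandwidth extension, and zeroed-out segments for inpainting. A toy sketch with a hypothetical cutoff and mask (not A2SB's training pipeline):

```python
import torch
import torchaudio.functional as F

sr = 44100
wav = torch.randn(1, sr * 2)                 # 2 s of placeholder audio

# Bandwidth extension input: low-pass filtering removes content above
# ~4 kHz; the model must predict the missing high-frequency band.
bwe_input = F.lowpass_biquad(wav, sample_rate=sr, cutoff_freq=4000.0)

# Inpainting input: zero out a 0.5 s segment; the model must regenerate it.
inpaint_input = wav.clone()
inpaint_input[:, sr // 2 : sr] = 0.0

print(bwe_input.shape, inpaint_input.shape)
```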


CleanUNet (Speech Denoising)

CleanUNet: Speech Denoising in the Waveform Domain with Self-Attention

CleanUNet 2: A Hybrid Speech Denoising Model on Waveform and Spectrogram


CleanUNet is a causal speech denoising model that operates on the raw waveform. CleanUNet 2 combines the advantages of waveform and spectrogram denoisers, achieving the best of both worlds.
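
"Causal" here means each output sample depends only on current and past input samples, which is what enables streaming denoising. Below is a minimal causal 1-D convolution stack in that spirit; it is a generic sketch, not CleanUNet's actual architecture.

```python
import torch
import torch.nn as nn

class CausalConv1d(nn.Conv1d):
    def forward(self, x):
        # Left-pad so the receptive field never looks into the future.
        pad = (self.kernel_size[0] - 1) * self.dilation[0]
        return super().forward(nn.functional.pad(x, (pad, 0)))

# Toy causal denoiser: dilated causal convs predicting the clean waveform.
denoiser = nn.Sequential(
    CausalConv1d(1, 32, kernel_size=5), nn.ReLU(),
    CausalConv1d(32, 32, kernel_size=5, dilation=2), nn.ReLU(),
    CausalConv1d(32, 1, kernel_size=1),
)

noisy = torch.randn(1, 1, 16000)          # 1 s of 16 kHz placeholder audio
clean_est = denoiser(noisy)
print(clean_est.shape)                    # torch.Size([1, 1, 16000])
```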






Text-to-Speech Models

BigVGAN-v2

A Universal Neural Vocoder with Large-Scale Training


BigVGAN-v2 is a widely used universal vocoder that generalizes well to various out-of-distribution scenarios without fine-tuning. We release checkpoints in a range of configurations, including different sampling rates.
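
A usage sketch following the pattern documented in the BigVGAN repository; the checkpoint name and audio path are examples, so double-check against the repo's own README.

```python
import torch
import librosa
import bigvgan                       # from the BigVGAN repository
from meldataset import get_mel_spectrogram

device = "cuda" if torch.cuda.is_available() else "cpu"
# Example checkpoint name; other configurations/sampling rates are released.
model = bigvgan.BigVGAN.from_pretrained(
    "nvidia/bigvgan_v2_24khz_100band_256x", use_cuda_kernel=False
)
model.remove_weight_norm()           # inference-only optimization
model = model.eval().to(device)

# Load audio at the model's sampling rate and compute its mel spectrogram.
wav, sr = librosa.load("example.wav", sr=model.h.sampling_rate, mono=True)
mel = get_mel_spectrogram(torch.FloatTensor(wav).unsqueeze(0), model.h).to(device)

with torch.inference_mode():
    wav_gen = model(mel)             # synthesized waveform
```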


P-Flow and A2-Flow

P-Flow: A Fast and Data-Efficient Zero-Shot TTS through Speech Prompting

A2-Flow: Alignment-Aware Pre-training for Speech Synthesis with Flow Matching


Please refer to the Magpie-TTS API for commercial use of NVIDIA's TTS models that leverage techniques from these papers.


DiffWave

A Versatile Diffusion Model for Audio Synthesis


DiffWave is the first diffusion model for raw waveform synthesis. It is a versatile model for both speech and non-speech audio generation.
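
At its core, DiffWave iterates a learned denoising step from pure noise down to a waveform. The sketch below shows one generic DDPM-style ancestral sampling step; the schedule and noise predictor are toy placeholders, not DiffWave's exact implementation.

```python
import torch

# One DDPM-style reverse (denoising) step on a raw waveform.
def reverse_step(eps_model, x_t, t, betas):
    beta_t = betas[t]
    alpha_t = 1.0 - beta_t
    alpha_bar_t = torch.prod(1.0 - betas[: t + 1])
    # Predict the noise component and remove a scaled portion of it.
    eps = eps_model(x_t, torch.tensor([t]))
    mean = (x_t - beta_t / torch.sqrt(1.0 - alpha_bar_t) * eps) / torch.sqrt(alpha_t)
    if t == 0:
        return mean
    return mean + torch.sqrt(beta_t) * torch.randn_like(x_t)

betas = torch.linspace(1e-4, 0.05, 50)           # toy noise schedule
x = torch.randn(1, 16000)                        # start from pure noise
toy_eps = lambda x, t: torch.zeros_like(x)       # placeholder noise predictor
for t in reversed(range(50)):
    x = reverse_step(toy_eps, x, t, betas)
print(x.shape)                                   # torch.Size([1, 16000])
```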


WaveGlow

A Flow-based Generative Network for Speech Synthesis



Flowtron

Flowtron: an Autoregressive Flow-based Generative Network for Text-to-Speech Synthesis



RAD-TTS and RAD-MMM

RAD-TTS: Parallel Flow-Based TTS with Robust Alignment Learning and Diverse Synthesis

RAD-MMM: Multilingual Multiaccented Multispeaker TTS with RADTTS

License

The code for different projects may be released under different licenses, including MIT, the NVIDIA OneWay Noncommercial License, and the NVIDIA Source Code License, among others. Please refer to each project folder or its original GitHub repository for the detailed license.
