This repository contains implementations of several state-of-the-art audio intelligence research projects from NVIDIA.
Advancing Audio Intelligence with Fully Open Large Audio Language Models
Audio Flamingo 3 is a 7B audio language model built on the LLaVA architecture for audio understanding. We trained a unified AF-Whisper audio encoder, based on Whisper, to handle understanding beyond speech recognition, included speech-related tasks in training, and scaled the training data to about 50M audio-text pairs. As a result, Audio Flamingo 3 handles all three audio modalities: sound, music, and speech, and outperforms prior SOTA models on a number of understanding and reasoning benchmarks. It accepts audio inputs up to 10 minutes long and includes a streaming TTS module (AF3-Chat) for voice output.
Elucidating the Design Space of Text-to-Audio Models
Improving Text-To-Audio Models with Synthetic Captions
ETTA is a 1.4B latent diffusion model for text-to-audio generation. We trained ETTA on over 1M synthetic captions annotated by Audio Flamingo and showed that this approach leads to high-quality audio generation as well as emergent abilities with scale.
Foundational Generative Audio Transformer Opus 1
Fugatto is a versatile audio synthesis and transformation model capable of following free-form text instructions with optional audio inputs.
Multi-Domain Audio Question Answering Toward Acoustic Content Reasoning in the DCASE 2025 Challenge
TangoFlux: Super Fast and Faithful Text-to-Audio Generation with Flow Matching and CLAP-Ranked Preference Optimization
TangoFlux is an efficient, high-quality text-to-audio model built on a FluxTransformer backbone with CLAP-ranked preference optimization. This project was a collaboration with SUTD and Lambda Labs.
Omni Context Aware Transformer
OMCAT is an audio-visual understanding model with RoTE (Rotary Time Embeddings).
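OMCAT's exact formulation is not reproduced here, but the core idea of a rotary time embedding is standard: rotate consecutive feature pairs by angles proportional to a token's timestamp, so that dot products between tokens depend on their relative time offset. A minimal, illustrative sketch (function and parameter names are ours, not OMCAT's):

```python
import math

def rotary_time_embed(features, t, base=10000.0):
    """Rotate consecutive feature pairs by angles proportional to timestamp t.

    features: one token's feature vector (even length)
    t: timestamp, e.g. seconds into the audio/video stream
    Illustrative sketch only -- OMCAT's RoTE details may differ.
    """
    d = len(features)
    out = []
    for i in range(0, d, 2):
        # frequency decays with dimension index, as in standard RoPE
        theta = t / (base ** (i / d))
        x, y = features[i], features[i + 1]
        out.append(x * math.cos(theta) - y * math.sin(theta))
        out.append(x * math.sin(theta) + y * math.cos(theta))
    return out
```

Because each pair is rotated rather than scaled, the embedding preserves vector norms, and attention scores between two embedded tokens depend only on the difference of their timestamps.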
Towards Unified Pre-training for Speech Representation Learning and Generation
Audio-to-Audio Schrödinger Bridges
A2SB is an audio restoration model tailored for high-resolution 44.1kHz music. It performs both bandwidth extension (predicting high-frequency components) and inpainting (re-generating missing segments). Critically, A2SB is end-to-end, predicting waveform outputs without a vocoder, and can restore hour-long audio inputs. It achieves state-of-the-art bandwidth extension and inpainting quality on several out-of-distribution music test sets.
CleanUNet: Speech Denoising in the Waveform Domain with Self-Attention
CleanUNet 2: A Hybrid Speech Denoising Model on Waveform and Spectrogram
CleanUNet is a causal speech denoising model operating on the raw waveform. CleanUNet 2 combines the advantages of waveform and spectrogram denoisers, achieving the best of both worlds.
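Causality here means each output sample depends only on current and past input samples, which is what makes streaming denoising possible. A toy illustration of a causal 1-D convolution (left-padding only), not CleanUNet's actual layers:

```python
def causal_conv1d(x, kernel):
    """Causal 1-D convolution: output[t] uses only x[t], x[t-1], ...

    Left-pads with zeros so the output has the same length as the input.
    Toy illustration of causality, not CleanUNet's architecture.
    """
    k = len(kernel)
    padded = [0.0] * (k - 1) + list(x)
    # output[t] = sum_j kernel[j] * x[t - j]
    return [sum(kernel[j] * padded[t + k - 1 - j] for j in range(k))
            for t in range(len(x))]
```

A quick way to check causality: change a late sample of the input and confirm that earlier outputs are unaffected.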
A Universal Neural Vocoder with Large-Scale Training
BigVGAN-v2 is a widely used universal vocoder that generalizes well to various out-of-distribution scenarios without fine-tuning. We release checkpoints in several configurations, including different sampling rates.
P-Flow: A Fast and Data-Efficient Zero-Shot TTS through Speech Prompting
A2-Flow: Alignment-Aware Pre-training for Speech Synthesis with Flow Matching
For commercial use of NVIDIA's TTS models that build on techniques from these papers, please refer to the Magpie-TTS API.
A Versatile Diffusion Model for Audio Synthesis
DiffWave is the first diffusion model for raw waveform synthesis. It is a versatile waveform synthesis model for speech and non-speech generation.
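DiffWave follows the standard DDPM recipe: a network is trained to predict the noise added at each diffusion step, and sampling starts from Gaussian noise and iteratively denoises it into a waveform. A schematic of the reverse process follows; the real model conditions a neural network on inputs such as mel-spectrograms, so `eps_model` here is a stand-in, and the update rule is the textbook DDPM step rather than DiffWave's exact code:

```python
import math, random

def ddpm_sample(eps_model, length, betas, seed=0):
    """Schematic DDPM reverse process.

    eps_model(x, t) is a placeholder for DiffWave's noise-prediction
    network; betas is the forward-process noise schedule.
    """
    rng = random.Random(seed)
    alphas = [1.0 - b for b in betas]
    # cumulative products \bar{alpha}_t of the alphas
    abars, p = [], 1.0
    for a in alphas:
        p *= a
        abars.append(p)
    # start from pure Gaussian noise
    x = [rng.gauss(0.0, 1.0) for _ in range(length)]
    for t in reversed(range(len(betas))):
        eps = eps_model(x, t)
        coef = betas[t] / math.sqrt(1.0 - abars[t])
        # posterior mean: remove the predicted noise component
        x = [(xi - coef * ei) / math.sqrt(alphas[t]) for xi, ei in zip(x, eps)]
        if t > 0:  # add fresh noise at every step except the last
            sigma = math.sqrt(betas[t])
            x = [xi + sigma * rng.gauss(0.0, 1.0) for xi in x]
    return x
```

In the real model the schedule has tens of steps and `length` is the number of waveform samples; the sketch only shows the control flow of the sampler.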
A Flow-based Generative Network for Speech Synthesis
Flowtron: an Autoregressive Flow-based Generative Network for Text-to-Speech Synthesis
RAD-TTS: Parallel Flow-Based TTS with Robust Alignment Learning and Diverse Synthesis
RAD-MMM: Multilingual Multiaccented Multispeaker TTS with RADTTS
The code for each project may be released under a different license, including MIT, the NVIDIA One-Way Noncommercial License, and the NVIDIA Source Code License. Please refer to each project folder or its original GitHub link for the detailed license.