
Audio Intelligence

Overview

This repository contains implementations of several state-of-the-art audio intelligence research projects from NVIDIA.

Projects

Audio Understanding, Generation, and Reasoning

Audio Flamingo 3 (Audio Understanding)

Advancing Audio Intelligence with Fully Open Large Audio Language Models


Audio Flamingo 3 is a 7B audio language model that uses the LLaVA architecture for audio understanding. We trained our unified AF-Whisper audio encoder, based on Whisper, to handle understanding beyond speech recognition. We included speech-related tasks in Audio Flamingo 3 and scaled the training dataset up to about 50M audio-text pairs. As a result, Audio Flamingo 3 can handle all three audio modalities: sound, music, and speech. It outperforms prior SOTA models on a number of understanding and reasoning benchmarks. Audio Flamingo 3 accepts audio inputs up to 10 minutes long and has a streaming TTS module (AF3-Chat) for voice output.
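
As a rough illustration of the LLaVA-style design mentioned above, the sketch below shows the generic pattern of such an audio language model: encoder features are projected into the LLM's embedding space and prepended to the text tokens. All module names and sizes here are hypothetical stand-ins, not Audio Flamingo 3's actual implementation.

```python
import torch
import torch.nn as nn

# Illustrative LLaVA-style audio LM: audio encoder -> projector ->
# prepend to text embeddings -> LLM. Sizes are hypothetical placeholders.
class AudioLLMSketch(nn.Module):
    def __init__(self, audio_dim=1280, llm_dim=512, vocab_size=32000):
        super().__init__()
        # Stand-in for the AF-Whisper audio encoder (typically pretrained).
        self.audio_encoder = nn.Linear(audio_dim, audio_dim)
        # Projector mapping audio features into the LLM embedding space.
        self.projector = nn.Sequential(
            nn.Linear(audio_dim, llm_dim), nn.GELU(), nn.Linear(llm_dim, llm_dim)
        )
        self.text_embed = nn.Embedding(vocab_size, llm_dim)
        # Stand-in for the 7B LLM backbone.
        self.llm = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(llm_dim, nhead=8, batch_first=True),
            num_layers=2,
        )
        self.lm_head = nn.Linear(llm_dim, vocab_size)

    def forward(self, audio_feats, text_ids):
        # audio_feats: (B, T_audio, audio_dim); text_ids: (B, T_text)
        audio_tokens = self.projector(self.audio_encoder(audio_feats))
        text_tokens = self.text_embed(text_ids)
        # Audio tokens are prepended so the LLM attends over both modalities.
        h = self.llm(torch.cat([audio_tokens, text_tokens], dim=1))
        return self.lm_head(h)

model = AudioLLMSketch()
logits = model(torch.randn(1, 50, 1280), torch.randint(0, 32000, (1, 16)))
print(logits.shape)  # torch.Size([1, 66, 32000])
```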


UALM (Audio Understanding and Generation)


ETTA (Audio Generation)

Elucidating the Design Space of Text-to-Audio Models

Improving Text-To-Audio Models with Synthetic Captions

ETTA is a 1.4B latent diffusion model for text-to-audio generation. We trained ETTA on over 1M synthetic captions annotated by Audio Flamingo, and showed that this approach leads to high-quality audio generation as well as emergent abilities at scale.
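
To make the latent-diffusion recipe concrete, here is a generic conditional sampling loop with classifier-free guidance, the standard mechanism for steering a text-to-audio LDM with a caption. The `denoiser` interface, schedule, and latent shape are illustrative assumptions, not ETTA's actual sampler.

```python
import torch

# Generic latent-diffusion sampling loop with classifier-free guidance.
# `denoiser`, the schedule, and the latent shape are toy placeholders.
def sample_latent(denoiser, text_emb, null_emb, steps=50, guidance=3.0,
                  shape=(1, 8, 256)):
    z = torch.randn(shape)  # start from Gaussian noise in the latent space
    ts = torch.linspace(1.0, 0.0, steps + 1)
    for i in range(steps):
        t = ts[i].expand(shape[0])
        # Classifier-free guidance: mix conditional and unconditional outputs.
        eps_c = denoiser(z, t, text_emb)
        eps_u = denoiser(z, t, null_emb)
        eps = eps_u + guidance * (eps_c - eps_u)
        # Simple Euler step toward t=0 (real samplers use DDIM/EDM schedules).
        z = z - eps * (ts[i] - ts[i + 1])
    return z  # then decode with the latent autoencoder's decoder

# Toy denoiser so the sketch runs end to end.
toy = lambda z, t, c: z * 0.1
z0 = sample_latent(toy, text_emb=None, null_emb=None)
print(z0.shape)  # torch.Size([1, 8, 256])
```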


Fugatto 1 (Audio Editing and Generation)

Foundational Generative Audio Transformer Opus 1


Fugatto is a versatile audio synthesis and transformation model capable of following free-form text instructions with optional audio inputs.


DCASE 2025 Challenge Task 5 (Audio Understanding Challenge)

Multi-Domain Audio Question Answering Toward Acoustic Content Reasoning in the DCASE 2025 Challenge


TangoFlux (Audio Generation)

TangoFlux: Super Fast and Faithful Text-to-Audio Generation with Flow Matching and CLAP-Ranked Preference Optimization


TangoFlux is an efficient and high-quality text-to-audio model built on FluxTransformer and CLAP-ranked preference optimization. This project was a collaboration with SUTD and Lambda Labs.
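
For readers unfamiliar with flow matching: the model learns a velocity field, and sampling integrates the ODE dx/dt = v(x, t, cond) from noise (t=0) to data (t=1). Below is a minimal Euler-integrator sketch of that idea, not TangoFlux's actual code.

```python
import torch

# Flow-matching sampling: integrate a learned velocity field from noise
# to data. The velocity function and shape here are toy placeholders.
def flow_matching_sample(velocity_fn, cond, steps=25, shape=(1, 64, 1024)):
    x = torch.randn(shape)       # x(0) ~ N(0, I)
    dt = 1.0 / steps
    for i in range(steps):
        t = torch.full((shape[0],), i * dt)
        x = x + velocity_fn(x, t, cond) * dt   # Euler step along the flow
    return x                     # x(1): a sample in the model's latent space

toy_v = lambda x, t, c: -x       # placeholder velocity field
sample = flow_matching_sample(toy_v, cond=None)
print(sample.shape)              # torch.Size([1, 64, 1024])
```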


OMCAT (Audio-Visual Understanding)

Omni Context Aware Transformer

OMCAT is an audio-visual understanding model with ROTE (Rotary Time Embeddings).
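
As a rough intuition for rotary time embeddings, the sketch below applies standard RoPE-style rotations keyed to timestamps rather than token indices, so attention scores become a function of relative time. This is generic rotary-embedding math, not OMCAT's exact RoTE formulation.

```python
import torch

# Rotate each feature pair by an angle proportional to the token's
# timestamp (generic RoPE math applied to time, for illustration only).
def rotary_time_embed(x, times, base=10000.0):
    # x: (B, T, D) features; times: (B, T) timestamps (e.g., seconds)
    d = x.shape[-1] // 2
    freqs = base ** (-torch.arange(d, dtype=torch.float32) / d)  # (d,)
    angles = times.unsqueeze(-1) * freqs                         # (B, T, d)
    cos, sin = torch.cos(angles), torch.sin(angles)
    x1, x2 = x[..., :d], x[..., d:]
    return torch.cat([x1 * cos - x2 * sin, x1 * sin + x2 * cos], dim=-1)

feats = torch.randn(1, 10, 64)
times = torch.linspace(0, 5, 10).unsqueeze(0)    # timestamps in seconds
print(rotary_time_embed(feats, times).shape)     # torch.Size([1, 10, 64])
```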






Representation Learning

UniWav (Speech Codec)

Towards Unified Pre-training for Speech Representation Learning and Generation






Audio Enhancement

A2SB (Bandwidth Extension and Inpainting)

Audio-to-Audio Schrödinger Bridges


A2SB is an audio restoration model tailored for high-resolution music at 44.1 kHz. It is capable of both bandwidth extension (predicting high-frequency components) and inpainting (re-generating missing segments). Critically, A2SB is end-to-end, predicting waveform outputs without a vocoder, and can restore hour-long audio inputs. It achieves state-of-the-art bandwidth extension and inpainting quality on several out-of-distribution music test sets.
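
The two restoration tasks can be pictured via the degradations they invert: low-pass filtering for bandwidth extension, and zeroed-out segments for inpainting. A toy sketch with a hypothetical cutoff and mask (not A2SB's training pipeline):

```python
import torch
import torchaudio.functional as F

sr = 44100
wav = torch.randn(1, sr * 2)                 # 2 s of placeholder audio

# Bandwidth extension input: low-pass filtering removes content above
# ~4 kHz; the model must predict the missing high-frequency band.
bwe_input = F.lowpass_biquad(wav, sample_rate=sr, cutoff_freq=4000.0)

# Inpainting input: zero out a 0.5 s segment; the model must regenerate it.
inpaint_input = wav.clone()
inpaint_input[:, sr // 2 : sr] = 0.0

print(bwe_input.shape, inpaint_input.shape)
```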


CleanUNet (Speech Denoising)

CleanUNet: Speech Denoising in the Waveform Domain with Self-Attention

CleanUNet 2: A Hybrid Speech Denoising Model on Waveform and Spectrogram


CleanUNet is a causal speech denoising model that operates on the raw waveform. CleanUNet 2 combines the advantages of waveform and spectrogram denoisers, achieving the best of both worlds.
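
"Causal" here means each output sample depends only on current and past input samples, which is what enables streaming denoising. Below is a minimal causal 1-D convolution stack in that spirit; it is a generic sketch, not CleanUNet's actual architecture.

```python
import torch
import torch.nn as nn

class CausalConv1d(nn.Conv1d):
    def forward(self, x):
        # Left-pad so the receptive field never looks into the future.
        pad = (self.kernel_size[0] - 1) * self.dilation[0]
        return super().forward(nn.functional.pad(x, (pad, 0)))

# Toy causal denoiser: dilated causal convs predicting the clean waveform.
denoiser = nn.Sequential(
    CausalConv1d(1, 32, kernel_size=5), nn.ReLU(),
    CausalConv1d(32, 32, kernel_size=5, dilation=2), nn.ReLU(),
    CausalConv1d(32, 1, kernel_size=1),
)

noisy = torch.randn(1, 1, 16000)          # 1 s of 16 kHz placeholder audio
clean_est = denoiser(noisy)
print(clean_est.shape)                    # torch.Size([1, 1, 16000])
```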






Text-to-Speech Models

BigVGAN-v2

A Universal Neural Vocoder with Large-Scale Training


BigVGAN-v2 is a widely used universal vocoder that generalizes well to various out-of-distribution scenarios without fine-tuning. We release checkpoints in a range of configurations, including different sampling rates.
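
A usage sketch following the pattern documented in the BigVGAN repository; the checkpoint name and audio path are examples, so double-check against the repo's own README.

```python
import torch
import librosa
import bigvgan                       # from the BigVGAN repository
from meldataset import get_mel_spectrogram

device = "cuda" if torch.cuda.is_available() else "cpu"
# Example checkpoint name; other configurations/sampling rates are released.
model = bigvgan.BigVGAN.from_pretrained(
    "nvidia/bigvgan_v2_24khz_100band_256x", use_cuda_kernel=False
)
model.remove_weight_norm()           # inference-only optimization
model = model.eval().to(device)

# Load audio at the model's sampling rate and compute its mel spectrogram.
wav, sr = librosa.load("example.wav", sr=model.h.sampling_rate, mono=True)
mel = get_mel_spectrogram(torch.FloatTensor(wav).unsqueeze(0), model.h).to(device)

with torch.inference_mode():
    wav_gen = model(mel)             # synthesized waveform
```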


P-Flow and A2-Flow

P-Flow: A Fast and Data-Efficient Zero-Shot TTS through Speech Prompting

A2-Flow: Alignment-Aware Pre-training for Speech Synthesis with Flow Matching


Please refer to the Magpie-TTS API for commercial use of NVIDIA's TTS models that leverage techniques from these papers.


DiffWave

A Versatile Diffusion Model for Audio Synthesis


DiffWave is the first diffusion model for raw waveform synthesis. It is a versatile model for both speech and non-speech audio generation.
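
At its core, DiffWave iterates a learned denoising step from pure noise down to a waveform. The sketch below shows one generic DDPM-style ancestral sampling step; the schedule and noise predictor are toy placeholders, not DiffWave's exact implementation.

```python
import torch

# One DDPM-style reverse (denoising) step on a raw waveform.
def reverse_step(eps_model, x_t, t, betas):
    beta_t = betas[t]
    alpha_t = 1.0 - beta_t
    alpha_bar_t = torch.prod(1.0 - betas[: t + 1])
    # Predict the noise component and remove a scaled portion of it.
    eps = eps_model(x_t, torch.tensor([t]))
    mean = (x_t - beta_t / torch.sqrt(1.0 - alpha_bar_t) * eps) / torch.sqrt(alpha_t)
    if t == 0:
        return mean
    return mean + torch.sqrt(beta_t) * torch.randn_like(x_t)

betas = torch.linspace(1e-4, 0.05, 50)           # toy noise schedule
x = torch.randn(1, 16000)                        # start from pure noise
toy_eps = lambda x, t: torch.zeros_like(x)       # placeholder noise predictor
for t in reversed(range(50)):
    x = reverse_step(toy_eps, x, t, betas)
print(x.shape)                                   # torch.Size([1, 16000])
```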


WaveGlow

A Flow-based Generative Network for Speech Synthesis



Flowtron

Flowtron: an Autoregressive Flow-based Generative Network for Text-to-Speech Synthesis



RAD-TTS and RAD-MMM

RAD-TTS: Parallel Flow-Based TTS with Robust Alignment Learning and Diverse Synthesis

RAD-MMM: Multilingual Multiaccented Multispeaker TTS with RADTTS

License

The code for different projects may be released under different licenses, including MIT, the NVIDIA OneWay Noncommercial License, and the NVIDIA Source Code License, among others. Please refer to each project folder or its original GitHub repository for the detailed license.
