
Conversation

@lashahub commented Aug 19, 2025

This PR adds support for AudioFlamingo3 (AF3) — NVIDIA’s open large audio language model capable of reasoning over speech, sounds, and music.

It introduces the following components:

  • AudioFlamingo3 model class
  • AudioFlamingo3Processor for preprocessing text + audio
  • Configuration, modeling, and processing utilities
  • Example usage

With this integration, AF3 can be loaded directly from the Hugging Face Hub:

from transformers import AudioFlamingo3Processor, AudioFlamingo3

# Load the processor and model directly from the Hub
processor = AudioFlamingo3Processor.from_pretrained("nvidia/audio-flamingo-3")
model = AudioFlamingo3.from_pretrained("nvidia/audio-flamingo-3")

prompt = "What is happening in the audio?"
audio = "clap.wav"

# Preprocess the text prompt and audio file into model inputs
input_ids, media, media_meta = processor(prompt, audio)
output_ids = model.generate(
    input_ids=input_ids,
    media=media,
    media_meta=media_meta,
    generation_config=model.default_generation_config,
)
print(processor.decode(output_ids))
# Example output: "A crowd is applauding and cheering."

@ebezzam left a comment

Thanks @lashahub for your PR! This is a very exciting model to add to the Transformers library!

I see that you've taken inspiration from Llava, which makes sense as you combine modules of different modalities.

Most of my comments are about rearranging your modules so that they fit the Transformers convention, which will make your new model more convenient for others to use and test. To that end, Llava and this PR (for another audio model) might serve as useful examples of which files will be added/modified.

Below are my suggested steps.

1. Refactoring / reorganizing current files according to Transformers convention

You can take inspiration from the above models for refactoring your configuration, modeling and processing files, specifically:

  • Consolidating your configurations. From what I see, you may need only one config (like Llava) or two (like Dia): AudioFlamingo3Config and AudioFlamingo3EncoderConfig.
  • Processor: you can take inspiration from Dia to group your feature extractor, text tokenizer, and audio tokenizer into a single component. Loading pre-trained feature extractors and LLMs can be handled directly by the processor, without you (or the user) having to manually load the corresponding weights like below (see the sketch after this list).
# Manual loading of sub-components from the original PR code;
# `llm_dir` is defined elsewhere in that script.
fe = WhisperFeatureExtractor.from_pretrained("openai/whisper-large-v3")
tok = AutoTokenizer.from_pretrained(
    llm_dir,
    padding_side="right",
    use_fast=True,
    legacy=False,
)

To this end, you'll need to create a model conversion script (with something like this) so that the configuration files are generated and the processor knows where to pull the relevant models.

  • The modeling file will shrink considerably because the audio/text tokenization will happen in the processor rather than inside the model, and the configuration file will also handle pulling the relevant LLM config.
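
As a rough illustration of the consolidation above, here's a minimal sketch of what the combined config and processor could look like, loosely following the Llava/Dia pattern. The class names and attributes here are assumptions for this sketch, not the final API:

# Hypothetical sketch: a consolidated config and processor. Names and
# attributes are illustrative assumptions, not the final API.
from transformers import PretrainedConfig
from transformers.processing_utils import ProcessorMixin

class AudioFlamingo3Config(PretrainedConfig):
    model_type = "audioflamingo3"

    def __init__(self, audio_config=None, text_config=None, **kwargs):
        super().__init__(**kwargs)
        # Sub-configs for the audio encoder and the backbone LLM, so the
        # modeling file no longer has to pull them in manually.
        self.audio_config = audio_config
        self.text_config = text_config

class AudioFlamingo3Processor(ProcessorMixin):
    # ProcessorMixin saves/loads both components with the model repo,
    # replacing the manual feature-extractor/tokenizer loading shown above.
    attributes = ["feature_extractor", "tokenizer"]
    feature_extractor_class = "WhisperFeatureExtractor"
    tokenizer_class = "AutoTokenizer"

With that in place, a single AudioFlamingo3Processor.from_pretrained(...) call would load both sub-components.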

2. Conversion script

This will be needed to convert your model weights / configuration into ones compatible with those defined in the files above.

This script can also handle uploading the Transformers-compatible model and its configuration to the Hugging Face Hub. You can again take inspiration from Llava and Dia.
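
For reference, a minimal skeleton of such a script might look as follows. The key renaming, model class name, and repo id are assumptions for illustration; the real mapping depends on the original checkpoint layout:

# Hypothetical conversion-script skeleton; the key renaming below is an
# illustrative assumption, not the actual checkpoint layout.
import torch

def convert_checkpoint(original_path, output_dir, push_to_hub=False):
    # Load the original checkpoint on CPU.
    state_dict = torch.load(original_path, map_location="cpu")

    # Rename original keys to the Transformers layout (assumed prefix).
    new_state_dict = {}
    for key, value in state_dict.items():
        new_state_dict[key.replace("llm.", "language_model.")] = value

    # Instantiate the Transformers model and load the remapped weights.
    config = AudioFlamingo3Config()  # the consolidated config sketched above
    model = AudioFlamingo3ForConditionalGeneration(config)  # assumed class name
    model.load_state_dict(new_state_dict)

    # Save locally and optionally upload to the Hub.
    model.save_pretrained(output_dir)
    if push_to_hub:
        model.push_to_hub("nvidia/audio-flamingo-3-hf")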

3. Testing, documentation, etc

Once your model implementation is consistent with other models implemented in Transformers, there's a lot of boilerplate code we can reuse to make using the model convenient and to apply the various testing suites. For example, you can look at the docs, src/transformers/models/auto, and tests/models/dia folders of the Dia PR for how to prepare / modify the relevant files.
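
As a pointer for the testing side, the shared suites typically hook in via mixins; a minimal, hypothetical skeleton for tests/models/audioflamingo3/ could look like this (class contents are placeholders):

# Hypothetical test skeleton mirroring other model test files such as Dia's.
import unittest

from transformers.testing_utils import require_torch

from ...test_modeling_common import ModelTesterMixin

@require_torch
class AudioFlamingo3ModelTest(ModelTesterMixin, unittest.TestCase):
    # all_model_classes, a model tester, and setUp would be filled in here,
    # following the boilerplate of existing model test files.
    all_model_classes = ()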

Hope that helps and let me know if you have any questions!

@github-actions commented

[For maintainers] Suggested jobs to run (before merge)

run-slow: audioflamingo3, auto, voxtral

@lashahub commented

@ebezzam The tests were converting the batch to bf16 before generation, so I removed all those conversions; it's working fine on my end now. I also replaced one of the audio files, even though the previous one would have worked fine.

@ebezzam commented Nov 12, 2025

run-slow: audioflamingo3

@github-actions commented

This comment contains run-slow, running the specified jobs:

models: ["models/audioflamingo3"]
quantizations: []

@github-actions commented

CI Results

Workflow Run ⚙️

✅ No failing test specific to this PR 🎉 !

@ebezzam commented Nov 12, 2025

@lashahub we still need to keep the models in bf16 for the tests, otherwise they won't load properly (see here)
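
For context, keeping the weights in bf16 means loading with an explicit dtype, along these lines (the Auto class stands in for the exact AF3 model class here):

# Minimal sketch of loading the checkpoint in bf16; the exact AF3 model
# class is assumed via AutoModel for illustration.
import torch
from transformers import AutoModel

model = AutoModel.from_pretrained(
    "nvidia/audio-flamingo-3-hf",
    torch_dtype=torch.bfloat16,  # keep weights in bf16 instead of upcasting
)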

@ebezzam commented Nov 12, 2025

run-slow: audioflamingo3

@github-actions commented

[For maintainers] Suggested jobs to run (before merge)

run-slow: audioflamingo3, auto, voxtral

@github-actions commented

This comment contains run-slow, running the specified jobs:

models: ["models/audioflamingo3"]
quantizations: []

@ebezzam commented Nov 12, 2025

@lashahub could you also add this training snippet (8bf40fa) to the model page: https://huggingface.co/nvidia/audio-flamingo-3-hf

@github-actions commented

CI Results

Workflow Run ⚙️

✅ No failing test specific to this PR 🎉 !

@ArthurZucker left a comment

Kudos everyone, and thanks @eustlb and @ebezzam. Very clean 😉

@ebezzam commented Nov 12, 2025

run-slow: audioflamingo3

@github-actions commented

[For maintainers] Suggested jobs to run (before merge)

run-slow: audioflamingo3, auto, voxtral

@github-actions commented

This comment contains run-slow, running the specified jobs:

models: ["models/audioflamingo3"]
quantizations: []

@github-actions commented

CI Results

Workflow Run ⚙️

✅ No failing test specific to this PR 🎉 !

@ebezzam self-requested a review November 12, 2025 13:59
@github-actions commented

[For maintainers] Suggested jobs to run (before merge)

run-slow: audioflamingo3, auto, voxtral

@ebezzam left a comment

Thanks @lashahub and @Sreyan88 for the great work, and @eustlb and @ArthurZucker for the feedback 🤗

Merging!

@ebezzam enabled auto-merge (squash) November 12, 2025 14:06
@github-actions commented

[For maintainers] Suggested jobs to run (before merge)

run-slow: audioflamingo3, auto, voxtral

@HuggingFaceDocBuilderDev

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

@ebezzam disabled auto-merge November 12, 2025 14:18
@ydshieh merged commit 1709ed9 into huggingface:main Nov 12, 2025
21 of 23 checks passed