[models] Add AudioFlamingo3 integration #40290
Conversation
ebezzam left a comment
Thanks @lashahub for your PR! This is a very exciting model to add to the Transformers library!
I see that you've taken inspiration from Llava, which makes sense as you combine modules of different modalities.
Most of my comments are about rearranging your modules to fit the Transformers conventions, which will make your new model more convenient for others to use and test. To that end, the Llava implementation and the Dia PR (another audio model) might serve as useful examples of which files get added/modified.
Below are my suggested steps.
1. Refactoring / reorganizing current files according to Transformers convention
You can take inspiration from the above models for refactoring your configuration, modeling and processing files, specifically:
- Consolidating your configurations. From what I see, you may need only one config like in Llava, or two like in Dia (`AudioFlamingo3Config` and `AudioFlamingo3EncoderConfig`).
- Processor: you can take inspiration from Dia to group your feature extractor, text tokenizer, and audio tokenizer into a single component. Loading pre-trained feature extractors and LLMs can be handled directly by the processor, without you (or the user) having to manually load the corresponding weights, as done below:
```python
from transformers import AutoTokenizer, WhisperFeatureExtractor

# Load components manually (what the processor should handle instead)
fe = WhisperFeatureExtractor.from_pretrained("openai/whisper-large-v3")
tok = AutoTokenizer.from_pretrained(
    llm_dir,  # path to the LLM checkpoint
    padding_side="right",
    use_fast=True,
    legacy=False,
)
```

To this end, you'll need to create a model conversion script (with something like this) so that the configuration files are generated and the processor knows where to pull the relevant models.
- The `modeling` file will shrink quite significantly, since the audio/text tokenizers will be applied in the processor rather than inside the model, and the configuration file will also handle pulling the relevant LLM config (see the sketch after this list).
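To make the intended end state concrete, here is a minimal sketch of what user-facing loading could look like once the processor bundles these components. It assumes the `nvidia/audio-flamingo-3-hf` checkpoint referenced later in this thread, and the processor argument names (`audio`, `sampling_rate`) are illustrative rather than the final API:

```python
import numpy as np
from transformers import AutoProcessor

# One call replaces the manual WhisperFeatureExtractor / AutoTokenizer loading
# above; the saved processor config records which components to pull.
processor = AutoProcessor.from_pretrained("nvidia/audio-flamingo-3-hf")

audio = np.zeros(16000, dtype=np.float32)  # placeholder: 1 s of silence at 16 kHz
inputs = processor(
    text="Transcribe the input speech.",
    audio=audio,  # argument name is illustrative, not the final API
    sampling_rate=16000,
    return_tensors="pt",
)
```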
2. Conversion script
This script is needed to convert your model weights and configuration into the format defined in the files above. It can also handle uploading the Transformers-compatible model and its configuration to the Hugging Face Hub. You can again take inspiration from Llava and Dia; a rough sketch follows.
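As an illustration of the shape such a script usually takes (the key-rename table is a hypothetical placeholder, not the actual AF3 checkpoint layout, and `AudioFlamingo3ForConditionalGeneration` is an assumed class name following the Llava convention):

```python
import torch
from transformers import AudioFlamingo3Config, AudioFlamingo3ForConditionalGeneration

# Placeholder renames: the real mapping depends on the original checkpoint's key names.
ORIGINAL_TO_HF = {
    "audio_tower.": "audio_encoder.",
    "llm.": "language_model.",
}


def convert_checkpoint(original_path: str, output_dir: str, push_to_hub: bool = False):
    state_dict = torch.load(original_path, map_location="cpu")

    # Rename each weight key to the Transformers layout.
    converted = {}
    for key, value in state_dict.items():
        for old, new in ORIGINAL_TO_HF.items():
            if key.startswith(old):
                key = new + key[len(old):]
                break
        converted[key] = value

    config = AudioFlamingo3Config()  # fill in the original hyperparameters here
    model = AudioFlamingo3ForConditionalGeneration(config)
    model.load_state_dict(converted, strict=True)

    # save_pretrained writes both weights and config; push_to_hub uploads them.
    model.save_pretrained(output_dir)
    if push_to_hub:
        model.push_to_hub("nvidia/audio-flamingo-3-hf")
```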
3. Testing, documentation, etc.
Once your model implementation is consistent with other models in Transformers, there's a lot of boilerplate code we can reuse to make the model convenient to use and to apply the various test suites. For example, you can look at the docs, src/transformers/models/auto, and tests/models/dia folders of the Dia PR to see how to prepare/modify the relevant files; a sketch of the auto-class registration follows.
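For instance, the src/transformers/models/auto part usually comes down to adding one-line entries to the existing mapping tables (a sketch; the display name "AudioFlamingo3" is an assumption, and which task-specific model mapping to extend depends on the model type):

```python
from collections import OrderedDict

# src/transformers/models/auto/configuration_auto.py (sketch): the new model is
# registered by adding entries to the existing alphabetically ordered mappings.
CONFIG_MAPPING_NAMES = OrderedDict(
    [
        # ...
        ("audioflamingo3", "AudioFlamingo3Config"),
        # ...
    ]
)

MODEL_NAMES_MAPPING = OrderedDict(
    [
        # ...
        ("audioflamingo3", "AudioFlamingo3"),  # display name is an assumption
        # ...
    ]
)
```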
Hope that helps and let me know if you have any questions!
[Eight resolved review threads, now marked outdated: three on src/transformers/models/audioflamingo3/configuration_audioflamingo3.py and five on src/transformers/models/audioflamingo3/modeling_audioflamingo3.py.]
[For maintainers] Suggested jobs to run (before merge): run-slow: audioflamingo3, auto, voxtral

@ebezzam The tests were converting the batch to […]

run-slow: audioflamingo3

This comment contains models: ["models/audioflamingo3"]

CI Results: ✅ No failing test specific to this PR 🎉
@lashahub could you also add this training snippet (8bf40fa) to the model page: https://huggingface.co/nvidia/audio-flamingo-3-hf
ArthurZucker left a comment
ebezzam left a comment
Thanks @lashahub and @Sreyan88 for the great work, and @eustlb and @ArthurZucker for the feedback 🤗
Merging!
The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.
This PR adds support for AudioFlamingo3 (AF3) — NVIDIA’s open large audio language model capable of reasoning over speech, sounds, and music.
It introduces the following components:
- `AudioFlamingo3` model class
- `AudioFlamingo3Processor` for preprocessing text + audio

With this integration, AF3 can be loaded directly from the Hugging Face Hub:
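For example (a minimal sketch: `AudioFlamingo3ForConditionalGeneration` is assumed as the model class name, following the Llava-style convention discussed in the review above):

```python
import torch
from transformers import AutoProcessor, AudioFlamingo3ForConditionalGeneration

model_id = "nvidia/audio-flamingo-3-hf"  # checkpoint referenced earlier in the thread

processor = AutoProcessor.from_pretrained(model_id)
model = AudioFlamingo3ForConditionalGeneration.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,  # assumed dtype choice; adjust to your hardware
    device_map="auto",
)
```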