Conversation

@aireenmei aireenmei commented Jan 15, 2026

Description

Original author @eitanporat in #2726

Revised according to the original PR comments:

  • rename flags for consistency and clean up their usage
  • make AudioEncoder an NNX module
  • remove SinusoidsPositionEmbedding, replacing it with an expanded version of the existing PositionalEmbedding module
  • remove Qwen3OmniAudioModel from qwen3.py, since the AudioEncoder class in decoder.py is used instead
  • add precision-related flags
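The sinusoidal positional embeddings mentioned in the bullet above follow the standard transformer recipe. As a hedged illustration only, here is a minimal NumPy sketch of that recipe; it is not the actual MaxText PositionalEmbedding module, and the function name and signature are assumptions:

```python
import numpy as np

def sinusoidal_position_embedding(num_positions: int, features: int) -> np.ndarray:
    """Classic transformer sinusoids: even feature indices get sin, odd get cos.

    Assumes `features` is even. Returns an array of shape (num_positions, features).
    """
    positions = np.arange(num_positions)[:, None]  # (P, 1)
    # Geometric progression of inverse frequencies, one per sin/cos pair.
    div_term = np.exp(np.arange(0, features, 2) * (-np.log(10000.0) / features))  # (F/2,)
    emb = np.zeros((num_positions, features))
    emb[:, 0::2] = np.sin(positions * div_term)
    emb[:, 1::2] = np.cos(positions * div_term)
    return emb
```

At position 0 this gives sin(0) = 0 in the even slots and cos(0) = 1 in the odd slots, which is a quick sanity check for any reimplementation.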

Tests

All tests in tests/check_qwen3_omni_audio_vs_reference.py pass.

Checklist

Before submitting this PR, please make sure (put X in square brackets):

  • I have performed a self-review of my code. For an optional AI review, add the gemini-review label.
  • I have necessary comments in my code, particularly in hard-to-understand areas.
  • I have run end-to-end tests and provided workload links above if applicable.
  • I have made or will make corresponding changes to the doc if needed, including adding new documentation pages to the relevant Table of Contents (toctree directive) as explained in our documentation.

codecov bot commented Jan 15, 2026

@aireenmei force-pushed the aireen/qwen-audio branch 2 times, most recently from 580f4a2 to 875e674 on January 15, 2026 at 02:54
return embeddings


class AudioEncoder(nnx.Module):
Collaborator:
Thanks for implementing it as NNX!


if audio_embeddings is not None and cfg.use_audio:
if cfg.model_name in ["qwen3-omni-30b-a3b"]:
y = multimodal_utils.merge_mm_embeddings(
Collaborator:
Does it support "video/audio interleaving" at the moment? Or it's in following PRs?

image_embeddings=image_embeddings,
image_masks=encoder_image_masks,
audio_embeddings=audio_embeddings,
audio_masks=audio_masks,
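The merge step in this snippet presumably scatters the modality embeddings into placeholder positions of the token-embedding sequence, selected by the masks. A minimal NumPy sketch of that masking pattern follows; the signature is simplified and assumed, not the actual multimodal_utils.merge_mm_embeddings API:

```python
import numpy as np

def merge_mm_embeddings(text_emb: np.ndarray, mm_emb: np.ndarray, mm_mask: np.ndarray) -> np.ndarray:
    """Place mm_emb rows at the positions where mm_mask is True.

    text_emb: (seq_len, d)  token embeddings, with placeholder tokens at mm positions
    mm_emb:   (n_mm, d)     audio/image embeddings, in sequence order
    mm_mask:  (seq_len,)    boolean; True at placeholder positions, sums to n_mm
    """
    assert mm_mask.sum() == mm_emb.shape[0], "mask count must match embedding rows"
    merged = text_emb.copy()       # leave the input untouched
    merged[mm_mask] = mm_emb       # boolean-mask scatter, preserving order
    return merged
```

A consequence of this pattern is that interleaving modalities reduces to how the placeholder positions are laid out in the mask, which may be relevant to the interleaving question above.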
Collaborator:
I feel like we can consolidate all multimodal inputs into a class and pass this single variable to the model. But that can be a follow up later.

Collaborator:
+1 to this!
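The consolidation floated in this thread could be a single container object passed to the model in place of the four separate arguments. A hedged sketch, with class and field names that are illustrative and not an existing MaxText API:

```python
from dataclasses import dataclass
from typing import Optional
import numpy as np

@dataclass
class MultimodalInputs:
    """Bundles per-modality embeddings and masks into one model argument."""
    image_embeddings: Optional[np.ndarray] = None
    image_masks: Optional[np.ndarray] = None
    audio_embeddings: Optional[np.ndarray] = None
    audio_masks: Optional[np.ndarray] = None

    def has_audio(self) -> bool:
        return self.audio_embeddings is not None

    def has_image(self) -> bool:
        return self.image_embeddings is not None
```

Adding a new modality then means adding fields here rather than threading new keyword arguments through every call site.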

Comment on lines +1958 to +1960
config: Config
proj1: DenseGeneral
proj2: DenseGeneral
Collaborator:
Are these three lines necessary?

self,
hidden_states: Array,
deterministic: bool = False,
):
Collaborator:
Can we also add an input/output shape comment here?
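An input/output shape comment of the requested kind could look like the following NumPy stand-in. The two-projection shapes are assumptions based on the proj1/proj2 fields shown earlier, not the actual DenseGeneral code:

```python
import numpy as np

def mlp_projection(hidden_states: np.ndarray, w1: np.ndarray, w2: np.ndarray) -> np.ndarray:
    """Two-layer projection, annotated with shapes at each step.

    hidden_states: (batch, seq_len, d_model)
    w1:            (d_model, d_ff)
    w2:            (d_ff, d_out)
    returns:       (batch, seq_len, d_out)
    """
    x = hidden_states @ w1   # (batch, seq_len, d_ff)
    x = np.maximum(x, 0.0)   # nonlinearity (ReLU here for simplicity)
    return x @ w2            # (batch, seq_len, d_out)
```

Per-step shape comments like these make it easy to spot where a sequence or feature dimension changes without reading the weight definitions.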

Collaborator:
Would you prefer to create this standalone check, or merge it with https://github.com/AI-Hypercomputer/maxtext/blob/main/tests/check_qwen3_embedding_vs_reference.py?

@hengtaoguo (Collaborator) commented:

Thank you for the great work!

@NicoGrande (Collaborator) left a comment:

LGTM



4 participants