Hi,
Thank you for your great work! I have some questions about the head layers:
As the figure in the paper shows, there are "shared" LLM embeddings in the first stage. However, in the released code I found that an lm_head is re-defined for the LLM, while an audio_feature_head is defined for the audio part, and they seem to have no relation beyond having the same shape.
Given that, how can we guarantee that the CTC loss helps the audio modality align with the LLM's text modality? In my view, the audio input should be aligned to the LLM's text *input* embedding space rather than its *output* embedding space, so that the LLM can more easily "recognize" the audio input to some extent.
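To illustrate the concern: if the two heads are truly shared (tied), CTC logits from the audio branch live in exactly the same space as the LLM's text logits; if they are merely same-shaped, nothing forces that. A minimal PyTorch sketch (all names here are illustrative placeholders, not taken from the released code):

```python
import torch
import torch.nn as nn

vocab_size, hidden = 100, 32  # toy sizes for illustration

# Two separately initialized heads: same shape, no shared parameters.
lm_head = nn.Linear(hidden, vocab_size, bias=False)
audio_feature_head = nn.Linear(hidden, vocab_size, bias=False)
assert lm_head.weight.shape == audio_feature_head.weight.shape

x = torch.randn(4, hidden)
# Untied heads generally produce different logits for the same input.
assert not torch.equal(lm_head(x), audio_feature_head(x))

# Weight tying: now both heads are literally the same parameter, so a
# CTC loss on the audio head directly shapes the LLM's output space.
audio_feature_head.weight = lm_head.weight
assert torch.equal(lm_head(x), audio_feature_head(x))
```

So the question is whether the intended design is tying (as the figure suggests) or two independent projections (as the code appears to do).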
I'd love to hear your thoughts!