Hi,
Thank you for your great work! I have some questions about the head layers:
As the figure in the paper shows, there are "shared" LLM embeddings in the first stage. However, in the released code I found that an lm_head is re-defined for the LLM, while an audio_feature_head is defined for the audio part, and they seem to have no relation beyond having the same shape.
Given that, how can we guarantee that the CTC loss helps the audio modality align with the LLM's text modality? In my view, the audio input should be aligned to the LLM's text *input* embedding space rather than its *output* embedding space, so that the LLM can more easily "recognize" the audio input to some extent.
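To illustrate the concern: if the two heads are truly shared (tied), CTC logits from the audio branch live in exactly the same space as the LLM's text logits; if they are merely same-shaped, nothing forces that. A minimal PyTorch sketch (all names here are illustrative placeholders, not taken from the released code):

```python
import torch
import torch.nn as nn

vocab_size, hidden = 100, 32  # toy sizes for illustration

# Two separately initialized heads: same shape, no shared parameters.
lm_head = nn.Linear(hidden, vocab_size, bias=False)
audio_feature_head = nn.Linear(hidden, vocab_size, bias=False)
assert lm_head.weight.shape == audio_feature_head.weight.shape

x = torch.randn(4, hidden)
# Untied heads generally produce different logits for the same input.
assert not torch.equal(lm_head(x), audio_feature_head(x))

# Weight tying: now both heads are literally the same parameter, so a
# CTC loss on the audio head directly shapes the LLM's output space.
audio_feature_head.weight = lm_head.weight
assert torch.equal(lm_head(x), audio_feature_head(x))
```

So the question is whether the intended design is tying (as the figure suggests) or two independent projections (as the code appears to do).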
I'd love to hear your thoughts!