Question about embeddings #3

Open
@xxchauncey

Hi,

Thank you for your great work! I have some questions about the head layers:

Image
As the figure from the paper shows, there is a "shared" LLM embedding in the first stage. However, in the code you released, I found that an lm_head is re-defined for the LLM while a separate audio_feature_head is defined for the audio part. The two seem to be unrelated apart from having the same shape.
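
One quick way to check whether two heads are actually shared, rather than merely the same shape, is to compare their parameter storage. The sketch below is not the repository's code: the sizes and the two `nn.Linear` layers are hypothetical stand-ins for `lm_head` and `audio_feature_head`.

```python
import torch
import torch.nn as nn

# Hypothetical sizes, chosen only for illustration.
hidden_size, vocab_size = 16, 32

# Stand-ins for the two heads discussed above (not the repository's modules).
lm_head = nn.Linear(hidden_size, vocab_size, bias=False)
audio_feature_head = nn.Linear(hidden_size, vocab_size, bias=False)


def is_tied(a: nn.Linear, b: nn.Linear) -> bool:
    """Return True if both layers use the same underlying weight storage."""
    return a.weight.data_ptr() == b.weight.data_ptr()


same_shape = lm_head.weight.shape == audio_feature_head.weight.shape
tied_before = is_tied(lm_head, audio_feature_head)

# Explicitly tying the weights makes the two heads share one parameter.
audio_feature_head.weight = lm_head.weight
tied_after = is_tied(lm_head, audio_feature_head)

print(same_shape, tied_before, tied_after)
```

Two independently constructed layers match in shape but not in storage; only after the explicit assignment do they share a single parameter.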

I wonder how we can guarantee that the CTC loss helps the audio modality align with the LLM's text modality. In my view, the audio input should be aligned to the LLM's input text embedding space rather than its output embedding space, so that the LLM can more easily "recognize" the audio input to some extent.
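
To make this suggestion concrete, here is a minimal sketch of a CTC head whose weight is tied to the LLM's input embedding table, so that the CTC logits are dot products between audio features and input token embeddings. It illustrates the idea in this question, not the authors' method. The name `embed_tokens`, the sizes, the random `audio_features`, and the use of token id 0 as the CTC blank are all assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)

# Hypothetical sizes, chosen only for illustration.
vocab_size, hidden_size = 32, 16
batch, frames, target_len = 2, 20, 5

# Stand-in for the LLM's input embedding table (assumption, not the repo's module).
embed_tokens = nn.Embedding(vocab_size, hidden_size)

# Stand-in for audio encoder outputs already projected to the LLM hidden size.
audio_features = torch.randn(batch, frames, hidden_size, requires_grad=True)

# Tied head: logits[b, t, v] = <audio_features[b, t], embed_tokens.weight[v]>.
logits = F.linear(audio_features, embed_tokens.weight)  # (batch, frames, vocab)

# F.ctc_loss expects log-probabilities shaped (frames, batch, vocab).
log_probs = logits.log_softmax(dim=-1).transpose(0, 1)

# Random targets that avoid index 0, which is assumed to be the CTC blank here.
targets = torch.randint(1, vocab_size, (batch, target_len))
input_lengths = torch.full((batch,), frames, dtype=torch.long)
target_lengths = torch.full((batch,), target_len, dtype=torch.long)

loss = F.ctc_loss(log_probs, targets, input_lengths, target_lengths, blank=0)
loss.backward()

print(logits.shape, loss.item())
```

With this tying, lowering the CTC loss pushes each audio frame toward the input embedding of its aligned token. Note that the gradient also reaches `embed_tokens.weight`, so one would probably freeze or detach the embedding table if only the audio features are meant to move.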

Hope to know your thoughts!
