[Question] Changing the vision tower. #1913

@sovietes

Question

If I replaced the vision tower with a smaller one (for instance, swapping a CLIP model that outputs 577 visual tokens per image for one that outputs 200), would it be enough to retrain only the projector, or would the LLM also need to be fine-tuned?
Would the fact that the LLM was originally trained on sequences of 577 visual tokens affect its inference capability (e.g., through mismatched attention projections)?
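
To make the question concrete, here is a minimal pure-Python sketch of the shapes involved. It assumes that 577 and 200 count tokens per image rather than the width of each feature vector, and that the projector is a per-token linear map. All names and sizes (`fake_vision_tower`, `make_linear`, `project`, the feature widths 16 and 8, `LLM_HIDDEN = 32`) are hypothetical toy values, not taken from any real model or library.

```python
import random

LLM_HIDDEN = 32  # toy stand-in for the LLM hidden size (assumption)


def fake_vision_tower(num_tokens, feat_dim, seed=1):
    """Hypothetical vision tower: returns num_tokens vectors of width feat_dim."""
    rng = random.Random(seed)
    return [[rng.gauss(0, 1) for _ in range(feat_dim)] for _ in range(num_tokens)]


def make_linear(in_dim, out_dim, seed=0):
    """Weight matrix of shape (out_dim, in_dim); stands in for the projector."""
    rng = random.Random(seed)
    return [[rng.uniform(-0.1, 0.1) for _ in range(in_dim)] for _ in range(out_dim)]


def project(tokens, weight):
    """Apply the same linear map to every token independently."""
    return [[sum(w * x for w, x in zip(row, tok)) for row in weight] for tok in tokens]


def shape(matrix):
    return (len(matrix), len(matrix[0]))


# Original tower: 577 tokens of toy width 16. Smaller tower: 200 tokens of toy width 8.
old_feats = fake_vision_tower(577, 16)
new_feats = fake_vision_tower(200, 8)

# The projector's weight shape depends on the per-token width, not on the token count.
old_proj = make_linear(16, LLM_HIDDEN)
new_proj = make_linear(8, LLM_HIDDEN)

old_out = project(old_feats, old_proj)
new_out = project(new_feats, new_proj)

print(shape(old_out))  # (577, 32)
print(shape(new_out))  # (200, 32)
```

Under these assumptions, both towers hand the LLM vectors of the same width, and only the number of visual tokens in the sequence changes. The sketch only illustrates shapes; it does not show whether output quality holds up without also fine-tuning the LLM.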
