Question
If I replaced the vision tower with a smaller one (for instance, swapping a CLIP model that outputs 577 visual tokens for one that outputs 200), would it be enough to retrain only the projector, or would the LLM also need to be fine-tuned?
Would the fact that the LLM was originally trained on sequences of 577 visual tokens affect its inference capability (e.g., through mismatched attention projections)?
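For concreteness, here is a minimal sketch of the setup I mean (PyTorch; the LLaVA-style two-layer MLP projector and all hidden dimensions are illustrative assumptions, not taken from a specific checkpoint):

```python
import torch
import torch.nn as nn

class Projector(nn.Module):
    """Maps vision-tower output features into the LLM embedding space."""
    def __init__(self, vision_dim: int, llm_dim: int):
        super().__init__()
        # LLaVA-style two-layer MLP projector (illustrative)
        self.mlp = nn.Sequential(
            nn.Linear(vision_dim, llm_dim),
            nn.GELU(),
            nn.Linear(llm_dim, llm_dim),
        )

    def forward(self, feats: torch.Tensor) -> torch.Tensor:
        # feats: (batch, num_tokens, vision_dim); num_tokens is 577 for
        # the original tower, 200 for the hypothetical smaller one
        return self.mlp(feats)

# Hypothetical shapes: old tower emits 577 tokens of width 1024,
# new tower emits 200 tokens of width 768, LLM hidden size is 4096.
old_feats = torch.randn(1, 577, 1024)
new_feats = torch.randn(1, 200, 768)

projector = Projector(vision_dim=768, llm_dim=4096)
llm_inputs = projector(new_feats)  # (1, 200, 4096): fewer visual tokens,
                                   # but each token matches the LLM width

# During retraining, freeze everything except the projector, e.g.:
# for p in llm.parameters(): p.requires_grad_(False)
# for p in vision_tower.parameters(): p.requires_grad_(False)
```

In this sketch only the projector's weights would be retrained; the question is whether that is sufficient, or whether the change in visual sequence length (577 → 200) also forces fine-tuning the LLM.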