Question
If I replaced the vision tower with a smaller one (for instance, swapping a CLIP model that outputs 577 visual tokens for one that outputs 200), would it be enough to retrain only the projector, or would the LLM also need to be fine-tuned?
Would the fact that the LLM was originally trained on sequences of 577 visual tokens affect its inference capability (e.g., through mismatched attention projections)?
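For concreteness, here is a minimal sketch of the setup I mean (PyTorch; the LLaVA-style two-layer MLP projector and all hidden dimensions are illustrative assumptions, not taken from a specific checkpoint):

```python
import torch
import torch.nn as nn

class Projector(nn.Module):
    """Maps vision-tower output features into the LLM embedding space."""
    def __init__(self, vision_dim: int, llm_dim: int):
        super().__init__()
        # LLaVA-style two-layer MLP projector (illustrative)
        self.mlp = nn.Sequential(
            nn.Linear(vision_dim, llm_dim),
            nn.GELU(),
            nn.Linear(llm_dim, llm_dim),
        )

    def forward(self, feats: torch.Tensor) -> torch.Tensor:
        # feats: (batch, num_tokens, vision_dim); num_tokens is 577 for
        # the original tower, 200 for the hypothetical smaller one
        return self.mlp(feats)

# Hypothetical shapes: old tower emits 577 tokens of width 1024,
# new tower emits 200 tokens of width 768, LLM hidden size is 4096.
old_feats = torch.randn(1, 577, 1024)
new_feats = torch.randn(1, 200, 768)

projector = Projector(vision_dim=768, llm_dim=4096)
llm_inputs = projector(new_feats)  # (1, 200, 4096): fewer visual tokens,
                                   # but each token matches the LLM width

# During retraining, freeze everything except the projector, e.g.:
# for p in llm.parameters(): p.requires_grad_(False)
# for p in vision_tower.parameters(): p.requires_grad_(False)
```

In this sketch only the projector's weights would be retrained; the question is whether that is sufficient, or whether the change in visual sequence length (577 → 200) also forces fine-tuning the LLM.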