[Question] Got bad performance after pretraining and finetuning LLaVA-1.5-7B with clip-vit-large-patch14 (224px resolution) #1899

### Question

Hi,

I ran the full LLaVA-1.5-7B pipeline (pretraining and finetuning) with CLIP at 336px and 224px resolution. The results at 336px were OK, but the results at 224px were very poor.

Specifically, the 224px model got an MME score of 879 and a textvqa_val score of 10.45. The results of the 336px model are normal, so the problem does not seem to be in my data or code.
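For context, one obvious difference between the two runs is how many visual tokens the LLM sees per image. Here is a rough sketch of that arithmetic, assuming a patch size of 14 (as the name clip-vit-large-patch14 suggests) and assuming only the patch tokens are passed on, without the CLS token. The helper name `num_patch_tokens` is my own, not something from the LLaVA code.

```python
def num_patch_tokens(image_size: int, patch_size: int = 14) -> int:
    """Number of ViT patch tokens for a square image (CLS token not counted)."""
    if image_size % patch_size != 0:
        raise ValueError("image_size must be a multiple of patch_size")
    patches_per_side = image_size // patch_size
    return patches_per_side * patches_per_side


for resolution in (224, 336):
    # 224 -> 16 x 16 patches, 336 -> 24 x 24 patches
    print(resolution, num_patch_tokens(resolution))
```

So the 224px model works with 256 visual tokens per image versus 576 at 336px. I would expect that to cost some accuracy on text-heavy benchmarks, but I am not sure it explains a drop this large.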

Does anyone have ideas, or has anyone seen similar results?
