Question
Hi,
I ran a full llava-1.5-7b experiment with CLIP at 336px and 224px resolution. The results at 336px were fine, but the results at 224px were very poor.
Specifically, the 224px model got an MME score of 879 and a textvqa_val score of 10.45. The results of the 336px model are normal, so the problem does not seem to be in my data or code.
Does anyone have ideas, or has anyone seen similar results?