[Question] Got bad performance after pretraining and finetuning LLaVA-1.5-7B with clip-vit-large-patch14 (224px resolution) #1899

@ThisisBillhe

Description

Question

Hi,

I ran full LLaVA-1.5-7B experiments with CLIP at 336px and 224px resolution. The results at 336px were fine, but the results at 224px were very poor.

Specifically, I got an MME score of 879 and a textvqa_val score of 10.45. The results of the 336px model are normal, so it does not seem to be a problem with my data or code.
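For reference, the two towers feed different numbers of visual tokens to the projector and LLM, which is one concrete difference between the setups. A quick sanity check, assuming the standard ViT-L/14 patch size of 14 and no token dropping:

```python
def num_patch_tokens(image_size: int, patch_size: int = 14) -> int:
    """Patch tokens produced by a CLIP ViT for a square input image."""
    side = image_size // patch_size
    return side * side

print(num_patch_tokens(336))  # 576 tokens at 336px
print(num_patch_tokens(224))  # 256 tokens at 224px
```

So the 224px model trains the projector on less than half the visual tokens; the drop alone should not collapse scores this badly, but it is worth confirming that the projector and training config actually match the 224px tower.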

Does anyone have ideas or similar results?
