Description
I want to feed the model a screenshot of a webpage/app and have it detect UI components in the image, such as icons, textboxes, etc., but it almost never succeeds. The only case that works is prompt="icon", where the model detects a reasonable number of icons/images. With any other prompt, almost nothing is detected. I'd like to know whether the model simply hasn't been trained on this kind of data, or whether my prompts are poorly written. One case stands out: with prompt="icon. textbox", adding just one more word (textbox) causes almost nothing to be detected, even though "icon" alone works.
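
A minimal sketch of the kind of zero-shot detection call described above, assuming the Hugging Face Grounding DINO interface purely for illustration (the checkpoint name, image path, and thresholds are placeholders, not necessarily what was actually run):

```python
import torch
from PIL import Image
from transformers import AutoProcessor, AutoModelForZeroShotObjectDetection

# Placeholder checkpoint and image path for illustration only.
checkpoint = "IDEA-Research/grounding-dino-tiny"
image = Image.open("screenshot.png").convert("RGB")

processor = AutoProcessor.from_pretrained(checkpoint)
model = AutoModelForZeroShotObjectDetection.from_pretrained(checkpoint)

# Period-separated class prompt, e.g. "icon." alone vs. "icon. textbox."
text_prompt = "icon. textbox."

inputs = processor(images=image, text=text_prompt, return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

# Thresholds are guesses; lowering them may surface more boxes.
results = processor.post_process_grounded_object_detection(
    outputs,
    inputs.input_ids,
    box_threshold=0.35,
    text_threshold=0.25,
    target_sizes=[image.size[::-1]],
)

for box, score, label in zip(results[0]["boxes"], results[0]["scores"], results[0]["labels"]):
    print(label, round(score.item(), 3), [round(v) for v in box.tolist()])
```

With prompt "icon." alone some boxes come back, but with "icon. textbox." almost nothing does.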