Hi there 👋
We are a community of CV engineers and we have been reading *Visual Prompting via Image Inpainting*.

We would like to ask a couple of questions:

1. Why did you create the dataset in that way? It does not resemble the final input, and something much simpler could have been built by taking standard CV segmentation datasets and composing the grid images directly.
2. The figure in the paper shows training, but it is unclear how inference works. Could you give us some pseudocode, assuming we take as input $x$ (the image) and $m$ (the mask of the region to fill)? We include our current guess in the sketch after these questions.
3. How did you find the right $z_i$ for each "patch" token coming from the MAE?
4. Could you give us the intuition for why you train this way instead of directly predicting the patch tokens for the missing parts?
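To make question 2 concrete, here is a minimal sketch of how we currently imagine inference working, assuming the MAE predicts, for every patch position, a distribution over a pretrained VQGAN codebook. All names here (`model`, `vqgan`, `encode_to_indices`, `codebook`, `decode`) are our own hypothetical placeholders, not your actual API; please correct us where the sketch diverges from the real pipeline.

```python
import torch

@torch.no_grad()
def inpaint(model, vqgan, x, m, patch_size=16):
    """Fill the masked patches of a composed grid image.

    x: (1, 3, H, W) grid image (examples + query, composed as in the paper)
    m: (num_patches,) bool tensor, True where a patch must be filled
    model: hypothetical MAE returning per-patch logits over the VQGAN codebook
    vqgan: hypothetical frozen VQGAN exposing encode_to_indices(),
           codebook (an nn.Embedding), and decode()
    """
    _, _, H, W = x.shape
    h, w = H // patch_size, W // patch_size      # latent grid size

    # 1. The MAE attends to the visible patches only and predicts, for
    #    every patch position, a distribution over the VQGAN codebook.
    logits = model(x, mask=m)                    # (1, h*w, codebook_size)
    pred_idx = logits.argmax(dim=-1)             # (1, h*w): predicted z_i

    # 2. For the visible patches we keep the ground-truth codes from the
    #    frozen VQGAN encoder, trusting the prediction only under the mask.
    gt_idx = vqgan.encode_to_indices(x)          # (1, h*w)
    idx = torch.where(m.unsqueeze(0), pred_idx, gt_idx)

    # 3. Look up codebook embeddings and decode the code grid to pixels.
    z = vqgan.codebook(idx)                      # (1, h*w, embed_dim)
    z = z.transpose(1, 2).reshape(1, -1, h, w)   # (1, embed_dim, h, w)
    return vqgan.decode(z)                       # (1, 3, H, W)
```

If that is roughly right, our guess for question 3 is that during training the target for patch $i$ is simply $z_i = \arg\min_k \lVert E(x)_i - e_k \rVert$, the index of the codebook entry $e_k$ nearest to the frozen VQGAN encoder feature of that patch. Is that correct, or is there a subtlety we are missing?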
Thank you
Cheers,
Fra