image_size vs window_spec #225
Hi Ross, For the L model, the window spec results in ~30% more tokens per window, so attention could become costlier, but it does result in some (small) improvement in accuracy; if a user is using the L model, they are presumably more interested in accuracy, so the window sizes are chosen to be a bit larger. In practice, depending on the hardware and kernels used, differences in speed and memory may end up being quite small.
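As a quick arithmetic check of the ~30% figure: assuming the L model uses 16x16 windows where the smaller models use 14x14 (window sizes inferred from this thread, not from the repo configs), the per-window token ratio works out to roughly 1.31x:

```python
# Token count per window scales with the square of the window side.
# 16 and 14 are the assumed stage-3 window sizes for L vs the smaller models.
ratio = (16 * 16) / (14 * 14)
print(f"{ratio:.3f}")  # 1.306, i.e. ~30% more tokens per window
```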
Hi @chayryali, I think what @rwightman was getting at is that for a 1024px input, the 4 stages of the image encoder will have feature maps with side lengths of 256, 128, 64, 32 and the last two window sizes of the non-large models don't tile into this cleanly. This makes extra padding steps necessary for the non-large models, so (slight) changes to the window size could lead to improvements. For example, here are some timings for different window sizes (using the v2-base+ model):
Incidentally, SAMv1 also used non-tiling sizes and can even benefit from modifying them (see v1 PR #594). Though from my testing, the v2 models are very sensitive to the stage 2 & 3 sizes (based on a visualization script, for anyone interested), so it seems additional training would be needed to support updated sizes.
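The padding overhead described above can be estimated with a short sketch. The window specs below are assumptions based on this thread (last two stages of the non-large models use 14/7, the large model 16/8), not values read from the repo configs, and the 4x patch stride plus 2x per-stage downsampling follows the 256/128/64/32 feature sides mentioned above:

```python
import math

# Assumed per-stage window sizes (from this thread, not the repo configs).
WINDOW_SPECS = {
    "hiera_b+": (8, 4, 14, 7),  # last two stages don't tile into 64/32
    "hiera_l": (8, 4, 16, 8),   # tiles cleanly into 64/32
}

def padding_overhead(image_size, window_spec, patch_stride=4):
    """For each stage, report the feature-map side, the padded side needed
    for windowed attention, and the fraction of extra (padding) tokens."""
    side = image_size // patch_stride  # stage-1 feature side, e.g. 1024/4 = 256
    rows = []
    for win in window_spec:
        padded = math.ceil(side / win) * win       # round up to a whole window
        overhead = (padded / side) ** 2 - 1.0      # extra tokens vs. unpadded
        rows.append((side, win, padded, overhead))
        side //= 2  # each stage downsamples by 2
    return rows

for side, win, padded, ovh in padding_overhead(1024, WINDOW_SPECS["hiera_b+"]):
    print(f"feat {side:3d} | win {win:2d} | padded {padded:3d} | +{ovh:5.1%} tokens")
```

Under these assumptions, the last two stages of the b+ spec pad 64 -> 70 and 32 -> 35, roughly +20% tokens each, while the large spec pads nothing at 1024px.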
@chayryali what @heyoeyo said was exactly what I was getting at.
The default image_size appears to be 1024 for all models, but the window specs for T, S & B+ appear to be optimal for image sizes that are multiples of 224 (i.e. 896 or 1120 would be better); only the L will work with multiples of 256 without extra padding overhead for the windowing...
Curious if there was a reason for this?
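As a quick sanity check of the "multiples of 224" claim, here is a minimal sketch testing whether every stage's feature map tiles cleanly into its window. The window spec `(8, 4, 14, 7)` for the smaller models and the 4x patch stride are assumptions based on this thread:

```python
# Assumed: 4x patch stride, 2x downsampling per stage, and a
# (8, 4, 14, 7) window spec for the T/S/B+ models (from this thread).
def tiles_cleanly(image_size, window_spec, patch_stride=4):
    """True if every stage's feature side is an exact multiple of its window."""
    side = image_size // patch_stride
    for win in window_spec:
        if side % win:
            return False
        side //= 2
    return True

for size in (896, 1024, 1120):
    print(size, tiles_cleanly(size, (8, 4, 14, 7)))
# 896 and 1120 (multiples of 224) tile cleanly; the default 1024 does not.
```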