
image_size vs window_spec #225

Open
rwightman opened this issue Aug 16, 2024 · 3 comments
rwightman commented Aug 16, 2024

The default image_size appears to be 1024 for all models, but the window specs for T, S & B+ appear to be optimal for image sizes that are multiples of 224 (i.e. 896 or 1120 would be better); only L will work with multiples of 256 without extra padding overhead for the windowing.

Curious if there was a reason for this?

@chayryali chayryali self-assigned this Aug 17, 2024
chayryali (Contributor) commented:

Hi Ross, for the L model the window spec results in ~30% more tokens per window, so attention could become costlier, but it does yield some (small) improvement in accuracy. If a user is choosing the L model, they are presumably more interested in accuracy, so the window sizes are chosen to be a bit larger. In practice, depending on the hardware and kernels used, the differences in speed and memory may end up being quite small.
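For what it's worth, the ~30% figure is consistent with the L model using 16×16 windows in the stage where B+ uses 14×14 (an assumption inferred from the published window specs, not stated in this thread):

```python
# Rough sanity check of the "~30% more tokens per window" figure.
# Assumption: L uses a 16x16 window where B+ uses 14x14.
tokens_l = 16 * 16   # 256 tokens per window
tokens_b = 14 * 14   # 196 tokens per window
print(f"{tokens_l / tokens_b:.2f}x tokens per window")  # ~1.31x, i.e. ~30% more
```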


heyoeyo commented Sep 2, 2024

Hi @chayryali, I think what @rwightman was getting at is that for a 1024px input, the 4 stages of the image encoder will have feature maps with side lengths of 256, 128, 64, 32 and the last two window sizes of the non-large models don't tile into this cleanly. This makes extra padding steps necessary for the non-large models, so (slight) changes to the window size could lead to improvements.
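That padding overhead can be sketched with a small helper (illustrative only, not the repo's API; it assumes a 4px patch stride and 2× spatial downsampling per stage, giving the 256/128/64/32 side lengths above):

```python
import math

def window_padding(image_size=1024, window_sizes=(8, 4, 14, 7), patch_stride=4):
    """Per stage: (feature-map side length, window size, padding needed per side).

    Hypothetical helper assuming feature-map sides of image_size/4, /8, /16, /32.
    Default window_sizes are the v2-base+ spec from this thread.
    """
    out = []
    for stage, win in enumerate(window_sizes):
        side = image_size // patch_stride // (2 ** stage)
        padded = math.ceil(side / win) * win  # round up to a multiple of win
        out.append((side, win, padded - side))
    return out

# At 1024px, the 14 and 7 windows don't tile the 64 and 32 maps cleanly:
for side, win, pad in window_padding():
    print(f"{side}x{side} feature map, window {win}: {pad}px padding")

# At 896px (a multiple of 224), all four stages tile with zero padding:
assert all(pad == 0 for _, _, pad in window_padding(image_size=896))
```

This is also why multiples of 224 were suggested as a better fit for the T/S/B+ specs earlier in the thread.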

For example, here are some timings for different window sizes (using the v2-base+ model):

| Window Sizes | Encoder Time (ms) |
| --- | --- |
| 8, 4, 14, 7 | 21.4 (original) |
| 8, 4, 16, 8 | 20.3 |
| 8, 8, 8, 8 | 19.6 |
| 16, 16, 16, 16 | 19.5 (interesting!) |
| 12, 12, 12, 12 | 21.8 |
| 2, 2, 2, 2 | 37.9 |
Incidentally, SAMv1 also used non-tiling sizes and can even benefit from modifying it (see v1 PR #594). Though from my testing, the v2 models are very sensitive to the stage 2 & 3 sizes (based on a visualization script, for anyone interested), so it seems additional training would be needed to support updated sizes.


rwightman commented Sep 2, 2024

@chayryali what @heyoeyo said is exactly what I was getting at.
