Replies: 4 comments 4 replies
-
@liu09114 that was the implementation in the original EfficientNet paper https://github.com/tensorflow/tpu/blob/298d1fa98638f302ab9df34d9d26bbded7220e8b/models/official/efficientnet/utils.py#L276 The goal was to reproduce that, so I kept it. It has no impact on inference, but you could argue that for training it better preserves the batch stats. I recall that for the NFNet paper/impl they don't scale by default, https://github.com/deepmind/deepmind-research/blob/master/nfnets/base.py#L202 ... I was planning to try this side-by-side at some point but haven't had a chance.
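For context, here is a minimal sketch of the two variants being compared: scaling by 1/keep_prob as in the EfficientNet reference code vs. no scaling as in the NFNet code. The function names are illustrative only, not timm's API.

```python
# Illustrative sketch only: contrasts the two stochastic-depth variants
# discussed above. Neither function is timm's actual implementation.
import torch

def drop_path_scaled(x: torch.Tensor, drop_prob: float, training: bool) -> torch.Tensor:
    """EfficientNet-style: kept samples are rescaled by 1/keep_prob."""
    if drop_prob == 0.0 or not training:
        return x
    keep_prob = 1.0 - drop_prob
    # one Bernoulli draw per sample, broadcast across the remaining dims
    mask_shape = (x.shape[0],) + (1,) * (x.ndim - 1)
    mask = (torch.rand(mask_shape, device=x.device) < keep_prob).to(x.dtype)
    return x * mask / keep_prob

def drop_path_unscaled(x: torch.Tensor, drop_prob: float, training: bool) -> torch.Tensor:
    """NFNet-style: dropped samples are zeroed, kept samples pass through unchanged."""
    if drop_prob == 0.0 or not training:
        return x
    keep_prob = 1.0 - drop_prob
    mask_shape = (x.shape[0],) + (1,) * (x.ndim - 1)
    mask = (torch.rand(mask_shape, device=x.device) < keep_prob).to(x.dtype)
    return x * mask
```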
-
Thanks, your reply solved my problem.
-
@liu09114 I've moved this to a discussion so that it's more visible to others who might have the same question.
-
@rwightman beyond the original point: it scales the inputs during training but not during inference. This is obviously wrong. People are lucky that a layer norm usually follows, but this implementation has now spread to many repos, including dinov2.
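To make the train/inference asymmetry being described concrete, here is a small self-contained sketch of a residual block with per-sample drop path (names and the stand-in branch are illustrative, not the code from any of the repos mentioned):

```python
# Self-contained illustration of the train vs. inference behavior discussed
# above: at train time whole samples of the residual branch are zeroed and the
# survivors rescaled by 1/keep_prob; at inference the branch is left untouched.
import torch

def residual_block(x: torch.Tensor, drop_prob: float, training: bool) -> torch.Tensor:
    residual = x * 2.0                            # stand-in for a conv/MLP branch
    if training and drop_prob > 0.0:
        keep_prob = 1.0 - drop_prob
        mask_shape = (x.shape[0],) + (1,) * (x.ndim - 1)
        mask = (torch.rand(mask_shape, device=x.device) < keep_prob).to(x.dtype)
        residual = residual * mask / keep_prob    # scaling happens only in training
    return x + residual

x = torch.ones(8, 4)
train_out = residual_block(x, drop_prob=0.5, training=True)   # rows are 1.0 or 5.0
eval_out = residual_block(x, drop_prob=0.5, training=False)   # all rows are 3.0
```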
-
https://github.com/rwightman/pytorch-image-models/blob/3f9959cdd28cb959980abf81fc4cf34f32e18399/timm/models/layers/drop.py#L140
Why is the result of drop_path divided by keep_prob?
As I understand it, the drop is applied per sample, and dropping the path for one sample shouldn't influence another sample.
If you do want to use division, it should be applied to the shortcut branch, not the residual branch.
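As a hedged illustration of what that division does in expectation (a sketch of the per-sample masking pattern, not timm's exact code at the linked line), here is a quick numerical check:

```python
# Quick numerical check: dividing the kept residuals by keep_prob keeps the
# mean of the stochastically dropped branch close to its inference-time value,
# while omitting the division biases it low by the drop probability.
import torch

torch.manual_seed(0)
residual = torch.full((100_000, 1), 2.0)
drop_prob = 0.25
keep_prob = 1.0 - drop_prob

mask = (torch.rand(residual.shape[0], 1) < keep_prob).to(residual.dtype)
scaled = residual * mask / keep_prob       # training-time branch with scaling
unscaled = residual * mask                 # training-time branch without scaling

print(scaled.mean().item())    # ~2.0, matches the inference-time branch
print(unscaled.mean().item())  # ~1.5, reduced by the drop probability
```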