Replies: 4 comments 4 replies
-
@liu09114 that was the implementation in the original EfficientNet paper https://github.com/tensorflow/tpu/blob/298d1fa98638f302ab9df34d9d26bbded7220e8b/models/official/efficientnet/utils.py#L276 The goal was to reproduce that, so I kept it. It has no impact on inference, but you could argue that for training it better preserves the batch stats. I recall that for the NFNet paper/impl they don't scale by default, https://github.com/deepmind/deepmind-research/blob/master/nfnets/base.py#L202 ... I was planning to try this side-by-side at some point but haven't had a chance.
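For context, here is a minimal sketch of the two variants being compared: scaling by 1/keep_prob as in the EfficientNet reference code vs. no scaling as in the NFNet code. The function names are illustrative only, not timm's API.

```python
# Illustrative sketch only: contrasts the two stochastic-depth variants
# discussed above. Neither function is timm's actual implementation.
import torch

def drop_path_scaled(x: torch.Tensor, drop_prob: float, training: bool) -> torch.Tensor:
    """EfficientNet-style: kept samples are rescaled by 1/keep_prob."""
    if drop_prob == 0.0 or not training:
        return x
    keep_prob = 1.0 - drop_prob
    # one Bernoulli draw per sample, broadcast across the remaining dims
    mask_shape = (x.shape[0],) + (1,) * (x.ndim - 1)
    mask = (torch.rand(mask_shape, device=x.device) < keep_prob).to(x.dtype)
    return x * mask / keep_prob

def drop_path_unscaled(x: torch.Tensor, drop_prob: float, training: bool) -> torch.Tensor:
    """NFNet-style: dropped samples are zeroed, kept samples pass through unchanged."""
    if drop_prob == 0.0 or not training:
        return x
    keep_prob = 1.0 - drop_prob
    mask_shape = (x.shape[0],) + (1,) * (x.ndim - 1)
    mask = (torch.rand(mask_shape, device=x.device) < keep_prob).to(x.dtype)
    return x * mask
```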
-
Thanks, your reply solved my problem.
-
@liu09114 I've moved this to a discussion so that it's more visible to others who might have the same question.
-
@rwightman beyond the original point: it scales the inputs during training but not during inference. This is obviously wrong. People are lucky that a layer norm usually follows, but this implementation has now spread to many repos, including dinov2.
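To make the train/inference asymmetry being described concrete, here is a small self-contained sketch of a residual block with per-sample drop path (names and the stand-in branch are illustrative, not the code from any of the repos mentioned):

```python
# Self-contained illustration of the train vs. inference behavior discussed
# above: at train time whole samples of the residual branch are zeroed and the
# survivors rescaled by 1/keep_prob; at inference the branch is left untouched.
import torch

def residual_block(x: torch.Tensor, drop_prob: float, training: bool) -> torch.Tensor:
    residual = x * 2.0                            # stand-in for a conv/MLP branch
    if training and drop_prob > 0.0:
        keep_prob = 1.0 - drop_prob
        mask_shape = (x.shape[0],) + (1,) * (x.ndim - 1)
        mask = (torch.rand(mask_shape, device=x.device) < keep_prob).to(x.dtype)
        residual = residual * mask / keep_prob    # scaling happens only in training
    return x + residual

x = torch.ones(8, 4)
train_out = residual_block(x, drop_prob=0.5, training=True)   # rows are 1.0 or 5.0
eval_out = residual_block(x, drop_prob=0.5, training=False)   # all rows are 3.0
```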
-
https://github.com/rwightman/pytorch-image-models/blob/3f9959cdd28cb959980abf81fc4cf34f32e18399/timm/models/layers/drop.py#L140
Why is the result of drop_path divided by keep_prob?
As I understand it, the drop is applied per sample, and dropping the path for one sample shouldn't influence another sample.
If you do want to use division, it should be applied to the shortcut branch, not the residual branch.
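As a hedged illustration of what that division does in expectation (a sketch of the per-sample masking pattern, not timm's exact code at the linked line), here is a quick numerical check:

```python
# Quick numerical check: dividing the kept residuals by keep_prob keeps the
# mean of the stochastically dropped branch close to its inference-time value,
# while omitting the division biases it low by the drop probability.
import torch

torch.manual_seed(0)
residual = torch.full((100_000, 1), 2.0)
drop_prob = 0.25
keep_prob = 1.0 - drop_prob

mask = (torch.rand(residual.shape[0], 1) < keep_prob).to(residual.dtype)
scaled = residual * mask / keep_prob       # training-time branch with scaling
unscaled = residual * mask                 # training-time branch without scaling

print(scaled.mean().item())    # ~2.0, matches the inference-time branch
print(unscaled.mean().item())  # ~1.5, reduced by the drop probability
```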