FIX: Removed duplicate convolution for DoRA #2153
base: main
Conversation
Thanks for this PR. We're aware of this potential inefficiency but I think it's not as easy as re-using the base result. The reasoning is explained here. Back then, we were only dealing with linear layers but I'm certain that the same logic applies to convolutional layers. The good news is that this optimization is indeed possible if dropout is set to 0 or if we're in eval mode, see #2122. LMK what you think.
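To illustrate the condition being referred to (a minimal sketch; the function name and argument handling are assumptions for this sketch, not PEFT API):

```python
import torch.nn as nn

def can_reuse_base_result(dropout_module: nn.Module, training: bool) -> bool:
    # The DoRA branch needs conv(dropout(x), W). The base layer already produced
    # conv(x, W). The two coincide only when dropout is a no-op, i.e. p == 0
    # (nn.Identity) or the module is in eval mode.
    if isinstance(dropout_module, nn.Identity):
        return True
    if isinstance(dropout_module, nn.Dropout) and dropout_module.p == 0.0:
        return True
    return not training
```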
Thanks for the clarification! Would it be possible to apply the dropout for DoRA similar to how LoRA handles it, i.e. only on the input of the low-rank branch? Also, what do you think of the fix for convolutional layers using the "groups" argument?
Could you clarify what you mean here, maybe with a small code example? Note that we have to ensure that the implementation sticks with the specification of the original paper. When we have no dropout, though, we should be able to make the same optimization as in #2122.
I wasn't aware of this.
So, regarding the reasoning behind why the DoRA optimization is not possible when we use dropout: after looking at the DoRA paper, I can't figure out why this is an issue. If we see DoRA as a LoRA update plus an additional magnitude re-scaling, then, based on the paper, why would we need to compute the full-rank convolution again with a dropped-out input?
Note that when we have LoRA+DoRA+dropout, we ensure that dropout is consistently applied to the LoRA part and the "base_result" part. If we use the output of the base layer directly, which is computed without dropout, that consistency is lost.
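Schematically, the consistency argument looks like this (a simplified sketch, not the actual PEFT code; `mag_norm_scale` is assumed to already hold m / ||W + scaling·BA|| reshaped to broadcast over the output channels):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def dora_conv2d_forward(x, base: nn.Conv2d, lora_A: nn.Conv2d, lora_B: nn.Conv2d,
                        mag_norm_scale: torch.Tensor, scaling: float, dropout: nn.Module):
    # What the wrapping LoRA layer already computed, on the *un-dropped* input:
    base_result = base(x)

    # Inside the DoRA branch, dropout is applied once and the dropped-out input
    # feeds BOTH the re-computed base convolution and the low-rank update:
    x_d = dropout(x)
    conv_xd_w = F.conv2d(x_d, base.weight, None, base.stride,
                         base.padding, base.dilation, base.groups)
    lora_out = lora_B(lora_A(x_d)) * scaling

    # mag_norm_scale is m / ||W + scaling * B A||, shaped (1, out_channels, 1, 1)
    return base_result + (mag_norm_scale - 1) * conv_xd_w + mag_norm_scale * lora_out
```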
I think I understand your point. But if we look at LoRA (e.g. ln 1120 in layer.py) we see that we also don't apply the lora_dropout to the base result.
So my question is whether the "base_result" part even needs the dropout for DoRA. And if we do need the dropout in the "base_result" part, why do we not need it for LoRA?
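For reference, the plain LoRA forward is roughly of this shape (a schematic sketch of the structure, not the literal code at that line):

```python
import torch.nn as nn

def lora_linear_forward(x, base: nn.Linear, lora_A: nn.Linear, lora_B: nn.Linear,
                        scaling: float, dropout: nn.Module):
    result = base(x)                                        # base path: no dropout on x
    result = result + lora_B(lora_A(dropout(x))) * scaling  # dropout only on the LoRA branch
    return result
```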
Exactly. The base result used inside the DoRA branch has to be computed on the dropped-out input, i.e. conv(dropout(x), W). In your proposed code, we would instead use the output of the base layer, i.e. conv(x, W). Only if there is no dropout can we re-use the base result. You can also create a LoRA layer with DoRA and check that the outputs differ when dropout is applied between the old code and the suggested change (fix the seed to ensure that there is no randomness involved).
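To make the difference concrete, here is a small self-contained check in the spirit of that suggestion (linear case for brevity; all names are chosen for this sketch and the formulas follow the decomposition discussed above):

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
in_f, out_f, r, p = 8, 4, 2, 0.5
x = torch.randn(3, in_f)
W = torch.randn(out_f, in_f)
A, B = torch.randn(r, in_f), torch.randn(out_f, r)
m, scaling = torch.ones(out_f), 1.0

delta_w = scaling * B @ A
mag_norm_scale = m / (W + delta_w).norm(dim=1)        # m / ||W + scaling*BA|| per row

x_d = x * (torch.rand_like(x) > p).float() / (1 - p)  # one fixed dropout realization

# Current behaviour: dropout applied consistently to the base and low-rank parts.
current = F.linear(x, W) + (mag_norm_scale - 1) * F.linear(x_d, W) \
          + mag_norm_scale * F.linear(x_d, delta_w)

# Proposed shortcut: re-use the un-dropped base result in the DoRA term.
proposed = F.linear(x, W) + (mag_norm_scale - 1) * F.linear(x, W) \
          + mag_norm_scale * F.linear(x_d, delta_w)

print(torch.allclose(current, proposed))  # False: the two differ while dropout is active
```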
Yes, I understand the reasoning and that my suggestion would produce a different output. I just could not find the reasoning for why the dropout needs to be applied like this in the DoRA paper. But I'll assume you're right. The reason why I am questioning this is because we do not seem to use the dropout in the same way for LoRA. So my question is: Why do we not apply the dropout to the base result for LoRA, but do apply it for DoRA?
Okay, I understand now. Our calculation for DoRA is a bit more complicated than if we simply followed the equation from the paper. The reason is that we first calculate the base result, i.e. the output of the base layer on the un-dropped input, and then add the DoRA adjustment on top of it. Trying to fit this into an equation, this is what I get (note that I omitted the scale for simplicity):

y = W·x + (m/||W'|| − 1)·W·dropout(x) + (m/||W'||)·ΔW·dropout(x), where ΔW = BA and W' = W + ΔW

If we did not have the dropout in the base result part, then the first 2 terms in the final equation, W·x + (m/||W'|| − 1)·W·x, would collapse to (m/||W'||)·W·x, i.e. the dropout would then only affect the low-rank part, which deviates from applying the DoRA formulation to the (dropped-out) input.

Not sure if @nbasyl would have time to check this.
Right! And if we look at the implementation of LoRA (ln 1121 in layer.py), we see that there we also can't simplify the term W·x + ΔW·dropout(x) into (W + ΔW)·dropout(x), because the dropout only touches the low-rank branch. Since the dropout we are talking about here is actually a lora_dropout, shouldn't it affect only the low-rank part rather than the base weight as well?
Any updates on this?
Not sure if Shih-Yang currently has time to look into this. In the meantime, how about opening a separate PR for the groups fix?
This pull request fixes two problems:

1. Duplicate Convolution in DoRA implementation for ConvNd Layers: Since the base layer convolution is already computed in `layer.py`, we don't need to compute it again in `dora.py`. Computing it twice roughly doubles the FLOPs of the forward pass. We can pass the result from the base layer computed in `layer.py` to the forward pass of the `_DoraConvNdLayer` in `dora.py` and save computational resources.

2. Bugfix for DoRA regarding Convolutional Layers using the Groups Argument: CNNs that, for example, use depthwise separable convolutional layers raise an error when DoRA is applied. Adjusting the dimensions of the `conv_layer` in `layer.py` fixes this issue.

A rough sketch of the first change is given below.
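For illustration only, a minimal sketch of the shape of the first change (simplified signature; the function name and argument handling are assumptions for this sketch, not the actual PEFT code):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def dora_conv2d_branch(x, base_layer: nn.Conv2d, lora_A: nn.Conv2d, lora_B: nn.Conv2d,
                       mag_norm_scale: torch.Tensor, scaling: float, base_result=None):
    """Sketch of fix 1: optionally re-use the convolution already computed in layer.py."""
    if base_result is None:
        # Old behaviour: compute conv(x, W) a second time inside the DoRA branch.
        base_result = F.conv2d(x, base_layer.weight, None, base_layer.stride,
                               base_layer.padding, base_layer.dilation, base_layer.groups)
    elif base_layer.bias is not None:
        # The result handed over from the wrapping layer includes the bias; strip it
        # so that the magnitude re-scaling is applied to conv(x, W) only.
        base_result = base_result - base_layer.bias.view(1, -1, 1, 1)

    lora_result = lora_B(lora_A(x)) * scaling
    return (mag_norm_scale - 1) * base_result + mag_norm_scale * lora_result
```

For the second fix, the point is presumably that any temporary convolution derived from the base layer's hyperparameters has to respect the base layer's `groups` setting (and the correspondingly reduced per-group input channels) so that grouped or depthwise convolutions get matching weight shapes; the sketch above simply forwards `base_layer.groups` when it recomputes the convolution.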