Optimize DoRA in eval and no dropout #2122
Conversation
src/peft/tuners/lora/layer.py (outdated)

```python
if isinstance(dropout, nn.Identity):
    print("no dropout, optimize here")
else:
    print("dropout, same ops")
```
@BenjaminBossan did you envision something like this?
My intuition was:
- Figure out whether there is dropout or not
- Use a flag for dropout
- Pass the flag to the `forward` or DoRA layers, where I would need to skip the alignment step and reuse `x` (the base model outputs)

Let me know if I am on the right track.
Note: I could not figure out a way to catch whether the model was in eval mode. How would you have done it?
Yes, I think the dropout check is valid as is. Regarding eval mode, I think that checking `self.training` should work.
On how to proceed, my thinking was that if we find that we can make this optimization, we pass the base result as an additional argument to the DoRA `forward` (the default for that argument being `None`), and there we use this base result if it's given; if not, we calculate it like we currently do. Could be that I'm missing something, but that's my idea.
The good news is that since we have a working implementation, we can then compare the results of both approaches and they should be identical (of course not when there is dropout, but apart from that).
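For illustration, here is a minimal, self-contained sketch of the check discussed above (not the PR's actual code; the helper name is made up): dropout is a no-op when the configured dropout module is `nn.Identity` or when the layer is in eval mode, and only in those cases can the base layer's output be reused by DoRA.

```python
from torch import nn

# Hypothetical helper, for illustration only: the base layer's output can be
# reused by the DoRA computation whenever dropout does nothing, i.e. the
# configured dropout is nn.Identity (lora_dropout=0) or the module is in eval
# mode (module.training is False).
def can_reuse_base_result(module: nn.Module, dropout: nn.Module) -> bool:
    return isinstance(dropout, nn.Identity) or not module.training

layer = nn.Linear(8, 8)       # stand-in for the LoRA-wrapped base layer
dropout = nn.Dropout(p=0.1)

layer.train()
print(can_reuse_base_result(layer, dropout))        # False: dropout is active
layer.eval()
print(can_reuse_base_result(layer, dropout))        # True: eval mode disables dropout
print(can_reuse_base_result(layer, nn.Identity()))  # True: dropout was configured as 0
```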
Neat solution!
Thanks for implementing this task. I have a few comments:
I don't think we need the `do_optimize` argument. When `result` is passed, let's use it; otherwise don't.
Also, let's refactor this a bit and avoid an early return. Instead, let's do something like:

```python
if result is not None:
    # let's also add a comment to explain
    base_result = ...
else:
    base_result = F.linear(x, transpose(weight, self.fan_in_fan_out))
```

Then in this line, replace `F.linear(x, transpose(weight, self.fan_in_fan_out))` by `base_result`.
A small caveat I see with this approach is that we assume that we can just remove the bias to get the base result. This will work for normal `nn.Linear` layers but may not work for other types. But let's maybe not worry about that for now.
We should also run an example with and without the optimization to ensure that we get the same results and also a speedup and memory improvement.
Thanks for the update. I left a comment where I think the calculation is not quite right.
Also, it would be great if you could check a DoRA example to see the changes caused by this PR. Probably one of the existing examples could be used. We should ensure that:
- The results are the same (assuming dropout is 0)
- Training should be faster
- Memory usage should be lower
src/peft/tuners/lora/dora.py (outdated)

```python
bias = base_layer.bias
if bias is not None:
    base_result = base_result - bias
result_dora = mag_norm_scale * base_result + mag_norm_scale * lora_result * scaling
```
Hmm, wait, should this not be the exact same calculation as in line 103? I.e. we should leave the condition after calculating the `base_result` and then do the same calculation of `dora_result` for both cases.
I am not sure that I follow. With the `base_result` in place:
- We first subtract the bias
- Compute the `dora_result`, where we scale the `base_result` with `mag_norm_scale`

But without the `base_result`:
- We compute the `base_result` with the linear forward
- Compute the `dora_result`, where we scale the `base_result` with `(1 - mag_norm_scale)`

Aren't they going to be different for each case?
Okay, so I'm a bit confused, let's try to resolve this.
In the old code, we basically have:

```python
dora_result = (mag_norm_scale - 1) * base_result + mag_norm_scale * lora_result * lora_scale
```

(variable names slightly changed for clarity)

My thinking is that the `base_result` is either calculated right there (old code) or we use the `base_result` that is being passed as an argument, but the basic equation stays the same.
Of course, as you correctly noted, the bias needs to be subtracted first and then added back in the latter case.
In the currently proposed code, in one case we calculate `mag_norm_scale * base_result` and in the other `(mag_norm_scale - 1) * base_result`. This looks inconsistent to me.
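For the record, a tiny numerical check of the identity behind this discussion (values are arbitrary, not taken from the PR): because the caller in layer.py adds the DoRA term on top of the base output it has already computed, the DoRA branch must return a delta with the `(mag_norm_scale - 1)` factor; both forms describe the same final output, they just differ in what the caller is expected to add.

```python
import torch

torch.manual_seed(0)
base_result = torch.randn(2, 4)    # stands in for F.linear(x, W), bias already removed
lora_result = torch.randn(2, 4)    # stands in for lora_B(lora_A(x))
mag_norm_scale = torch.rand(1, 4)  # stands in for magnitude / weight norm
scaling = 2.0

# The full DoRA output, written directly:
direct = mag_norm_scale * base_result + mag_norm_scale * lora_result * scaling

# What the layer actually computes: the base result plus a DoRA delta:
delta = (mag_norm_scale - 1) * base_result + mag_norm_scale * lora_result * scaling
via_delta = base_result + delta

print(torch.allclose(direct, via_delta))  # True
```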
I have made the changes as suggested.
I think the change is still not correct. Here is my suggestion:

```python
bias = None
if base_result is not None:
    bias = base_layer.bias
    if bias is not None:
        base_result = base_result - bias
else:
    base_result = F.linear(x, transpose(weight, self.fan_in_fan_out))

result_dora = (mag_norm_scale - 1) * base_result + mag_norm_scale * lora_result * scaling
if bias is not None:
    result_dora = result_dora + bias
```

This way, if `base_result` is `None`, the computation is exactly the same as it was previously.
I believe the confusion may stem from my comment:

```python
# result = mag_norm_scale * result + mag_norm_scale * lora_B(lora_A(x)) * scaling
```

This comment should have been:

```python
# result = (mag_norm_scale - 1) * result + mag_norm_scale * lora_B(lora_A(x)) * scaling
```

Does that make sense?
That makes sense!
About the unit test -- do you want me to add a new test file? Or add a test somewhere?
@BenjaminBossan could you help me with an initial thought for an example that shows the efficiency of the process? As you have said, you would like to see memory usage and faster training. I am not very sure how I could explain the efficiency of memory usage in a colab notebook. Any resource would be very helpful.
Thanks for the further work.

> As you have said, you would like to see memory usage and faster training. I am not very sure how I could explain the efficiency of memory usage in a colab notebook. Any resource would be very helpful.

I think we don't need to be overly strict for this; we could just run an example or two, once before and once after the changes, and monitor memory usage and runtime. I can also do that once we're ready to test.
For unit tests, we should just ensure that they cover this new code path; measuring memory and runtime is not reliable enough for unit testing.
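As a rough sketch of what such a unit test could look like (the toy module, config values, and tolerance below are illustrative assumptions, not peft's actual test suite), one could check that with `lora_dropout=0` the eval-mode forward, which can reuse the base result, matches the training-mode forward:

```python
import torch
from torch import nn
from peft import LoraConfig, get_peft_model


class TinyModel(nn.Module):
    def __init__(self):
        super().__init__()
        self.linear = nn.Linear(16, 16)

    def forward(self, x):
        return self.linear(x)


def test_dora_eval_matches_train_when_dropout_is_zero():
    torch.manual_seed(0)
    config = LoraConfig(target_modules=["linear"], r=4, lora_alpha=8, lora_dropout=0.0, use_dora=True)
    model = get_peft_model(TinyModel(), config)

    # Nudge the adapter weights away from their init so that the DoRA branch
    # actually contributes to the output.
    for name, param in model.named_parameters():
        if "lora_" in name:
            param.data.normal_(0.0, 0.1)

    x = torch.randn(2, 16)
    model.train()
    with torch.no_grad():
        out_train = model(x)
    model.eval()
    with torch.no_grad():
        out_eval = model(x)

    assert torch.allclose(out_train, out_eval, atol=1e-6)
```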
There are still some lingering issues, as witnessed by the failing tests. Could you check my suggestion please?
@ariG23498 Could you please run [...]
Hi @ariG23498 and @BenjaminBossan, thanks for the effort, I have reviewed the code and found no issue with it. Please let me know or tag me if you need my help, thanks!
Thanks @ariG23498 for the latest fixes and @nbasyl for the review. I did a small test using this DoRA script by calling:
(I changed grad acc steps from 16 to 2). For this to work, I had to propagate the DoRA changes from this PR to the bitsandbytes layers. What I found is:
I also monitored memory and it went down from 7557 MiB to 7325 MiB. So the final losses are not 100% identical, but I think it's within rounding error. Runtime was improved and memory usage slightly decreased with this PR. Overall, I believe these are nice results and we can continue with this PR. @ariG23498 could you please propagate the changes to the quantized LoRA layer types that support it. We could probably also document this to let users know that they should consider disabling dropout for DoRA training to benefit from this optimization, with some numbers to underline this.
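For reference, a rough sketch of how such a before/after comparison can be instrumented with plain PyTorch bookkeeping (this is generic measurement code, not part of the PR; `run_training` is a placeholder for whatever wraps the DoRA finetuning loop):

```python
import time
import torch

def benchmark(run_training):
    """Run `run_training` (a zero-argument callable wrapping the finetuning loop)
    once and report wall time and peak GPU memory; call this once on main and
    once on the PR branch with identical settings."""
    torch.cuda.reset_peak_memory_stats()
    start = time.perf_counter()
    run_training()
    elapsed = time.perf_counter() - start
    peak_mib = torch.cuda.max_memory_allocated() / 1024**2
    print(f"wall time: {elapsed:.1f} s, peak memory: {peak_mib:.0f} MiB")
```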
Thanks for the detailed reply @BenjaminBossan.
Do you mean all the variants found here? Also, I think it would be better to have the current change made to DoRA only, and then create another PR for the rest of the layers, WDYT?
@ariG23498 @BenjaminBossan very nice PR, I learned a lot. @ariG23498 let me know if I can be of help to propagate the changes, maybe in a separate PR.
Yes, so what I mean is that e.g. in [...] Other than that, let's add a mention of this optimization in the docs. Ideally, we can add some numbers. I can re-run the experiment mentioned above and give some definitive numbers if you want.
Thanks for the offer. As mentioned, let's try to get it into this PR. @ariG23498 up to you if/how you want to split up the work.
@BenjaminBossan I have made the changes. @charchit7 thank you for the offer, but as this is a repetitive piece of code, I thought it was better to make the changes myself. Please feel free to take up other issues and comment for collaboration 🤗
Thanks for the update. Let's also add it here: peft/src/peft/tuners/lora/tp_layer.py, lines 214 to 221 (at 93ddb10).
The other layer types don't seem to properly implement DoRA yet, so we can keep those for a separate PR. Would you also be so kind as to add this to the docs?
Yes, I completely understand. Thank you, yes, will do :)
Thanks for the updates. I'll re-run the script later, as the first test was only very short, to get some final numbers to report.
Update: So I re-ran the script for a while longer and with a higher batch size (before it was 1), using: [...]
I also set [...]. What I found is that training was 20% (wall time) to 23% (transformers-reported time) faster. However, there was no longer any memory advantage; not quite sure what was different the first time around.
Losses aligned quite nicely, with only a rounding-error level of difference. @ariG23498 Could you please update the docs accordingly (no need to mention the loss, as this is expected).
@BenjaminBossan the docs have been updated! The benchmark results are crazy.
docs/source/developer_guides/lora.md (outdated)

DoRA is optimized (computes faster and takes less memory) for models in the evaluation mode, or when dropout is set to 0. We reuse the base result at those times to get the speedup. Running [dora finetuning](https://github.com/huggingface/peft/blob/main/examples/dora_finetuning/dora_finetuning.py) with `CUDA_VISIBLE_DEVICES=0 time python dora_finetuning.py --quantize --lora_dropout 0 --use_dora`, these were the observations:
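As a usage illustration of the documented behavior (the target module names below are placeholders, not taken from the example script), a configuration that hits the optimized path might look like this:

```python
from peft import LoraConfig

# DoRA with dropout disabled: the base layer's output can be reused instead of
# being recomputed inside the DoRA branch. The same fast path applies in eval
# mode, where dropout is inactive regardless of this setting.
config = LoraConfig(
    r=16,
    lora_alpha=32,
    target_modules=["q_proj", "v_proj"],  # placeholder module names
    lora_dropout=0.0,
    use_dora=True,
)
```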
Could you please update the command to reflect the last one I used (including the batch size and grad acc)? Also, no need for line breaks in the docs. You could also mention that this was tested on a 4090.
Updated!
Thanks for implementing this DoRA speed improvement and applying it to all relevant layers (+ some other minor fixes you did in this PR). LGTM.
Let's keep in mind that we can still roll this out to more layer types (see #2153) and not all quants implement DoRA yet.
Fixes #2107