FEAT add DeLoRA #2780
Conversation
Thanks a lot for this PR to add DeLoRA to PEFT. The PR is already quite advanced and in a nice state, good job with that.
I did a review of the PR and added a few comments, please check those. But while working on this, I also noticed some more general areas for discussion, which should be addressed first:
Naming
In general, I would suggest renaming variables from `DeLoRA` to `Delora` or `DeLora`. Not only is that easier to type, it's also consistent with the other names, e.g. `LoraConfig`.
Initialization
I think a lot of the complexity of the implementation comes from the initialization with `use_residual_init=True`. This is the reason why we need the `delora_initial_*` parameters and updates to methods like `_load_from_state_dict`, right? I wonder if we can think of a way to significantly reduce this complexity.
First of all, I still need to get a good understanding of why this is required in the first place. Do I get it right that normally, initializing DeLoRA cannot be done in a way that makes it an identity transform? So there is no choice of `delora_A` and `delora_B` that would make their contribution zero? For me, that would be the ideal solution.
Second, how important is it to have the `use_residual_init=True` option? Is it better for training, or could we just drop it entirely? I know that this would require some adjustments to the unit tests, as some tests assume that the adapter is an identity transform without training. But these adjustments can be done, and it is not a requirement for PEFT methods to have an initialization option that comes down to identity.
If this option is not strictly needed for performance reasons but mainly to satisfy tests, I would rather remove it, eliminate a lot of the complexity associated with it, and then update the tests.
Device and dtype handling
There is a lot of code that deals with moving the DeLoRA parameters to the correct device and dtype, such as:
A = self.delora_A[a].to(dtype=comp_dtype, device=device)
B = self.delora_B[a].to(dtype=comp_dtype, device=device)
alpha = self.delora_alpha[a].to(dtype=comp_dtype, device=device)
Would it not make sense to initialize them with the correct device and dtype right from the start (as is normally done for other PEFT methods) instead of casting and moving them each time, which can be costly? Or is this some sort of memory optimization?
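For illustration, here is a minimal sketch of what creating the parameters with the base layer's device and dtype up front could look like; the creation site and variable names are assumptions, not the PR's actual code:

```python
import math

import torch
import torch.nn as nn

# Hypothetical sketch: inside the layer's update_layer(), derive device/dtype
# from the base layer's weight so no per-forward .to(...) casts are needed.
base_weight = torch.randn(16, 32)  # stand-in for self.get_base_layer().weight
r = 8

delora_A = nn.Parameter(
    torch.empty(r, base_weight.shape[1], dtype=base_weight.dtype, device=base_weight.device)
)
delora_B = nn.Parameter(
    torch.empty(base_weight.shape[0], r, dtype=base_weight.dtype, device=base_weight.device)
)
nn.init.kaiming_uniform_(delora_A, a=math.sqrt(5))
nn.init.kaiming_uniform_(delora_B, a=math.sqrt(5))
```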
Other
Those are the big points I found while reviewing, and they should be the focus of the next iteration. For the final PR, I would also like to see an example (it could be a copy of an existing example, modified to use DeLoRA) and a setting for the PEFT method comparison suite.
Finally, please always call `make style` before pushing your changes.
src/peft/tuners/delora/config.py
Outdated
r (`int`):
    DeLoRA rank.
alpha (`int`):
    The alpha parameter for DeLoRA boundary.
Just curious: IIUC, this corresponds to lambda in the paper, right? Why was it renamed? Because the keyword is protected in Python, or to make it more similar to `lora_alpha`? But this alpha has a different role than alpha has for LoRA.
Yes, I didn't use lambda because it is protected. But indeed, using alpha could be confusing. For consistency with the paper, I have changed it to `delora_lambda` and `lambda_` now.
Hi @BenjaminBossan, thank you for the thorough review! I made a few changes following your suggestions and will address your main concerns below.
- Naming: I agree, I have now renamed everything to `Delora`.
- Initialization: the method requires an identity initialization to work properly. To achieve this while avoiding divisions by zero in the normalization (which could lead to instabilities), the idea is to add and subtract the same initial weights, W + BA - BA, and keep one of the copies frozen (the `delora_initial_*` parameters).
- Device and dtype handling: a lot of these casts were indeed redundant/unnecessary. I simplified the code and removed most of them.
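To make the residual-initialization idea above concrete, here is a minimal, self-contained sketch. It is not the PR's actual code: the normalization is only an illustrative stand-in for DeLoRA's full update, which also involves the lambda boundary and the base-weight norm.

```python
import torch

torch.manual_seed(0)
out_f, in_f, r = 8, 6, 4

W = torch.randn(out_f, in_f)   # frozen base weight
A = torch.randn(r, in_f)       # trainable
B = torch.randn(out_f, r)      # trainable; cannot simply be zero because of the normalization
A0, B0 = A.clone(), B.clone()  # frozen copies, i.e. the delora_initial_* parameters

def normalized_delta(B, A):
    # Illustrative stand-in: each rank-one term b_i a_i^T is divided by ||b_i|| * ||a_i||,
    # which is why B = 0 would make the denominator zero without the clamp.
    scale = (B.norm(dim=0) * A.norm(dim=1)).clamp(min=1e-4)
    return (B / scale) @ A

# Residual init: W + BA - B0A0 equals W at initialization, so the adapter
# starts out as an identity transform without requiring B = 0.
W_eff = W + normalized_delta(B, A) - normalized_delta(B0, A0)
assert torch.allclose(W_eff, W)
```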
Thanks for the updates. I did a light pass over the changes and added some comments. Some of my previous comments were not addressed yet, so please take a look at them too.
Regarding the initialization, thanks for explaining. So IIUC, we would want to have W' + BA = W_base because it stabilizes training. We cannot set B=0 because of the normalization step. I wonder if there is still another way to get an identity initialization, or at least something close to it, without either saving copies or modifying the base weight, but choosing A and B in a careful way. Perhaps there is a possibility through orthonormal initialization with QR decomposition, but I'm not sure how this could be preserved with the normalization step.
If it's not possible, I think we can live with the extra complexity, but I believe it also means that, for `use_residual_init=True`, we cannot correctly deal with more than one DeLoRA adapter at a time, is that right? So we would need to implement a check in `_check_new_adapter_config` that at most one adapter with `use_residual_init=True` is being used.
For `use_residual_init=False`, I'd like to see what it entails in terms of overhead. Perhaps we can test it out with the PEFT method comparison suite.
Thank you for the feedback. I should now have addressed most of the comments; please let me know if something is missing. Regarding the initialization, I am not sure what the best approach would be. Maybe B=0 could still be used by setting a large enough min value in `torch.clamp`. As a next step, I will set up some experiments to check this zero initialization, the overhead of using `use_residual_init=False`, etc. Regarding multiple adapters, it would indeed not be possible with `use_residual_init=True`. For now, I have added a check for this in `_check_new_adapter_config`.
Thanks for the updates. I found a couple of new issues, but nothing big. Also, please merge with/rebase on the latest PEFT main branch and resolve any merge conflicts that may occur.
> Regarding the initialization I am not sure what could be the best approach. Maybe B=0 could still be used by setting a large enough min value in torch.clamp... As a next step, I will try to setup some experiments and check this zero initialization, the overhead of using use_residual_init=False, etc.
Great, LMK if you find something.
src/peft/tuners/delora/layer.py
Outdated
if adapter_name in self.delora_initial_lambda:
    del self.delora_initial_lambda[adapter_name]

def _move_adapter_to_device_of_base_layer(self, adapter_name: str, device: Optional[torch.device] = None) -> None:
It seems that we cannot use the `_move_adapter_to_device_of_base_layer` method from the parent class (`BaseTunerLayer`). Could you please add a short comment on what is different here?
src/peft/tuners/delora/layer.py
Outdated
if isinstance(base_layer, nn.MultiheadAttention):
    base_layer = base_layer.out_proj
MHA is not supported by DeLoRA so this can be removed.
src/peft/tuners/delora/layer.py
Outdated
if key.startswith(prefix):
    # Remove the prefix to get the relative key
    relative_key = key[len(prefix) :]
`relative_key = key.removeprefix(prefix)` should do the job.
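For reference, a quick illustration of the suggested built-in (`str.removeprefix` is available since Python 3.9); inside the `startswith` branch shown above it behaves the same as the slicing:

```python
key = "delora_A.default"  # hypothetical state-dict key
prefix = "delora_"

assert key.removeprefix(prefix) == key[len(prefix):] == "A.default"
# Unlike unconditional slicing, removeprefix leaves the string untouched if the prefix is absent:
assert "other_key".removeprefix(prefix) == "other_key"
```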
src/peft/tuners/delora/layer.py
Outdated
module_state_dict[relative_key] = value

# Manually load ParameterDict and BufferDict parameters
for param_name in ["delora_A", "delora_B", "delora_lambda"]:
We can use `self.adapter_layer_names` here, right?
src/peft/tuners/delora/model.py
Outdated
# TODO: there should be a check if any of the existing adapters actually has bias != "none", or else the check
# does not fully correspond to the error message.
if (len(self.peft_config) > 1) and (config.bias != "none"):
    raise ValueError(
        f"{self.__class__.__name__} supports only 1 adapter with bias. When using multiple adapters, "
        "set bias to 'none' for all adapters."
    )
Let's replace this with:
# TODO: there should be a check if any of the existing adapters actually has bias != "none", or else the check
# does not fully correspond to the error message.
if (len(self.peft_config) > 1) and (config.bias != "none"):
    raise ValueError(
        f"{self.__class__.__name__} supports only 1 adapter with bias. When using multiple adapters, "
        "set bias to 'none' for all adapters."
    )
super()._check_new_adapter_config(config)
src/peft/tuners/delora/model.py
Outdated
"set use_residual_init to False for all adapters." | ||
) | ||
|
||
@staticmethod |
We recently had a PR (#2771) that allows us to greatly reduce the number of methods required to implement here. You can simply remove the following methods:
_check_target_module_exists
_mark_only_adapters_as_trainable
__getattr__
get_peft_config_as_dict
_set_adapter_layers
enable_adapter_layers
disable_adapter_layers
set_adapter
_prepare_adapter_config
_unload_and_optionally_merge
merge_and_unload
unload
On the flip side, you need to add:
tuner_layer_cls = DeloraLayer
target_module_mapping = TRANSFORMERS_MODELS_TO_DELORA_TARGET_MODULES_MAPPING
as class attributes, i.e. directly below `prefix: str = "delora_"`.
For this to work, you also need to merge with/rebase on the latest main branch.
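For orientation, here is a hedged sketch of what the slimmed-down model class could look like after that change. The import locations and the body of `_check_new_adapter_config` are assumptions, not the PR's actual code:

```python
# Hypothetical sketch only. BaseTuner lives in peft.tuners.tuners_utils; the
# DeLoRA layer class and the target-module mapping would be defined by this PR,
# and their exact locations are assumed here.
from peft.tuners.tuners_utils import BaseTuner

from .constants import TRANSFORMERS_MODELS_TO_DELORA_TARGET_MODULES_MAPPING  # assumed location
from .layer import DeloraLayer


class DeloraModel(BaseTuner):
    prefix: str = "delora_"
    # New class attributes replacing the removed boilerplate methods:
    tuner_layer_cls = DeloraLayer
    target_module_mapping = TRANSFORMERS_MODELS_TO_DELORA_TARGET_MODULES_MAPPING

    def _check_new_adapter_config(self, config) -> None:
        # Only DeLoRA-specific validation remains here; generic plumbing such as
        # _mark_only_adapters_as_trainable, set_adapter, merge_and_unload, etc.
        # is inherited from BaseTuner after #2771.
        super()._check_new_adapter_config(config)
```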
tests/test_initialization.py
Outdated
def test_disable_enable_no_change(self, data):
    base = self.get_model()
    cfg = DeloraConfig(target_modules=["lin0"], r=8, lambda_=8)
    model = get_peft_model(base, cfg)

    y0 = model(data)
    model.disable_adapter_layers()
    y1 = model(data)
    model.enable_adapter_layers()
    y2 = model(data)

    assert torch.allclose(y0, y1, atol=1e-6, rtol=1e-6)
    assert torch.allclose(y0, y2, atol=1e-6, rtol=1e-6)

def test_merge_and_unload_same_outputs(self, data):
    base = self.get_model()
    cfg = DeloraConfig(target_modules=["lin0"], r=8, lambda_=8)
    model = get_peft_model(base, cfg)
    y_before = model(data)

    merged = model.merge_and_unload()
    merged.eval()
    y_after = merged(data)
    assert torch.allclose(y_before, y_after, atol=1e-6, rtol=1e-6)
I think this should already be well covered by the existing tests and can be removed here.
src/peft/tuners/delora/model.py
Outdated
"set bias to 'none' for all adapters." | ||
) | ||
# if there are multiple adapters, residual init cannot be used | ||
if (len(self.peft_config) > 1) and (config.use_residual_init): |
Hmm, would it work if I first add a DeLoRA adapter with `use_residual_init=False` and then a DeLoRA adapter with `use_residual_init=True`? I think it should work, but this check would result in an error. IIUC, the correct check is that there cannot be more than one config with `use_residual_init=True`.
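A minimal sketch of the stricter check described above, written as a hypothetical standalone helper for illustration (in the PR the logic would live inside `_check_new_adapter_config`, and the helper name is an assumption):

```python
def check_at_most_one_residual_init(peft_config: dict) -> None:
    # Count configs that request the residual initialization and only complain
    # if more than one of them does, instead of rejecting any second adapter.
    num_residual_init = sum(
        getattr(conf, "use_residual_init", False) for conf in peft_config.values()
    )
    if num_residual_init > 1:
        raise ValueError(
            "DeloraModel supports at most one adapter with use_residual_init=True. "
            "Set use_residual_init to False for the other adapters."
        )
```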
Hi @BenjaminBossan, thank you for your review. I have now pushed several changes (the last two commits) and rebased onto the latest version of the code. The main modification is that I removed the residual initialization in favor of zero initialization of B (making the DeLoRA init similar to LoRA's). Given all the complications the residual initialization was creating (requiring special versions of functions and limiting usability with multiple adapters), I removed it together with all related modifications (`use_residual_init`, `delora_initial_A/B`, etc.), so the method should now be much simpler to handle. Please let me know what you think about this, i.e. whether it would be better to keep a residual initialization option or whether it is fine to keep it simple. I also removed unnecessary methods and tests as you suggested; since `use_residual_init` and its related methods are gone, some of those modifications were no longer needed. I also added some configs for experiments on MetaMathQA.
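A small, self-contained illustration of the zero initialization described above (not the PR's code; the normalization is a simplified stand-in for DeLoRA's update): with B = 0 and the norms clamped from below, the initial update is exactly zero, so the adapted layer starts out identical to the base layer, just like LoRA.

```python
import torch
import torch.nn as nn

out_f, in_f, r = 8, 6, 4
A = nn.Parameter(torch.randn(r, in_f))
B = nn.Parameter(torch.zeros(out_f, r))  # zero init, as with lora_B

# Clamping the per-rank norms avoids the division by zero that previously
# motivated the residual initialization.
scale = (B.norm(dim=0) * A.norm(dim=1)).clamp(min=1e-4)
delta = (B / scale) @ A
assert torch.count_nonzero(delta) == 0  # the adapter contributes nothing at init
```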
The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.
Thanks for this last batch of changes.
> The main modification is that I removed the residual initialization in favor of the zero initialization of B (making DeLoRA init similar to LoRA).
Great, thanks for testing this out. Indeed, the code is much simpler now and therefore, I think this is a much nicer solution than what we had before.
> For this min value, I tried different choices; since NaNs can still appear in float16 for values below 1e-4, while leading to no significant performance differences in full precision, I set 1e-4 as the default value.
Do you foresee different use cases requiring different values here? If yes, you could make it configurable through `DeloraConfig`, with the default being 1e-4.
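For illustration, a hedged sketch of how this could be exposed through the config; the field name `norm_eps` and its usage in the layer are assumptions, not existing parameters:

```python
from dataclasses import dataclass, field

# Hypothetical sketch of the config field only; the real DeloraConfig derives
# from PeftConfig and has more fields.
@dataclass
class DeloraConfigSketch:
    norm_eps: float = field(
        default=1e-4,
        metadata={
            "help": "Lower bound used when clamping the norms in the DeLoRA update; "
            "values below 1e-4 can produce NaNs in float16."
        },
    )

# In the layer, the configured value would then replace the hard-coded minimum,
# e.g. scale = scale.clamp(min=norm_eps) -- the exact call site is an assumption.
```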
Apart from that, I only found a very small issue, which I commented on. Once you're finished, remember to call `make style` before pushing.
nn.init.kaiming_uniform_(self.delora_A[adapter_name], a=math.sqrt(5))
nn.init.kaiming_uniform_(self.delora_B[adapter_name], a=math.sqrt(5))

self.delora_lambda[adapter_name].data.fill_(float(delora_lambda))
This line is redundant with https://github.com/huggingface/peft/pull/2780/files#diff-1acd6f7d805393f2f46612371a37e1d6511b12b40cc3a5817c2e61524e334a8cR120, right?
I implemented a first version, based on #2760