
Conversation

@mwbini commented Sep 10, 2025

I implemented a first version, based on #2760

@BenjaminBossan (Member) left a comment

Thanks a lot for this PR to add DeLoRA to PEFT. The PR is already quite advanced and in a nice state, good job with that.

I did a review of the PR and added a few comments, please check those. But while working on this, I also noticed some more general areas for discussion, which should be addressed first:

Naming

In general, I would suggest renaming variables containing DeLoRA to Delora or DeLora. Not only is that easier to type, it's also consistent with the other names, e.g. LoraConfig.

Initialization

I think a lot of complexity of the implementation comes from the initialization with use_residual_init=True. This is the reason why we need the delora_initial_* parameters and update methods like _load_from_state_dict, right? I wonder if we can think of a way to significantly reduce this complexity.

First of all, I still need to get a good understanding of why this is required in the first place. Do I get it right that normally, initializing DeLoRA cannot be done in a way that makes it an identity transform? So there is no choice of delora_A and delora_B that would make their contribution zero? For me, this would be the ideal solution.

Second, how important is it to have the use_residual_init=True option? Is it better for training or could we just drop it entirely? I know that this would require some adjustments to the unit tests, as some tests assume that the adapter is an identity transform without training. But these adjustments can be done and it is not a requirement for PEFT methods to have an initialization option that comes down to identity.

If this option is not strictly needed for performance reasons but mainly to satisfy tests, I would rather remove it, eliminate a lot of the complexity associated with it, and then update the tests.

Device and dtype handling

There is a lot of code that deals with moving the DeLoRA parameters to the correct device and dtype, such as:

A = self.delora_A[a].to(dtype=comp_dtype, device=device)
B = self.delora_B[a].to(dtype=comp_dtype, device=device)
alpha = self.delora_alpha[a].to(dtype=comp_dtype, device=device)

Would it not make sense to initialize them with the correct device and dtype right from the start (as is normally done for other PEFT methods) instead of casting and moving them each time, which can be costly? Or is this some sort of memory optimization?
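
For illustration, a minimal sketch of the suggested alternative, assuming LoRA-style shapes for delora_A/delora_B (the parameter names follow the PR, everything else is illustrative):

import torch
import torch.nn as nn

# Create the adapter parameters with the base layer's device and dtype up front,
# so no per-forward .to(...) casts are needed.
base_layer = nn.Linear(64, 32)
r = 8
delora_A = nn.Parameter(
    torch.empty(r, base_layer.in_features, device=base_layer.weight.device, dtype=base_layer.weight.dtype)
)
delora_B = nn.Parameter(
    torch.empty(base_layer.out_features, r, device=base_layer.weight.device, dtype=base_layer.weight.dtype)
)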

Other

Those are the big points I found while reviewing and they should be the focus for the next iteration. For the final PR, I would also like to see an example (it could be a copy of an existing example with modifications to use DeLoRA) and a setting for the PEFT method comparison suite.

Finally, please always call make style before pushing your changes.

r (`int`):
    DeLoRA rank.
alpha (`int`):
    The alpha parameter for DeLoRA boundary.

BenjaminBossan (Member):

Just curious: IIUC, this corresponds to lambda in the paper, right? Why was it renamed? Because lambda is a reserved keyword in Python, or to make it more similar to lora_alpha? But this alpha has a different role than alpha has for LoRA.

mwbini (Author):

Yes, I didn't use lambda because it is a reserved keyword. But indeed, using alpha could be confusing. For consistency with the paper, I have changed it to delora_lambda and lambda_ now.

@mwbini (Author) commented Sep 11, 2025

Hi @BenjaminBossan, thank you for the thorough review! I made a few changes following your suggestions. I will address your main concerns below.

Naming: I agree; I have now renamed everything to Delora*.

Initialization: the method needs to be identity-initialized to work properly. To do this while avoiding divisions by zero in the normalization (which could lead to instabilities), the idea is to add and subtract the same initial weights, W + BA - BA, and keep one of the copies frozen (delora_initial_A, delora_initial_B). With use_residual_init=False, the frozen copies are used directly in the forward pass, and merging/unmerging and other operations work similarly to other PEFT methods (on the downside, these initial weights need to be kept on device and require extra computation). With use_residual_init=True instead, W - BA is merged once at the beginning, and the forward pass simply uses the trainable delora_A and delora_B. This saves some computation and avoids keeping delora_initial_A and delora_initial_B on device, but it indeed adds several complications when merging/unmerging etc., requiring the initial BA (offloaded to CPU) to be added back to the base weights W before each such operation (so as to correctly recover the base model). I tried to describe the role of use_residual_init in a bit more detail in the config file.
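
A simplified sketch of the trade-off described above, using plain low-rank updates (the real DeLoRA update also involves normalization and a lambda/r scaling, omitted here):

import torch

W = torch.randn(64, 64)        # frozen base weight
B0 = torch.randn(64, 8)        # frozen initial copies (delora_initial_B / delora_initial_A)
A0 = torch.randn(8, 64)
B, A = B0.clone(), A0.clone()  # trainable delora_B / delora_A, starting from the same values

# use_residual_init=False: the forward pass computes W + BA - B0A0, so the adapter
# contributes nothing at initialization, but B0/A0 stay on device and cost extra compute.
out_false = W + B @ A - B0 @ A0

# use_residual_init=True: B0A0 is subtracted from the base weight once at the start;
# the forward pass then only needs the trainable A and B, but B0A0 (offloaded to CPU)
# must be added back to W before merging/unmerging can recover the true base model.
W_residual = W - B0 @ A0
out_true = W_residual + B @ A  # identical to out_false at initialization

assert torch.allclose(out_false, out_true, atol=1e-5)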

Device and dtype handling: a lot of these casts were indeed redundant/unnecessary. I have simplified the code and removed most of them.

@BenjaminBossan (Member) left a comment

Thanks for the updates. I did a light pass over the changes and added some comments. Some of my previous comments were not addressed yet, so please take a look at them too.

Regarding the initialization, thanks for explaining. So IIUC, we would want to have W' + BA = W_base because it stabilizes training. We cannot set B=0 because of the normalization step. I wonder if there is still another way to get an identity initialization, or at least something close to it, without either saving copies or modifying the base weight, by choosing A and B in a careful way. Perhaps there is a possibility through orthonormal initialization with QR decomposition, but I'm not sure how this could be preserved with the normalization step.

If it's not possible, I think we can live with the extra complexity, but I believe it also means that, for use_residual_init=True, we cannot correctly deal with more than one DeLoRA adapter at a time, is that right? So we would need to implement a check in _check_new_adapter_config that at most one adapter with use_residual_init=True is being used.

For use_residual_init=False, I'd like to see what it entails in terms of overhead. Perhaps we can test it out with the PEFT method comparison suite.

@mwbini (Author) commented Sep 22, 2025

Thank you for the feedback. I should now have addressed most of the comments; please let me know if something is missing.

Regarding the initialization, I am not sure what the best approach would be. Maybe B=0 could still be used by setting a large enough min value in torch.clamp... As a next step, I will try to set up some experiments and check this zero initialization, the overhead of using use_residual_init=False, etc.

Regarding using multiple adapters, it would indeed not be possible with use_residual_init=True. For now, I added a check for this in _check_new_adapter_config.

@BenjaminBossan (Member) left a comment

Thanks for the updates. I found a couple of new issues, but nothing big. Also, please merge with/rebase on the latest PEFT main branch and resolve any merge conflicts that may occur.

> Regarding the initialization, I am not sure what the best approach would be. Maybe B=0 could still be used by setting a large enough min value in torch.clamp... As a next step, I will try to set up some experiments and check this zero initialization, the overhead of using use_residual_init=False, etc.

Great, LMK if you find something.

if adapter_name in self.delora_initial_lambda:
    del self.delora_initial_lambda[adapter_name]

def _move_adapter_to_device_of_base_layer(self, adapter_name: str, device: Optional[torch.device] = None) -> None:

BenjaminBossan (Member):

It seems that we cannot use the _move_adapter_to_device_of_base_layer method from the parent class (BaseTunerLayer). Could you please add a short comment on what is different here?

Comment on lines 235 to 236
if isinstance(base_layer, nn.MultiheadAttention):
    base_layer = base_layer.out_proj

BenjaminBossan (Member):

MHA is not supported by DeLoRA so this can be removed.

Comment on lines 518 to 520
if key.startswith(prefix):
    # Remove the prefix to get the relative key
    relative_key = key[len(prefix) :]

BenjaminBossan (Member):

relative_key = key.removeprefix(prefix) should do the job.

module_state_dict[relative_key] = value

# Manually load ParameterDict and BufferDict parameters
for param_name in ["delora_A", "delora_B", "delora_lambda"]:

BenjaminBossan (Member):

We can use self.adapter_layer_names here, right?

Comment on lines 64 to 70
# TODO: there should be a check if any of the existing adapters actually has bias != "none", or else the check
# does not fully correspond to the error message.
if (len(self.peft_config) > 1) and (config.bias != "none"):
    raise ValueError(
        f"{self.__class__.__name__} supports only 1 adapter with bias. When using multiple adapters, "
        "set bias to 'none' for all adapters."
    )

BenjaminBossan (Member):

Let's replace this with:

Suggested change
- # TODO: there should be a check if any of the existing adapters actually has bias != "none", or else the check
- # does not fully correspond to the error message.
- if (len(self.peft_config) > 1) and (config.bias != "none"):
-     raise ValueError(
-         f"{self.__class__.__name__} supports only 1 adapter with bias. When using multiple adapters, "
-         "set bias to 'none' for all adapters."
-     )
+ super()._check_new_adapter_config(config)

"set use_residual_init to False for all adapters."
)

@staticmethod

BenjaminBossan (Member):

We recently had a PR (#2771) that allows us to greatly reduce the number of methods that need to be implemented here. You can simply remove the following methods:

  • _check_target_module_exists
  • _mark_only_adapters_as_trainable
  • __getattr__
  • get_peft_config_as_dict
  • _set_adapter_layers
  • enable_adapter_layers
  • disable_adapter_layers
  • set_adapter
  • _prepare_adapter_config
  • _unload_and_optionally_merge
  • merge_and_unload
  • unload

On the flip side, you need to add:

    tuner_layer_cls = DeloraLayer
    target_module_mapping = TRANSFORMERS_MODELS_TO_DELORA_TARGET_MODULES_MAPPING

as class attributes, i.e. directly below prefix: str = "delora_".

For this to work, you also need to merge with/rebase on the latest main branch.
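
For reference, a minimal sketch of what the simplified class header could then look like (the attribute values are taken from this comment; the import path and the surrounding class body are assumptions):

from peft.tuners.tuners_utils import BaseTuner

class DeloraModel(BaseTuner):
    # DeloraLayer and the target-module mapping are defined elsewhere in the PR
    prefix: str = "delora_"
    tuner_layer_cls = DeloraLayer
    target_module_mapping = TRANSFORMERS_MODELS_TO_DELORA_TARGET_MODULES_MAPPING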

Comment on lines 1902 to 1925
def test_disable_enable_no_change(self, data):
    base = self.get_model()
    cfg = DeloraConfig(target_modules=["lin0"], r=8, lambda_=8)
    model = get_peft_model(base, cfg)

    y0 = model(data)
    model.disable_adapter_layers()
    y1 = model(data)
    model.enable_adapter_layers()
    y2 = model(data)

    assert torch.allclose(y0, y1, atol=1e-6, rtol=1e-6)
    assert torch.allclose(y0, y2, atol=1e-6, rtol=1e-6)

def test_merge_and_unload_same_outputs(self, data):
    base = self.get_model()
    cfg = DeloraConfig(target_modules=["lin0"], r=8, lambda_=8)
    model = get_peft_model(base, cfg)
    y_before = model(data)

    merged = model.merge_and_unload()
    merged.eval()
    y_after = merged(data)
    assert torch.allclose(y_before, y_after, atol=1e-6, rtol=1e-6)

BenjaminBossan (Member):

I think this should already be well covered by the existing tests and can be removed here.

"set bias to 'none' for all adapters."
)
# if there are multiple adapters, residual init cannot be used
if (len(self.peft_config) > 1) and (config.use_residual_init):

BenjaminBossan (Member):

Hmm, would it work if I first add a DeLoRA adapter with use_residual_init=False and then a DeLoRA adapter with use_residual_init=True? I think it should work but this check would result in an error. IIUC, the correct check is that there cannot be more than one config with use_residual_init=True.
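
Expressed as a standalone sketch (the helper name and error message are hypothetical, not from the PR), the check being described is roughly:

from typing import Iterable

def check_at_most_one_residual_init(configs: Iterable) -> None:
    # `configs` is assumed to already contain the config currently being added
    num_residual = sum(bool(getattr(cfg, "use_residual_init", False)) for cfg in configs)
    if num_residual > 1:
        raise ValueError(
            "At most one DeLoRA adapter with use_residual_init=True can be used at a time."
        )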

@mwbini (Author) commented Oct 12, 2025

Hi @BenjaminBossan, thank you for your review. I have now pushed several changes (the last two commits) and rebased onto the latest version of the code. The main modification is that I removed the residual initialization in favor of the zero initialization of B (making DeLoRA init similar to LoRA).
I tested whether setting a high enough minimum value in torch.clamp in the A/B normalization avoids issues (on both GLUE and MetaMathQA), and indeed no issues appear and performance is very similar to the residual initialization. For this minimum value I tried different choices; since NaNs can still appear in float16 for values below 1e-4, while values in that range lead to no significant performance differences in full precision, I set 1e-4 as the default.
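
An illustrative sketch of that initialization scheme (the normalization is simplified here and the lambda/r scaling is omitted):

import torch

r, fan_in, fan_out = 8, 64, 64
delora_A = torch.empty(r, fan_in)
torch.nn.init.kaiming_uniform_(delora_A, a=5 ** 0.5)
delora_B = torch.zeros(fan_out, r)  # B = 0 -> the initial update is exactly zero

# Clamping the norms with a floor of 1e-4 keeps the normalization from dividing by ~0
# while B is still close to zero; smaller floors produced NaNs in float16 in the experiments above.
a_norms = delora_A.norm(dim=1).clamp(min=1e-4)  # per-row norms of A
b_norms = delora_B.norm(dim=0).clamp(min=1e-4)  # per-column norms of B
delta_w = (delora_B / b_norms) @ (delora_A / a_norms[:, None])  # zero at initialization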

Because of this, and given all the complications it was creating (requiring special versions of functions and limiting usability with multiple adapters), I removed the residual initialization and all related modifications (use_residual_init, delora_initial_A/B, etc.). The method should now be much simpler to handle. Please let me know what you think about this, and whether it would have been better to keep a residual initialization version or whether it's fine to keep it simple.

I also removed the unnecessary methods and tests as you suggested. Since use_residual_init and the related methods are gone, some of the earlier modifications were no longer needed. I also added some configs for the experiments on MetaMathQA.

@HuggingFaceDocBuilderDev

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

@BenjaminBossan (Member) left a comment

Thanks for this last batch of changes.

> The main modification is that I removed the residual initialization in favor of the zero initialization of B (making DeLoRA init similar to LoRA).

Great, thanks for testing this out. Indeed, the code is much simpler now and therefore, I think this is a much nicer solution than what we had before.

> For this minimum value I tried different choices; since NaNs can still appear in float16 for values below 1e-4, while values in that range lead to no significant performance differences in full precision, I set 1e-4 as the default.

Do you foresee different use cases requiring different values here? If yes, you could make it configurable through DeloraConfig with the default being 1e-4.
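
If it does become configurable, a sketch of what the config field could look like (the class and field names here are hypothetical stand-ins, not the PR's DeloraConfig):

from dataclasses import dataclass

@dataclass
class DeloraConfigSketch:
    # floor applied via torch.clamp in the A/B normalization
    min_norm: float = 1e-4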

Apart from that, I only found a very small issue which I commented on. Once you're finished, remember to call make style before pushing.

nn.init.kaiming_uniform_(self.delora_A[adapter_name], a=math.sqrt(5))
nn.init.kaiming_uniform_(self.delora_B[adapter_name], a=math.sqrt(5))

self.delora_lambda[adapter_name].data.fill_(float(delora_lambda))
