FEAT add DeLoRA #2780
Conversation
Thanks a lot for this PR to add DeLoRA to PEFT. The PR is already quite advanced and in a nice state, good job with that.
I did a review of the PR and added a few comments, please check those. But while working on this, I also noticed some more general areas for discussion, which should be addressed first:
Naming
In general, I would suggest renaming variables from `DeLoRA` to `Delora` or `DeLora`. Not only is that easier to type, it's also consistent with the other names, e.g. `LoraConfig`.
Initialization
I think a lot of the complexity of the implementation comes from the initialization with `use_residual_init=True`. This is the reason why we need the `delora_initial_*` parameters and updates to methods like `_load_from_state_dict`, right? I wonder if we can think of a way to significantly reduce this complexity.
First of all, I still need to get a good understanding of why this is required in the first place. Do I get it right that normally, initializing DeLoRA cannot be done in a way that makes it an identity transform? So there is no choice of `delora_A` and `delora_B` that would make their contribution zero? For me, that would be the ideal solution.
Second, how important is it to have the `use_residual_init=True` option? Is it better for training, or could we just drop it entirely? I know that this would require some adjustments to the unit tests, as some tests assume that the adapter is an identity transform without training. But these adjustments can be done, and it is not a requirement for PEFT methods to have an initialization option that comes down to identity.
If this option is not strictly needed for performance reasons but mainly to satisfy tests, I would rather remove it, eliminate a lot of the complexity associated with it, and then update the tests.
Device and dtype handling
There is a lot of code that deals with moving the DeLoRA parameters to the correct device and dtype, such as:
A = self.delora_A[a].to(dtype=comp_dtype, device=device)
B = self.delora_B[a].to(dtype=comp_dtype, device=device)
alpha = self.delora_alpha[a].to(dtype=comp_dtype, device=device)
Would it not make sense to initialize them with the correct device and dtype right from the start (as is normally done for other PEFT methods) instead of casting and moving them each time, which can be costly? Or is this some sort of memory optimization?
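For illustration, here is a minimal sketch of what creating the parameters with the base layer's device and dtype up front could look like; the creation site and variable names are assumptions, not the PR's actual code:

```python
import math

import torch
import torch.nn as nn

# Hypothetical sketch: inside the layer's update_layer(), derive device/dtype
# from the base layer's weight so no per-forward .to(...) casts are needed.
base_weight = torch.randn(16, 32)  # stand-in for self.get_base_layer().weight
r = 8

delora_A = nn.Parameter(
    torch.empty(r, base_weight.shape[1], dtype=base_weight.dtype, device=base_weight.device)
)
delora_B = nn.Parameter(
    torch.empty(base_weight.shape[0], r, dtype=base_weight.dtype, device=base_weight.device)
)
nn.init.kaiming_uniform_(delora_A, a=math.sqrt(5))
nn.init.kaiming_uniform_(delora_B, a=math.sqrt(5))
```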
Other
Those are the big points I found while reviewing, and they should be the focus of the next iteration. For the final PR, I would also like to see an example (it could be a copy of an existing example, modified to use DeLoRA) and a setting for the PEFT method comparison suite.
Finally, please always call `make style` before pushing your changes.
src/peft/tuners/delora/config.py
Outdated
r (`int`):
    DeLoRA rank.
alpha (`int`):
    The alpha parameter for DeLoRA boundary.
Just curious: IIUC, this corresponds to lambda in the paper, right? Why was it renamed? Because the keyword is protected in Python, or to make it more similar to `lora_alpha`? But this alpha has a different role than alpha has for LoRA.
Yes, I didn't use lambda because it is protected. But indeed, using alpha could be confusing. For consistency with the paper, I have changed it to `delora_lambda` and `lambda_` now.
Hi @BenjaminBossan, thank you for the thorough review! I made a few changes following your suggestions and will address your main concerns below.
- Naming: I agree, I have now renamed everything to `Delora`.
- Initialization: the method requires an identity initialization to work properly. To achieve this while avoiding divisions by zero in the normalization (which could lead to instabilities), the idea is to add and subtract the same initial weights, W + BA - BA, and keep one of the copies frozen (the `delora_initial_*` parameters).
- Device and dtype handling: a lot of these casts were indeed redundant/unnecessary. I simplified the code and removed most of them.
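To make the residual-initialization idea above concrete, here is a minimal, self-contained sketch. It is not the PR's actual code: the normalization is only an illustrative stand-in for DeLoRA's full update, which also involves the lambda boundary and the base-weight norm.

```python
import torch

torch.manual_seed(0)
out_f, in_f, r = 8, 6, 4

W = torch.randn(out_f, in_f)   # frozen base weight
A = torch.randn(r, in_f)       # trainable
B = torch.randn(out_f, r)      # trainable; cannot simply be zero because of the normalization
A0, B0 = A.clone(), B.clone()  # frozen copies, i.e. the delora_initial_* parameters

def normalized_delta(B, A):
    # Illustrative stand-in: each rank-one term b_i a_i^T is divided by ||b_i|| * ||a_i||,
    # which is why B = 0 would make the denominator zero without the clamp.
    scale = (B.norm(dim=0) * A.norm(dim=1)).clamp(min=1e-4)
    return (B / scale) @ A

# Residual init: W + BA - B0A0 equals W at initialization, so the adapter
# starts out as an identity transform without requiring B = 0.
W_eff = W + normalized_delta(B, A) - normalized_delta(B0, A0)
assert torch.allclose(W_eff, W)
```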
Thanks for the updates. I did a light pass over the changes and added some comments. Some of my previous comments were not addressed yet, so please take a look at them too.
Regarding the initialization, thanks for explaining. So IIUC, we would want to have W' + BA = W_base because it stabilizes training. We cannot set B=0 because of the normalization step. I wonder if there is still another way to get an identity initialization, or at least something close to it, without either saving copies or modifying the base weight, but choosing A and B in a careful way. Perhaps there is a possibility through orthonormal initialization with QR decomposition, but I'm not sure how this could be preserved with the normalization step.
If it's not possible, I think we can live with the extra complexity, but I believe it also means that, for `use_residual_init=True`, we cannot correctly deal with more than one DeLoRA adapter at a time, is that right? So we would need to implement a check in `_check_new_adapter_config` that at most one adapter with `use_residual_init=True` is being used.
For `use_residual_init=False`, I'd like to see what it entails in terms of overhead. Perhaps we can test it out with the PEFT method comparison suite.
Thank you for the feedback. I should now have addressed most of the comments; please let me know if something is missing. Regarding the initialization, I am not sure what the best approach would be. Maybe B=0 could still be used by setting a large enough min value in `torch.clamp`. As a next step, I will set up some experiments to check this zero initialization, the overhead of using `use_residual_init=False`, etc. Regarding multiple adapters, it would indeed not be possible with `use_residual_init=True`. For now, I have added a check for this in `_check_new_adapter_config`.
Thanks for the updates. I found a couple of new issues, but nothing big. Also, please merge with/rebase on the latest PEFT main branch and resolve any merge conflicts that may occur.
> Regarding the initialization I am not sure what could be the best approach. Maybe B=0 could still be used by setting a large enough min value in torch.clamp... As a next step, I will try to setup some experiments and check this zero initialization, the overhead of using use_residual_init=False, etc.
Great, LMK if you find something.
src/peft/tuners/delora/layer.py
Outdated
if adapter_name in self.delora_initial_lambda:
    del self.delora_initial_lambda[adapter_name]

def _move_adapter_to_device_of_base_layer(self, adapter_name: str, device: Optional[torch.device] = None) -> None:
It seems that we cannot use the `_move_adapter_to_device_of_base_layer` method from the parent class (`BaseTunerLayer`). Could you please add a short comment on what is different here?
src/peft/tuners/delora/layer.py
Outdated
if isinstance(base_layer, nn.MultiheadAttention):
    base_layer = base_layer.out_proj
MHA is not supported by DeLoRA so this can be removed.
src/peft/tuners/delora/layer.py
Outdated
if key.startswith(prefix):
    # Remove the prefix to get the relative key
    relative_key = key[len(prefix) :]
`relative_key = key.removeprefix(prefix)` should do the job.
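For reference, a quick illustration of the suggested built-in (`str.removeprefix` is available since Python 3.9); inside the `startswith` branch shown above it behaves the same as the slicing:

```python
key = "delora_A.default"  # hypothetical state-dict key
prefix = "delora_"

assert key.removeprefix(prefix) == key[len(prefix):] == "A.default"
# Unlike unconditional slicing, removeprefix leaves the string untouched if the prefix is absent:
assert "other_key".removeprefix(prefix) == "other_key"
```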
src/peft/tuners/delora/layer.py
Outdated
module_state_dict[relative_key] = value

# Manually load ParameterDict and BufferDict parameters
for param_name in ["delora_A", "delora_B", "delora_lambda"]:
We can use `self.adapter_layer_names` here, right?
src/peft/tuners/delora/model.py
Outdated
# TODO: there should be a check if any of the existing adapters actually has bias != "none", or else the check
# does not fully correspond to the error message.
if (len(self.peft_config) > 1) and (config.bias != "none"):
    raise ValueError(
        f"{self.__class__.__name__} supports only 1 adapter with bias. When using multiple adapters, "
        "set bias to 'none' for all adapters."
    )
Let's replace this with:
# TODO: there should be a check if any of the existing adapters actually has bias != "none", or else the check
# does not fully correspond to the error message.
if (len(self.peft_config) > 1) and (config.bias != "none"):
    raise ValueError(
        f"{self.__class__.__name__} supports only 1 adapter with bias. When using multiple adapters, "
        "set bias to 'none' for all adapters."
    )
super()._check_new_adapter_config(config)
src/peft/tuners/delora/model.py
Outdated
"set use_residual_init to False for all adapters." | ||
) | ||
|
||
@staticmethod |
We recently had a PR (#2771) that allows us to greatly reduce the number of methods required to implement here. You can simply remove the following methods:
_check_target_module_exists
_mark_only_adapters_as_trainable
__getattr__
get_peft_config_as_dict
_set_adapter_layers
enable_adapter_layers
disable_adapter_layers
set_adapter
_prepare_adapter_config
_unload_and_optionally_merge
merge_and_unload
unload
On the flip side, you need to add:
tuner_layer_cls = DeloraLayer
target_module_mapping = TRANSFORMERS_MODELS_TO_DELORA_TARGET_MODULES_MAPPING
as class attributes, i.e. directly below `prefix: str = "delora_"`.
For this to work, you also need to merge with/rebase on the latest main branch.
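For orientation, here is a hedged sketch of what the slimmed-down model class could look like after that change. The import locations and the body of `_check_new_adapter_config` are assumptions, not the PR's actual code:

```python
# Hypothetical sketch only. BaseTuner lives in peft.tuners.tuners_utils; the
# DeLoRA layer class and the target-module mapping would be defined by this PR,
# and their exact locations are assumed here.
from peft.tuners.tuners_utils import BaseTuner

from .constants import TRANSFORMERS_MODELS_TO_DELORA_TARGET_MODULES_MAPPING  # assumed location
from .layer import DeloraLayer


class DeloraModel(BaseTuner):
    prefix: str = "delora_"
    # New class attributes replacing the removed boilerplate methods:
    tuner_layer_cls = DeloraLayer
    target_module_mapping = TRANSFORMERS_MODELS_TO_DELORA_TARGET_MODULES_MAPPING

    def _check_new_adapter_config(self, config) -> None:
        # Only DeLoRA-specific validation remains here; generic plumbing such as
        # _mark_only_adapters_as_trainable, set_adapter, merge_and_unload, etc.
        # is inherited from BaseTuner after #2771.
        super()._check_new_adapter_config(config)
```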
tests/test_initialization.py
Outdated
def test_disable_enable_no_change(self, data):
    base = self.get_model()
    cfg = DeloraConfig(target_modules=["lin0"], r=8, lambda_=8)
    model = get_peft_model(base, cfg)

    y0 = model(data)
    model.disable_adapter_layers()
    y1 = model(data)
    model.enable_adapter_layers()
    y2 = model(data)

    assert torch.allclose(y0, y1, atol=1e-6, rtol=1e-6)
    assert torch.allclose(y0, y2, atol=1e-6, rtol=1e-6)

def test_merge_and_unload_same_outputs(self, data):
    base = self.get_model()
    cfg = DeloraConfig(target_modules=["lin0"], r=8, lambda_=8)
    model = get_peft_model(base, cfg)
    y_before = model(data)

    merged = model.merge_and_unload()
    merged.eval()
    y_after = merged(data)
    assert torch.allclose(y_before, y_after, atol=1e-6, rtol=1e-6)
I think this should already be well covered by the existing tests and can be removed here.
src/peft/tuners/delora/model.py
Outdated
"set bias to 'none' for all adapters." | ||
) | ||
# if there are multiple adapters, residual init cannot be used | ||
if (len(self.peft_config) > 1) and (config.use_residual_init): |
Hmm, would it work if I first add a DeLoRA adapter with `use_residual_init=False` and then a DeLoRA adapter with `use_residual_init=True`? I think it should work, but this check would result in an error. IIUC, the correct check is that there cannot be more than one config with `use_residual_init=True`.
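A minimal sketch of the stricter check described above, written as a hypothetical standalone helper for illustration (in the PR the logic would live inside `_check_new_adapter_config`, and the helper name is an assumption):

```python
def check_at_most_one_residual_init(peft_config: dict) -> None:
    # Count configs that request the residual initialization and only complain
    # if more than one of them does, instead of rejecting any second adapter.
    num_residual_init = sum(
        getattr(conf, "use_residual_init", False) for conf in peft_config.values()
    )
    if num_residual_init > 1:
        raise ValueError(
            "DeloraModel supports at most one adapter with use_residual_init=True. "
            "Set use_residual_init to False for the other adapters."
        )
```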
Hi @BenjaminBossan, thank you for your review. I have now pushed several changes (the last two commits) and rebased onto the latest version of the code. The main modification is that I removed the residual initialization in favor of zero initialization of B (making the DeLoRA init similar to LoRA's). Given all the complications the residual initialization was creating (requiring special versions of functions and limiting usability with multiple adapters), I removed it together with all related modifications (`use_residual_init`, `delora_initial_A/B`, etc.), so the method should now be much simpler to handle. Please let me know what you think about this, i.e. whether it would be better to keep a residual initialization option or whether it is fine to keep it simple. I also removed unnecessary methods and tests as you suggested; since `use_residual_init` and its related methods are gone, some of those modifications were no longer needed. I also added some configs for experiments on MetaMathQA.
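A small, self-contained illustration of the zero initialization described above (not the PR's code; the normalization is a simplified stand-in for DeLoRA's update): with B = 0 and the norms clamped from below, the initial update is exactly zero, so the adapted layer starts out identical to the base layer, just like LoRA.

```python
import torch
import torch.nn as nn

out_f, in_f, r = 8, 6, 4
A = nn.Parameter(torch.randn(r, in_f))
B = nn.Parameter(torch.zeros(out_f, r))  # zero init, as with lora_B

# Clamping the per-rank norms avoids the division by zero that previously
# motivated the residual initialization.
scale = (B.norm(dim=0) * A.norm(dim=1)).clamp(min=1e-4)
delta = (B / scale) @ A
assert torch.count_nonzero(delta) == 0  # the adapter contributes nothing at init
```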
The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.
Thanks for this last batch of changes.
> The main modification is that I removed the residual initialization in favor of the zero initialization of B (making DeLoRA init similar to LoRA).
Great, thanks for testing this out. Indeed, the code is much simpler now and therefore, I think this is a much nicer solution than what we had before.
> For this min value, I tried different choices; since NaNs can still appear in float16 for values below 1e-4, while leading to no significant performance differences in full precision, I set 1e-4 as the default value.
Do you foresee different use cases requiring different values here? If yes, you could make it configurable through `DeloraConfig`, with the default being 1e-4.
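For illustration, a hedged sketch of how this could be exposed through the config; the field name `norm_eps` and its usage in the layer are assumptions, not existing parameters:

```python
from dataclasses import dataclass, field

# Hypothetical sketch of the config field only; the real DeloraConfig derives
# from PeftConfig and has more fields.
@dataclass
class DeloraConfigSketch:
    norm_eps: float = field(
        default=1e-4,
        metadata={
            "help": "Lower bound used when clamping the norms in the DeLoRA update; "
            "values below 1e-4 can produce NaNs in float16."
        },
    )

# In the layer, the configured value would then replace the hard-coded minimum,
# e.g. scale = scale.clamp(min=norm_eps) -- the exact call site is an assumption.
```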
Apart from that, I only found a very small issue, which I commented on. Once you're finished, remember to call `make style` before pushing.
nn.init.kaiming_uniform_(self.delora_A[adapter_name], a=math.sqrt(5))
nn.init.kaiming_uniform_(self.delora_B[adapter_name], a=math.sqrt(5))

self.delora_lambda[adapter_name].data.fill_(float(delora_lambda))
This line is redundant with https://github.com/huggingface/peft/pull/2780/files#diff-1acd6f7d805393f2f46612371a37e1d6511b12b40cc3a5817c2e61524e334a8cR120, right?
I implemented a first version, based on #2760