[WIP] Add LoRA multihead attention module #1324

BenjaminBossan · 2024-01-05T13:00:56Z

First stab at adding LoRA support for nn.MultiheadAttention. See #761.

Todos:

~~For now, only works with _qkv_same_embed_dim=True -- make it work with False too.~~ _qkv_same_embed_dim=False is out of scope for this PR and can be added in a later PR if needed.
Show that it works in a real world test: See user feedback on the issue.
Unit tests
~~Docs~~ Apart from docstrings, I don't think anything else needs to be added

Update: I now also included the out_proj to apply LoRA to.

This is a simple test that I ran successfully with the PR in its current state:

import open_clip
import requests
import torch
from torch import nn
from peft import LoraConfig, get_peft_model
from PIL import Image
from peft.tuners.lora.layer import MultiheadAttention as PeftMha

model, preprocess = open_clip.create_model_from_pretrained('hf-hub:laion/CLIP-ViT-g-14-laion2B-s12B-b42K')
tokenizer = open_clip.get_tokenizer('hf-hub:laion/CLIP-ViT-g-14-laion2B-s12B-b42K')
peft_model = get_peft_model(model, config)
opt = torch.optim.SGD(peft_model.parameters(), 0.1)
print(len([m for m in peft_model.modules() if isinstance(m, PeftMha)]))  # 64 PEFT MHA layers
peft_model.print_trainable_parameters()  # trainable params: 2,588,672 || all params: 1,055,873,793 || trainable%: 0.24516869508096598

# text encoder
text = tokenizer(["a diagram", "a dog", "a cat"])
text_features = peft_model.encode_text(text)
loss = text_features.sum()
loss.backward()
opt.step()

# image encoder
url = 'http://images.cocodataset.org/val2017/000000039769.jpg'
image = Image.open(requests.get(url, stream=True).raw)
image = preprocess(image).unsqueeze(0)
image_features = model.encode_image(image)
image_features.sum().backward()
opt.step()

For now, only works with _qkv_same_embed_dim=True.

HuggingFaceDocBuilderDev · 2024-01-05T13:04:34Z

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

younesbelkada

Nice work! I left a few preliminary comments. I think we can go with the _restore_weights approach for now, as I don't see any other alternative.

younesbelkada · 2024-01-09T05:59:19Z

src/peft/tuners/lora/layer.py

+        lora_alpha: int = 1,
+        lora_dropout: float = 0.0,
+        fan_in_fan_out: bool = False,  # Set this to True if the layer to replace stores weight like (fan_in, fan_out)
+        is_target_conv_1d_layer: bool = False,


Suggested change

is_target_conv_1d_layer: bool = False,

I don't think this is used?

younesbelkada · 2024-01-09T05:59:28Z

src/peft/tuners/lora/layer.py

+
+        self._active_adapter = adapter_name
+        self.update_layer(adapter_name, r, lora_alpha, lora_dropout, init_lora_weights, use_rslora)
+        self.is_target_conv_1d_layer = is_target_conv_1d_layer


Suggested change

self.is_target_conv_1d_layer = is_target_conv_1d_layer

We can also just hard-code it to False

younesbelkada · 2024-01-09T06:02:00Z

src/peft/tuners/lora/layer.py

+        self._restore_weights()
+        return super().state_dict(*args, **kwargs)
+
+    def named_modules(self, *args, **kwargs):


Do we also need to override the modules() method?

Not needed, as modules calls named_modules under the hood. I added a comment to that effect.
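
The claim that modules() goes through named_modules() can be checked with a small sketch. The class name Tracked and its counter attribute are hypothetical and only serve this illustration; they are not part of the PR.

```python
from torch import nn


class Tracked(nn.Module):
    """Hypothetical module that counts how often named_modules() is called."""

    def __init__(self):
        super().__init__()
        self.lin = nn.Linear(2, 2)
        self.named_modules_calls = 0

    def named_modules(self, *args, **kwargs):
        # Count the call, then defer to the stock implementation.
        self.named_modules_calls += 1
        return super().named_modules(*args, **kwargs)


model = Tracked()
mods = list(model.modules())  # modules() iterates over self.named_modules()
print(model.named_modules_calls, len(mods))
```

Overriding named_modules() alone is therefore enough for any hook that needs to run before module traversal.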

younesbelkada · 2024-01-09T06:04:55Z

src/peft/tuners/lora/model.py

@@ -193,11 +193,6 @@ def _replace_module(self, parent, child_name, new_module, child):
        if hasattr(child, "base_layer"):
            child = child.base_layer

-        if not hasattr(new_module, "base_layer"):


Why has this been removed?

Sorry, forgot to put this into the description of the PR.

These lines have been obsolete for some time now. They only apply when we unload the model (otherwise, the if does not match). Remember that when we made the base_layer switch, we ensured that unloading simply returns the base_layer. There is no longer any need to create a new layer (say, a new nn.Linear when using lora.Linear) and replace the new layer's weight with the parent layer's weight. The base_layer already has the original weight. Therefore, these lines are unnecessary.

I removed them now because they were annoying with MultiheadAttention, because that layer has no weight attribute, so this line would fail.

pacman100

Thank you Benjamin for adding support for torch MHA layer in LoRA, interesting way to use merge, forward and unmerge logic!

BenjaminBossan · 2024-01-10T10:37:41Z

@younesbelkada Have I addressed all your concerns?

I pinged the user who wanted to test it on their case. When it comes to docs, I didn't really find a place where we list all supported layers, so no update needed really.

Before, LoRA was applied only to the in_proj. Now it is also applied to the out_proj.

Unfortunately, there is no easy way to apply a normal lora.Linear to the out_proj by targeting it with target_modules. If that worked, it would be much nicer, because users could decide for themselves whether to apply LoRA to the out_proj. It doesn't work for two reasons:

1. We cannot really control the order in which LoRA is applied, so when the LoRA adapter is injected into out_proj, the whole MHA layer may already be wrapped by lora.MultiheadAttention.
2. Even if we successfully applied a normal lora.Linear to the out_proj, it would not work correctly. This is because the forward method of out_proj is not used at all by nn.MultiheadAttention. Instead, it just passes the weight and bias to F.multi_head_attention_forward. Therefore, we must ensure that the weights are merged and unmerged correctly, same as for in_proj, and we cannot do that if we use a normal lora.Linear.

Note that the test test_merge_layers for MHA fails. This is most likely because of an existing bug in how merging is implemented, see PR huggingface#1355. Once that is merged, the test should pass.
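
The second reason (the forward method of out_proj being bypassed) can be observed with a forward hook. This is a minimal sketch using plain PyTorch only; the tensor sizes are arbitrary and chosen for illustration.

```python
import torch
from torch import nn

mha = nn.MultiheadAttention(embed_dim=16, num_heads=4)

# Forward hooks only fire when a module is invoked via module(...).
calls = []
handle = mha.out_proj.register_forward_hook(lambda mod, inp, out: calls.append(1))

x = torch.randn(5, 2, 16)  # (seq_len, batch, embed_dim), since batch_first defaults to False
out, _ = mha(x, x, x)
handle.remove()

out_proj_forward_calls = len(calls)
print(out_proj_forward_calls)  # the hook never fires: only out_proj.weight/bias are read
```

Because the attention layer reads out_proj.weight and out_proj.bias directly, any wrapper that changes behavior only inside out_proj.forward would have no effect on the output.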

BenjaminBossan · 2024-01-12T16:21:54Z

Note: The test test_merge_layers for MHA fails. This is most likely because of an existing bug in how merging is implemented, see PR #1355. Once that is merged, the test should pass.

ambroser53 · 2024-01-23T15:53:00Z

Just want to bump a bunch of the issues I've mentioned in #761, specifically the problem with requires_grad that is reproducible in this repo.

bghira · 2024-02-26T14:58:32Z

just wanted to bump this one because it's really the only way for tuning CLIP models after they are released.

BenjaminBossan · 2024-02-26T15:45:54Z

@bghira Do you happen to have a use case where you could test if this PR works and is working well enough speed-wise? I think the implementation could be ready to be merged but ideally we'd have someone with a real use case give it a try.

bghira · 2024-02-26T16:57:24Z

i do and i may be able to test it. stupid question but is the code example above complete? i dont see the hinge loss function

BenjaminBossan · 2024-02-26T17:13:09Z

stupid question but is the code example above complete? i dont see the hinge loss function

You mean the code right at the top? No, it's not complete at all, just a quick test to show that MHA is applied and the backward pass does not fail. This is not proper nor complete training code.

bghira · 2024-12-09T11:34:22Z

not stale

coding-kuku · 2024-12-27T12:46:36Z

@BenjaminBossan
Hello,

I tried using the code from this PR, but I found that the batch_first attribute of nn.MultiheadAttention is lost after converting the model to a LoRA model. Below is the minimal code to reproduce the issue:

import torch
import torch.nn as nn
from copy import deepcopy
from peft import LoraConfig, get_peft_model

class Net(nn.Module):
    def __init__(self):
        super().__init__()
        self.mha = nn.MultiheadAttention(1024, 8)

net = Net()

lora_config = LoraConfig(inference_mode=False,
    r=8,
    lora_alpha=32,
    lora_dropout=0.1,
    target_modules=['mha']
)
lora_net = get_peft_model(deepcopy(net), lora_config)

print(hasattr(net.mha, 'batch_first'))
print(hasattr(lora_net.mha, 'batch_first'))

Output:

True
False

As you can see, the original MultiheadAttention module has a batch_first attribute, but after converting the model to LoRA, the attribute is no longer present.

Is this an expected behavior, or is there a way to preserve the batch_first attribute during the conversion?

BenjaminBossan · 2025-01-06T11:40:26Z

@coding-kuku thanks for the question.

Is this an expected behavior, or is there a way to preserve the batch_first attribute during the conversion?

Generally, when PEFT wraps a layer, you can access the original layer via the .base_layer attribute, so in your case lora_net.mha.base_layer. But I also pushed a change to expose the original attributes of the MHA layer, like batch_first, embed_dim, etc., so that this should not be necessary.
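
As an illustration of the general idea of exposing attributes of a wrapped layer, here is a sketch in plain PyTorch. The class ExposingWrapper is hypothetical and is not PEFT's actual implementation; it only shows how read-only properties can forward to a base_layer.

```python
from torch import nn


class ExposingWrapper(nn.Module):
    """Hypothetical wrapper that forwards selected attributes to the wrapped layer."""

    def __init__(self, base_layer: nn.MultiheadAttention):
        super().__init__()
        self.base_layer = base_layer

    @property
    def batch_first(self) -> bool:
        return self.base_layer.batch_first

    @property
    def embed_dim(self) -> int:
        return self.base_layer.embed_dim


wrapped = ExposingWrapper(nn.MultiheadAttention(32, 4))
print(hasattr(wrapped, "batch_first"), wrapped.batch_first, wrapped.embed_dim)
```

Code that inspects attributes such as batch_first on the attention layer can then keep working when it receives the wrapper instead of the original module.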

githubnemo

This looks good in general but I had some questions / comments.

githubnemo · 2025-01-06T13:26:28Z

src/peft/tuners/lora/layer.py

+                    # TODO: work with separate weights
+                    weight_merged = base_layer.in_proj_weight.data.detach() + self.get_delta_weight(active_adapter)
+                    del base_layer.in_proj_weight
+                    base_layer.in_proj_weight = weight_merged


Shouldn't this throw an exception? AFAICS we're assigning a tensor to a parameter value:

foo = torch.nn.Linear(10, 100)
foo.weight = foo.weight.detach()  # raises

What am I missing?

It's true that we change the type here, I guess you could consider this part of the hack to make this work. At the end, through _restore_weights, the correct type is restored.

ah, yes. I missed the del statement which unregisters the parameter and, thus, removes the setattr constraint. WDYT about something along the lines of

# unregister parameter implicitly and overwrite using merged weights; gradients are computed
# after forward and, thus, after unmerging (see forward()), therefore this is safe to do.
del base_layer.in_proj_weight
base_layer.in_proj_weight = orig_weights_in
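
The effect of the del statement discussed in this thread can be reproduced on a plain nn.Linear. This is a sketch using only PyTorch; the "+ 1.0" is a stand-in for adding a LoRA delta and is an assumption made for illustration, not the PR's merge code.

```python
from torch import nn

layer = nn.Linear(4, 3)
original = layer.weight

# Assigning a plain tensor to a registered parameter is rejected by nn.Module.
try:
    layer.weight = original.detach()
    assignment_raised = False
except TypeError:
    assignment_raised = True

# Deleting the attribute unregisters the parameter, so a plain tensor can be assigned.
merged = original.detach() + 1.0  # stand-in for "base weight + delta weight"
del layer.weight
layer.weight = merged
missing_while_swapped = "weight" not in layer.state_dict()

# Restore: remove the plain attribute, then re-register a real nn.Parameter.
del layer.weight
layer.register_parameter("weight", nn.Parameter(original.data, requires_grad=original.requires_grad))
present_after_restore = "weight" in layer.state_dict()

print(assignment_raised, missing_while_swapped, present_after_restore)
```

While the plain tensor is in place, the weight is absent from state_dict() and named_parameters(), which is why the parameter has to be re-registered before those methods are used.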

githubnemo · 2025-01-06T13:58:03Z

src/peft/tuners/lora/layer.py

+        base_layer = self.get_base_layer()
+        weight = base_layer.in_proj_weight
+        del base_layer.in_proj_weight
+        base_layer.register_parameter("in_proj_weight", nn.Parameter(weight.data, requires_grad=weight.requires_grad))
+
+        # out_proj
+        base_layer = base_layer.out_proj.get_base_layer()
+        weight = base_layer.weight
+        del base_layer.weight
+        base_layer.register_parameter("weight", nn.Parameter(weight.data, requires_grad=weight.requires_grad))


This is equivalent to the register_parameter calls in unregister except for the weight content, right? Maybe refactor this into a function for brevity?

Not sure what you're referring to, where is unregister?

Sorry, I meant unmerge :)

I see. There is similar code in unmerge, unload_and_optionally_merge_module, and _restore_weights, true. However, it is not quite identical and we would need two new methods, one for each weight. I think at this point, there is not much gained for refactoring this, WDYT?

Maybe I'm missing something but I don't think you'd need one for each - the whole section is pretty much identical, no?

But it is absolutely not crucial to change this.

def restore_parameters(base_layer, in_proj_weight, in_req_grad, out_proj_weight, out_req_grad):
    del base_layer.in_proj_weight
    base_layer.register_parameter(
        "in_proj_weight", nn.Parameter(in_proj_weight.data, requires_grad=in_req_grad)
    )

    out_proj_base_layer = base_layer.out_proj.get_base_layer()
    del out_proj_base_layer.weight
    out_proj_base_layer.register_parameter(
        "weight",
        nn.Parameter(out_proj_weight.data, requires_grad=out_req_grad),
    )

"""
base_layer = self.get_base_layer()
weight = base_layer.in_proj_weight
del base_layer.in_proj_weight
base_layer.register_parameter("in_proj_weight", nn.Parameter(weight.data, requires_grad=weight.requires_grad))

# out_proj
base_layer = base_layer.out_proj.get_base_layer()
weight = base_layer.weight
del base_layer.weight
base_layer.register_parameter("weight", nn.Parameter(weight.data, requires_grad=weight.requires_grad))
"""
restore_parameters(
    self.get_base_layer(),
    base_layer.in_proj_weight,
    base_layer.in_proj_weight.requires_grad,
    base_layer.weight,
    base_layer.weight.requires_grad,
)

"""
# in_proj
old_weight = base_layer.in_proj_weight.data - self.get_delta_weight(active_adapter)
del base_layer.in_proj_weight
base_layer.register_parameter("in_proj_weight", nn.Parameter(old_weight, requires_grad=False))

# out_proj
old_weight = base_layer.out_proj.base_layer.weight.data - base_layer.out_proj.get_delta_weight(
    active_adapter
)
del base_layer.out_proj.base_layer.weight
base_layer.out_proj.base_layer.register_parameter(
    "weight", nn.Parameter(old_weight, requires_grad=False)
)
"""
restore_parameters(
    base_layer,
    base_layer.in_proj_weight.data - self.get_delta_weight(active_adapter),
    False,
    base_layer.out_proj.base_layer.weight.data - base_layer.out_proj.get_delta_weight(active_adapter),
    False,
)

"""
# extra steps: re-register weights, take care of out_proj layer
# in_proj
weight = base_layer.in_proj_weight
del base_layer.in_proj_weight
base_layer.register_parameter("in_proj_weight", nn.Parameter(weight.data, requires_grad=weight.requires_grad))

# out_proj
out_proj_layer = base_layer.out_proj.get_base_layer()
weight = out_proj_layer.weight
del out_proj_layer.weight
out_proj_layer.register_parameter("weight", nn.Parameter(weight.data, requires_grad=weight.requires_grad))
"""
restore_parameters(
    base_layer,
    base_layer.in_proj_weight,
    base_layer.in_proj_weight.requires_grad,
    base_layer.weight,
    base_layer.weight.requires_grad,
)

I think in this case, I prefer the existing version. In terms of lines of code, there isn't much gained and in the end, we abstract away one del and register_parameter call per parameter. If more steps were involved, I'd agree that a dedicated method would make more sense.

BenjaminBossan

Thanks for the review @githubnemo, I committed your suggestions and replied to your comments.

githubnemo

Thanks for the clarifications, some comments left.

githubnemo · 2025-01-07T15:10:16Z

src/peft/tuners/lora/model.py

+                elif getattr(child, "q_proj_weight", None) is not None:  # MHA
+                    weight = child.q_proj_weight


In the case we support this is never not None, right?

You mean getattr(child, "q_proj_weight", None) is not None can never evaluate to False, thus the else clause below is not needed? I think it would be good to have that fallback, in case we do miss something.

No, I meant that q_proj_weight is always None in our case. (_qkv_same_embed_dim = True)

Ah yes, sorry, you're right. This is there in case we add support for the other mode in the future.
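
The relationship between _qkv_same_embed_dim and q_proj_weight mentioned in this thread can be inspected directly on nn.MultiheadAttention. This is a sketch in plain PyTorch; the dimensions are arbitrary and chosen for illustration.

```python
from torch import nn

# Same embedding dim for query, key and value: one fused in_proj_weight.
mha_same = nn.MultiheadAttention(embed_dim=16, num_heads=4)

# Different key/value dims: separate q/k/v projection weights instead.
mha_diff = nn.MultiheadAttention(embed_dim=16, num_heads=4, kdim=8, vdim=8)

for name, mha in [("same", mha_same), ("diff", mha_diff)]:
    print(
        name,
        mha._qkv_same_embed_dim,
        mha.in_proj_weight is None,
        mha.q_proj_weight is None,
    )
```

In the supported case (_qkv_same_embed_dim=True) the q_proj_weight branch is never taken, which matches the remark that it only matters if the other mode is supported later.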

BenjaminBossan

@githubnemo I added the comment as per your suggestion and replied to your comments, please check.

Not supported yet

For now, only works with _qkv_same_embed_dim=True. --------- Co-authored-by: Wang, Yi <[email protected]> Co-authored-by: keakon <[email protected]> Co-authored-by: Zach Mueller <[email protected]> Co-authored-by: Saeid Ghafouri <[email protected]> Co-authored-by: Fanli Lin <[email protected]> Co-authored-by: githubnemo <[email protected]>

vietvo89 · 2025-06-07T05:21:30Z

Hey @BenjaminBossan , thank you for your great work on developing and adding MultiheadAttention to PEFT package. Your MultiheadAttention worked with CLIP from HuggingFace since k_proj, q_proj, v_proj and fc are nn.Linear. But it seems it is incompatible with CLIP from open_clip_torch package. I followed your code snippet above but I couldn't find the config. Based on the open_clip model architecture like below:

CLIP(
  (visual): VisionTransformer(
    (conv1): Conv2d(3, 1408, kernel_size=(14, 14), stride=(14, 14), bias=False)
    (patch_dropout): Identity()
    (ln_pre): LayerNorm((1408,), eps=1e-05, elementwise_affine=True)
    (transformer): Transformer(
      (resblocks): ModuleList(
        (0-39): 40 x ResidualAttentionBlock(
          (ln_1): LayerNorm((1408,), eps=1e-05, elementwise_affine=True)
          (attn): MultiheadAttention(
            (out_proj): NonDynamicallyQuantizableLinear(in_features=1408, out_features=1408, bias=True)
          )
          (ls_1): Identity()
          (ln_2): LayerNorm((1408,), eps=1e-05, elementwise_affine=True)
          (mlp): Sequential(
            (c_fc): Linear(in_features=1408, out_features=6144, bias=True)
            (gelu): GELU(approximate='none')
            (c_proj): Linear(in_features=6144, out_features=1408, bias=True)
          )
          (ls_2): Identity()
        )
      )
    )
    (ln_post): LayerNorm((1408,), eps=1e-05, elementwise_affine=True)
  )

I made my own config as follows:

lora_config = LoraConfig(
    target_modules=["attn"],
)
peft_model = get_peft_model(model, lora_config)

The error I got is:

ValueError: Target module MultiheadAttention(
  (out_proj): NonDynamicallyQuantizableLinear(in_features=1408, out_features=1408, bias=True)
) is not supported. Currently, only the following modules are supported: `torch.nn.Linear`, `torch.nn.Embedding`, `torch.nn.Conv2d`, `torch.nn.Conv3d`, `transformers.pytorch_utils.Conv1D`.

Questions:

Is my config correct for applying lora for MultiheadAttention? If not, can you please show me the correct config I can use?
If I'm correct, how did your code snippet work, and what should I do to address this error?
Another unexpected error is ImportError: cannot import name 'MultiheadAttention' from 'peft.tuners.lora.layer' (/usr/local/lib/python3.11/dist-packages/peft/tuners/lora/layer.py) when I followed your example from peft.tuners.lora.layer import MultiheadAttention as PeftMha. I installed transformers==4.51.3.

BenjaminBossan · 2025-06-10T14:19:30Z

@vietvo89 Judging from the error message you show, you're using an old PEFT version from before MHA support was added. Could you please upgrade PEFT to the latest version, e.g.: python -m pip install -U peft.

[WIP] Add LoRA multihead attention module

49fab86

For now, only works with _qkv_same_embed_dim=True.

BenjaminBossan mentioned this pull request Jan 5, 2024

fine-tuning OpenClip with Hugingface's PEFT (such as LoRA) #761

Closed

BenjaminBossan added 5 commits January 5, 2024 14:08

Make style

d8e9589

Remove commented code

0e188a3

Remove assignment of weight to new module

b409d81

This is no longer necessary when unloading the model because the base_layer is already the original layer. This is just a leftover from before we adopted the base_layer pattern.

Make state_dict and named_parameters work

173062c

There was a bug because the removal of the parameter resulted in it no longer appearing in the state_dict and named_parameters. This commit fixes this bug. The bug also exists in the referenced lora-torch library.

Extend test coverage a bit

1e007f5

younesbelkada reviewed Jan 9, 2024

BenjaminBossan added 4 commits January 9, 2024 11:49

Clean ups after reviewer feedback:

557c4a1

- Some clarifying comments - Remove fan_in_fan_out Also: - Raise proper error instead of assert

Reviewer feedback: removed another unnecessary arg

add1f51

Make style

e44e030

Merge branch 'main' into feat-add-lora-multihead-attention

8d62579

pacman100 approved these changes Jan 9, 2024

BenjaminBossan added 3 commits February 7, 2024 15:41

Merge branch 'main' into feat-add-lora-multihead-attention

9dc4a4d

Fix bug with incorrectly set gradient

c3fb2ce

Fix failing tests

17d407b

BenjaminBossan added 3 commits February 26, 2024 16:24

Merge branch 'main' into feat-add-lora-multihead-attention

4cbf6e9

Move to pytest style asserts

e0cae11

Fix safe merging code

52c8d9b

BenjaminBossan added 2 commits March 11, 2024 11:48

Merge branch 'main' into feat-add-lora-multihead-attention

977c84b

No need to set bias for MHA anymore, see huggingface#1530

96d376d

BenjaminBossan added 5 commits October 21, 2024 15:42

Fix bug with unloading multihead attention layer

4c31bbc

Fix bug in unloading

1dbb9a5

Params need to be re-registered to appear in state dict.

Fix for low_cpu_mem_usage

e094234

Had to port some accelerate functions to peft and modify them for this to work.

Merge branch 'main' into feat-add-lora-multihead-attention

e90af48

Merge branch 'main' into feat-add-lora-multihead-attention

30a08e7

EngEmmanuel mentioned this pull request Nov 6, 2024

RuntimeError: element 0 of tensors.. OpenCLIP model #2200

Closed

BenjaminBossan added 3 commits November 26, 2024 16:00

Add tests for init_empty_weights

09f5ea6

Merge branch 'main' into feat-add-lora-multihead-attention

6a83bd7

Merge branch 'main' into feat-add-lora-multihead-attention

3b0471a

Add MHA to modules unsupported by EVA

465a85e

BenjaminBossan added 2 commits January 6, 2025 12:06

Add comment on why/how empty init works

266f9da

Expose attributes of underlying MHA module

39e755e

githubnemo reviewed Jan 6, 2025

Apply suggestions from code review

4857858

Co-authored-by: githubnemo <[email protected]>

BenjaminBossan commented Jan 6, 2025

BenjaminBossan added 2 commits January 6, 2025 16:10

Remove trailing whitespace

74cbba6

Linting..

14deb9f

githubnemo reviewed Jan 7, 2025

Reviewer comment: Add comments for clarification

ba2a8dd

BenjaminBossan commented Jan 8, 2025

Reviewer feedback: Remove q_proj_weight

ac10b18

Not supported yet

githubnemo approved these changes Jan 8, 2025

BenjaminBossan merged commit 8d3039b into huggingface:main Jan 8, 2025
13 of 14 checks passed

BenjaminBossan deleted the feat-add-lora-multihead-attention branch January 8, 2025 16:35
