Add VB-LoRA #2039
Conversation
Thanks a lot for this PR that adds VB-LoRA to PEFT. Your implementation already looks quite advanced, follows the existing examples, and includes the required testing, nice 🎉
In my first review, I focused on the actual VB-LoRA implementation, so I haven't checked the examples and docs yet. As you can see, I added a bunch of comments but those should not be big issues. Please check if they make sense.
Regarding the paper, I also found two typos, in case you want to fix them: "virtrual" and "backpropgatation".
src/peft/tuners/vblora/config.py
Outdated
        )
    },
)
save_topk_weights: bool = field(
save_only_topk_weights would be a more fitting name, right?
Right. I will modify it.
Updated.
init_vector_bank_bound: float = field(
    default=0.02,
    metadata={
        "help": (
            "The vector bank is initialized with a uniform distribution between -init_vector_bank_bound and"
            " init_vector_bank_bound."
        ),
    },
)
init_logits_std: float = field(
    default=0.1,
    metadata={
        "help": (
            "The logits are initialized with a normal distribution with a standard deviation of init_logits_std."
        ),
    },
)
If you have any tips for how users should initialize the VB-LoRA-specific parameters (num_vectors, vector_length, topk, init_vector_bank_bound, init_logits_std), that would be great, as users can't rely on their existing intuitions from LoRA. Even a reference to the paper could be useful here.
Agree. I will add some comments here.
Updated.
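For readers of this thread, here is a rough sketch of what the two initialization defaults above (init_vector_bank_bound and init_logits_std) amount to in plain torch terms. The shapes are made up for illustration; this is not the PR's actual code.

```python
import torch

# Illustrative only: shapes are placeholders, not taken from the PR.
num_vectors, vector_length = 90, 256
init_vector_bank_bound, init_logits_std = 0.02, 0.1

# The shared vector bank: uniform in [-init_vector_bank_bound, init_vector_bank_bound].
vector_bank = torch.empty(num_vectors, vector_length).uniform_(
    -init_vector_bank_bound, init_vector_bank_bound
)

# The selection logits: normal with standard deviation init_logits_std
# (one row of logits per sub-vector to be composed).
num_sub_vectors = 48  # placeholder
logits = torch.randn(num_sub_vectors, num_vectors) * init_logits_std
```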
src/peft/tuners/vblora/config.py
Outdated
self.target_modules = (
    set(self.target_modules) if isinstance(self.target_modules, list) else self.target_modules
)
if self.save_topk_weights:
IMO, this warning is not necessary. If users changed this setting, they have certainly read the description. We want to avoid unnecessary warnings as much as possible.
Agree. I will delete it.
Deleted.
src/peft/tuners/vblora/config.py
Outdated
    will not produce the same output as the base model would have without adaptation.
modules_to_save (`List[str]`):
    List of modules apart from Vera layers to be set as trainable and saved in the final checkpoint.
init_weights (`bool`):
AFAICT, the init_weights argument is not used. I think it would be great if it were supported the same way as in LoRA: if set to True (the default), VB-LoRA is initialized such that it performs an identity transform, and vice versa. Not sure if this fits well with the paper; we can also agree on a different default, but having that option would be appreciated.
In our approach, we shouldn't initialize the additive adapter to zero, because if the vector bank is initialized to zero, both matrices A and B will be zero, resulting in a zero gradient. As a result, we can't align with the other methods regarding init_weights=True. Do you have any suggestions in this case?
I followed your first suggestion in the comment:
- I removed init_weights.
- In tests, I set the vector bank to zero to make VB-LoRA an identity operation, and initialized the vector bank whenever training is involved in the test.
src/peft/tuners/vblora/model.py
Outdated
f"Target module {target} is not supported. Currently, only the following modules are supported: " | ||
"`torch.nn.Linear`, `transformers.pytorch_utils.Conv1D`." | ||
) | ||
r = kwargs.pop("r") |
Why is this necessary?
I will delete this line.
Updated.
Thanks for your replies.
Regarding the discussion about init_weights, as it came up multiple times, I'll reply here:
From my understanding, the problem is that the A and B matrices share the same vector bank. Therefore, if the vector bank is initialized to zeros, both A and B are zero and there is no gradient. This is unlike LoRA, where only B starts at zero.
Also, it's not possible to set the logits for B to zero, as a softmax operation is performed, so the weights will always sum up to 1.
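To make the point about the softmax concrete, here is an illustrative sketch of how a single sub-vector could be composed from the shared vector bank via top-k selection plus softmax. Names and shapes are assumptions for illustration, not the PR's actual code.

```python
import torch

num_vectors, vector_length, topk = 90, 256, 2
vector_bank = torch.empty(num_vectors, vector_length).uniform_(-0.02, 0.02)
logits = torch.randn(num_vectors) * 0.1  # logits for one sub-vector

top_vals, top_idx = logits.topk(topk)       # pick the top-k logits
weights = torch.softmax(top_vals, dim=-1)   # weights always sum to 1
sub_vector = (weights.unsqueeze(-1) * vector_bank[top_idx]).sum(dim=0)

# Even with all-zero logits, the softmax yields uniform non-zero weights, so the
# sub-vector only vanishes if the vector bank itself is all zeros.
```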
I have three suggestions:
- For those cases where we want VB-LoRA to be a no-op, we just manually call nn.init.zeros_(model.base_model.vblora_vector_bank["default"]) (see the sketch after this list). The init_weights argument should be removed.
- We add the option to pass init_weights=False (with True being the default) and in that case initialize the vector bank to zeros. This model is untrainable, so we give a warning that this option only exists for testing purposes.
- Add a new parameter that scales the contribution from B, which is a learnable parameter that starts at 0. This way, the vector bank is randomly initialized but VB-LoRA still starts as a no-op. Of course, this would be a deviation from the paper.
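A minimal sketch of the first option, using the attribute path from the suggestion above; the toy model and hyperparameter values are assumptions, and the exact names in the merged code may differ.

```python
import torch
from torch import nn
from peft import VBLoRAConfig, get_peft_model

# Toy setup; hyperparameter values are illustrative.
base = nn.Sequential(nn.Linear(16, 16))
ref_weight = base[0].weight.detach().clone()
ref_bias = base[0].bias.detach().clone()

config = VBLoRAConfig(target_modules=["0"], num_vectors=16, vector_length=8, topk=2)
model = get_peft_model(base, config)

# Zero the shared vector bank: both A and B become zero, so VB-LoRA is a no-op.
# Useful only in tests; re-initialize the bank before training, otherwise the
# gradients stay zero.
nn.init.zeros_(model.base_model.vblora_vector_bank["default"])

x = torch.randn(2, 16)
expected = x @ ref_weight.T + ref_bias  # what the un-adapted layer would compute
assert torch.allclose(model(x), expected, atol=1e-6)
```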
I also have a general comment about the docs: Let's make sure to mention that there is an option for VB-LoRA to reduce the size of the file when saving, but that the loaded adapter can only be used for inference after that. Ideally, you could give a numerical example (e.g. for Llama3 8b with parameter ..., the size is reduced from X to Y).
src/peft/tuners/vblora/layer.py
Outdated
@@ -0,0 +1,263 @@
# Copyright 2023-present the HuggingFace Inc. team.
Let's update the dates in all new files to 2024.
Thank you for your valuable feedback! I have updated the code accordingly, and I also updated the documentation.
Thanks for the updates @leo-yangli. I haven't reviewed yet, but I saw that the CI fails because of the following test:
Could you please check?
Thank you for your feedback. I just fixed the failing test.
It's ready to be reviewed.
Thanks a lot for the updates. This is almost good to be merged, I only saw a few tiny issues left. Please take a look.
"id": "ddfc0610-55f6-4343-a950-125ccf0f45ac", | ||
"metadata": {}, | ||
"source": [ | ||
"In this example, we fine-tune Roberta on a sequence classification task using VB-LoRA." |
Let's add a sentence that this notebook is based on examples/sequence_classification/VeRA.ipynb so that users can compare the results.
Added.
" target_modules=['key', 'value', 'query', 'output.dense', 'intermediate.dense'],\n", | ||
" num_vectors=num_vectors,\n", | ||
" vector_length=vector_length,\n", | ||
" save_only_topk_weights=True,\n", |
Let's add a comment here to explain this argument. Using it also means that we cannot resume training from checkpoints if training is interrupted for some reason, right? Let's also mention that.
Added a comment.
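For context, a hedged sketch of such a configuration; the values are placeholders, and the trade-off in the comments is the one described in this thread.

```python
from peft import VBLoRAConfig

# Placeholder values for illustration.
config = VBLoRAConfig(
    target_modules=["query", "value"],
    num_vectors=90,
    vector_length=256,
    topk=2,
    # Saves only the top-k weights instead of the full logits, which makes the
    # stored adapter much smaller. The resulting checkpoint can be used for
    # inference or merging, but training cannot be resumed from it.
    save_only_topk_weights=True,
)
```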
tests/test_initialization.py
Outdated
with pytest.raises(ValueError, match=msg):
    get_peft_model(model, config1)

config2 = VBLoRAConfig(target_modules=["lin1"], vector_length=vector_length)
Let's split this into 2 tests.
Updated.
src/peft/tuners/vblora/config.py
Outdated
    The length of the vectors in the vector bank. The length of the vectors should be divisible by the hidden
    dimension of the model.
topk (`int`):
    K value for topk selection.
It would be great if you could explain a bit more: What is the default, and when should users decrease or increase this number? We can make the assumption that users are familiar with LoRA and would like to try VB-LoRA; they should not have to read the paper to understand the hyper-parameters.
Added some explanations.
src/peft/tuners/vblora/config.py
Outdated
    the target modules manually.
save_only_topk_weights (`bool`):
    Whether to only save the topk weights. Models saved in this mode can be used for merging or inference only,
    not for resuming training.
Let's mention here that the point is to reduce the file size.
Added.
src/peft/tuners/vblora/layer.py
Outdated
# Check for infinity values when training. If found, training was likely resumed from a `save_only_topk_weights` model.
if self.training and vblora_logits_A[0, 0].isinf().any():
    raise RuntimeError(
        "Found infinity values in logits. Ensure training was not resumed from a `save_only_topk_weights` model."
Let's add: "VB-LoRA logits", otherwise this could be confusing.
Added.
Just something that came to my mind out of curiosity: Did you try experiments with multiple VB-LoRA adapters (for different tasks) that share the same vector bank? For example, train the logits and the vector bank on task 1, then freeze the vector bank. Next, create a new VB-LoRA for task 2 that uses the frozen vector bank from task 1 and only trains the logits? Probably this won't work well enough for task 2, but it would be really cool if it did, to save even more parameters.
Hi @BenjaminBossan, that's a great question! Training in an MTL setting is currently an area of ongoing research for us, and we have tried exactly this idea. Preliminary results indicate that when the two tasks are similar (for example, RTE and MRPC from the GLUE benchmark), the second task can still achieve decent performance with a frozen vector bank, although there is a noticeable drop in performance (e.g., 2%). This performance gap tends to decrease when we alternate training between the tasks. However, a more robust training strategy for an MTL setting still needs to be explored. I hope this answers your question. I have updated the code based on your review. Moreover, I added a new feature (counting VB-LoRA savable parameters) with tests, and the PR is ready for review. Thanks!
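A hypothetical sketch of that setup, assuming the vblora_vector_bank / vblora_logits parameter naming used in this PR; sharing a bank across adapters is not a feature of this PR, and `model` is assumed to be a VB-LoRA PeftModel already trained on task 1.

```python
# Freeze the shared vector bank learned on task 1 and train only the logits
# for task 2 (illustrative only; parameter name patterns are assumptions).
for name, param in model.named_parameters():
    if "vblora_vector_bank" in name:
        param.requires_grad_(False)  # freeze the shared vector bank
    elif "vblora_logits" in name:
        param.requires_grad_(True)   # train only the logits for task 2
```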
Thanks for the latest updates. Also, good idea to add the methods to count the number of savable parameters.
I found a small issue not being fully addressed yet in a few tests, the rest looks good.
Btw, if you just push on top of the PR without rebasing, it makes it much easier for me to review. There is no need for a clean git history, as I'll squash before merging.
Regarding the sharing of the vector bank: Thanks for the info. Once the new research is finished and you have found a nice way to share the vector bank, it would be nice to add that feature here. I think it shouldn't require many adjustments to implement.
assert os.path.exists(save_path / "adapter_config.json")

mlp_vblora_loaded = PeftModel.from_pretrained(mlp, save_path)
This is still not quite resolved. The mlp you used here is the same one you used above when calling mlp_vblora = get_peft_model(mlp, config). This means it is the mutated mlp, as get_peft_model performs in-place operations on it. What I mean is that before calling PeftModel.from_pretrained(mlp, save_path), you should create a fresh instance of mlp. I typically also delete the old instance if it's no longer needed, just to make it clear.
Suggested change:
- mlp_vblora_loaded = PeftModel.from_pretrained(mlp, save_path)
+ del mlp
+ mlp = self.get_mlp()
+ mlp_vblora_loaded = PeftModel.from_pretrained(mlp, save_path)
The same should be done for the tests below.
Thanks for pointing out the mistake. I have fixed it. I've also added a new test.
Thanks for this excellent addition to PEFT. Great PR, well implemented and tested right from the start.
I checked out the code and ran all tests on GPU, and they pass without trouble. Line coverage is also good, so this can be merged!
Thank you @BenjaminBossan! The code quality has improved a lot thanks to your insightful review.
Implements "VB-LoRA: Extreme Parameter Efficient Fine-Tuning with Vector Banks" https://arxiv.org/abs/2405.15179
Hi,
I am one of the authors of "VB-LoRA: Extreme Parameter Efficient Fine-Tuning with Vector Banks". You can find our paper at https://arxiv.org/abs/2405.15179.
This PR integrates our method, VB-LoRA. The implementation and tests follow VeRA and LoRA. I have run pytest locally, and all the tests have passed.
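For readers arriving at this PR, here is a hedged end-to-end sketch of how the tuner is meant to be used. The toy model and hyperparameter values are illustrative; consult the PEFT documentation for the exact API that was merged.

```python
import torch
from torch import nn
from peft import VBLoRAConfig, get_peft_model

# A toy base model; in practice this would typically be a transformers model.
base = nn.Sequential(nn.Linear(128, 128), nn.ReLU(), nn.Linear(128, 128))

config = VBLoRAConfig(
    target_modules=["0", "2"],  # the two Linear layers above
    num_vectors=60,             # size of the shared vector bank
    vector_length=32,           # must evenly divide the layer dimensions (128)
    topk=2,                     # number of bank vectors mixed per sub-vector
)
model = get_peft_model(base, config)
model.print_trainable_parameters()

# Only the shared vector bank and the per-layer logits are trainable.
out = model(torch.randn(4, 128))
```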