Standalone Custom Tokens Tuner and integrated into LoRA (#2376)
This change is based on the nifty addition of @marcusinthesky from #1541.
When adding tokens or fine-tuning the representation of specific tokens, we currently have little choice but to retrain the whole embedding matrix, which can be huge and adds to the memory footprint (in RAM but also on disk). This method creates a sparse matrix of shape (n, embed_dim), where n is the number of tokens to be customized, and only trains these few values.
This change introduces two ways of using it:
```python
peft_config = TrainableTokensConfig(target_modules=['embed_tokens'], token_indices=[0, 1, 2])
peft_model = get_peft_model(model, peft_config)
```
and with LoRA
```python
peft_config = LoraConfig(
target_modules='all-linear',
trainable_token_indices={'embed_tokens': [0, 1, 2]},
)
peft_model = get_peft_model(model, peft_config)
```
Adding this feature to adapters other than LoRA should be relatively easy, mostly adding the `trainable_token_indices` config option and some debugging.
To make this change it was necessary to modify the `modules_to_save` infrastructure, since combining this feature with LoRA works quite similarly. This refactoring entailed moving most of the basic functionality of `ModulesToSave` to the `AuxiliaryTrainingWrapper` class. It also changes the logic for how `modules_to_save` is loaded/saved from the state dict, so there could still be bugs here.
This implementation does not yet support weight-tied layers; that will follow in a future change.
---
Notable commits in this squash:
* Use unload_and_optionally_merge_module protocol
With `AuxiliaryTrainingWrapper` as the abstraction, it is probably a good idea to
support `unload_and_optionally_merge_module`.
Since the wrapper is more akin to a PEFT layer than a model, the name's semantics
are fine and it does basically the same job.
* trainable tokens is also trained in certain adapters
Before, the assumption was that `modules_to_save` was the only thing that
is trained alongside an adapter's parameters. Now there are also the
token adapter's delta tokens via `NewTokensWrapper`.
* Remove old modules_to_save handling
This is now all handled via the `AuxiliaryTrainingWrapper`.
* Fix modules_to_save module overwriting
The state dict implementation of ModulesToSaveWrapper was incorrect in that
it did not include its own parameters, just the parameters it needs to overwrite
in the end. I.e., if layer `lin1` is wrapped with modules_to_save,
`lin1.{weight,bias}` is saved and overwritten but `lin1.modules_to_save.<adapter_name>.[...]`
is not saved.
* Introduce a load key map for aux. train wrapper
Before this change it was only possible to remove a key prefix from the wrapper's
state dict (e.g., `modules_to_save.default.weight` -> `weight`); now it is possible
to restore such a reduced key by mapping it back
(i.e., `weight` -> `modules_to_save.default.weight`).
* Replace sparse matrix with dense + index_copy
This change is mostly because sparse matrices are not that beneficial in this case
(at least not from what we can see right now) and they do not solve the problem
of having to change the new tokens in-place to avoid outdated deltas when new token
vectors are initialized randomly after loading the deltas.
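For illustration, here is a minimal sketch of the dense-delta-plus-`index_copy` idea described above. The class and attribute names are made up for this example and do not reflect the actual PEFT implementation:

```python
import torch
import torch.nn as nn

class TokenDeltaSketch(nn.Module):
    """Illustrative only: override a handful of embedding rows with trained deltas."""

    def __init__(self, embedding: nn.Embedding, token_indices: list[int]):
        super().__init__()
        self.embedding = embedding  # frozen base embedding
        self.register_buffer("token_indices", torch.tensor(token_indices, dtype=torch.long))
        # dense (n, embed_dim) matrix holding only the customized token vectors
        self.token_deltas = nn.Parameter(embedding.weight[self.token_indices].detach().clone())

    def forward(self, input_ids: torch.Tensor) -> torch.Tensor:
        # copy the trained rows over the frozen weight, then do a regular lookup
        weight = self.embedding.weight.detach()
        weight = weight.index_copy(0, self.token_indices, self.token_deltas)
        return nn.functional.embedding(input_ids, weight)
```

Training then updates only the `(n, embed_dim)` delta parameter while the base embedding matrix stays frozen.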
* Make peft_config.layers_to_transform optional
Before this change the base tuner class was forcing this attribute
to be present on the config class even though the attribute is not
specified in the base config.
* Implement missing key logic in `_set_trainable`
Before this, it was not checked whether the module targeted by `modules_to_save` or `trainable_token_indices` existed
(when used in conjunction with a PEFT method). Now an error message similar to the `inject_adapter`
error is raised when no matching module is found.
---------
Co-authored-by: Marcus Gawronsky <[email protected]>
Co-authored-by: Benjamin Bossan <[email protected]>
[PiSSA](https://arxiv.org/abs/2404.02948) initializes the LoRA adapter using the principal singular values and singular vectors. This straightforward modification allows PiSSA to converge more rapidly than LoRA and ultimately attain superior performance. Moreover, PiSSA reduces the quantization error compared to QLoRA, leading to further enhancements.
Configure the initialization method to "pissa", which may take several minutes to execute SVD on the pre-trained model:
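```python
lora_config = LoraConfig(init_lora_weights="pissa", ...)
```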
Alternatively, execute fast SVD, which takes only a few seconds. The number of iterations determines the trade-off between the error and computation time:
```python
lora_config = LoraConfig(init_lora_weights="pissa_niter_[number of iters]", ...)
```
For detailed instruction on using PiSSA, please follow [these instructions](https://github.com/huggingface/peft/tree/main/examples/pissa_finetuning).
### CorDA
[CorDA](https://arxiv.org/pdf/2406.05223) builds task-aware LoRA adapters from weight decomposition oriented by the context of downstream task to learn (instruction-previewed mode, IPM) or world knowledge to maintain (knowledge-preserved mode, KPM).
The KPM not only achieves better performance than LoRA on fine-tuning tasks, but also mitigates the catastrophic forgetting of pre-trained world knowledge.
When preserving pre-trained knowledge is not a concern, the IPM is favored because it can further accelerate convergence and enhance the fine-tuning performance.
You need to configure the initialization method to "corda", and specify the mode of IPM or KPM and the dataset to collect covariance matrices.
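A rough sketch of how this can be wired up, modeled on the CorDA example in the PEFT repository (the import paths, config fields, and calibration function below should be treated as assumptions; consult the CorDA example for the authoritative version):

```python
import torch
from peft import LoraConfig, get_peft_model
# NOTE: these import locations are assumptions based on the CorDA example
from peft.tuners.lora.config import CordaConfig
from peft.tuners.lora.corda import preprocess_corda

@torch.no_grad()
def run_model():
    # run a few batches of the calibration dataset to collect covariance matrices
    for batch in calib_dataloader:
        model(**batch)

corda_config = CordaConfig(corda_method="kpm")  # or "ipm"
lora_config = LoraConfig(init_lora_weights="corda", corda_config=corda_config)
preprocess_corda(model, lora_config, run_model=run_model)
peft_model = get_peft_model(model, lora_config)
```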
### DoRA
DoRA is optimized (computes faster and takes less memory) for models in evaluation mode, or when dropout is set to 0. We reuse the DoRA fine-tuning example script and run it with `CUDA_VISIBLE_DEVICES=0 time python examples/dora_finetuning/dora_finetuning.py --quantize --lora_dropout 0 --batch_size 16 --eval_step 2 --use_dora` on a 4090 with gradient accumulation set to 2 and max steps set to 20 to benchmark these optimizations.
#### Caveats
- DoRA only supports embedding, linear, and Conv2d layers at the moment.
- DoRA introduces a bigger overhead than pure LoRA, so it is recommended to merge weights for inference, see [`LoraModel.merge_and_unload`].
- DoRA should work with weights quantized with bitsandbytes ("QDoRA"). However, issues have been reported when using QDoRA with DeepSpeed Zero2.
### QLoRA-style training
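QLoRA-style training applies LoRA to all linear layers of the model rather than only the attention projections. With PEFT this is typically expressed through the `target_modules` option, as in the sketch below (the `...` stands for the rest of your LoRA configuration):

```python
# apply LoRA to every linear layer, as QLoRA does
config = LoraConfig(target_modules="all-linear", ...)
```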
## Efficiently train tokens alongside LoRA
Sometimes it is necessary to not only change some layer's weights but to add new tokens as well. With larger models this can be a memory-costly endeavour. PEFT LoRA adapters support the `trainable_token_indices` parameter, which allows tuning of selected tokens alongside fine-tuning of specific layers with LoRA. This method only trains the tokens you specify and leaves all other tokens untouched. This saves memory and, in contrast to training the whole embedding matrix, doesn't throw away the learned context of existing token embeddings. Under the hood this method uses the layers of [`TrainableTokensModel`].
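For example, a combined LoRA plus trainable tokens setup might look like the following sketch (the model checkpoint and the added tokens are placeholders, and the embedding layer is assumed to be named `embed_tokens`):

```python
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import LoraConfig, get_peft_model

model_id = "Qwen/Qwen2.5-0.5B"  # placeholder; any causal LM works
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id)

# add the new tokens and resize the embedding matrix accordingly
new_tokens = ["<|user|>", "<|assistant|>"]
tokenizer.add_tokens(new_tokens)
model.resize_token_embeddings(len(tokenizer))

lora_config = LoraConfig(
    target_modules="all-linear",
    # only the embedding rows of the newly added tokens are trained
    trainable_token_indices={"embed_tokens": tokenizer.convert_tokens_to_ids(new_tokens)},
)
peft_model = get_peft_model(model, lora_config)
peft_model.print_trainable_parameters()
```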
The token weights are part of your adapter state dict and saved alongside the LoRA weights. If we had used full fine-tuning with `modules_to_save=['embed_tokens']`, we would have stored the full embedding matrix in the checkpoint, leading to a much bigger file.
## Merge LoRA weights into the base model
While LoRA is significantly smaller and faster to train, you may encounter latency issues during inference due to separately loading the base model and the LoRA adapter. To eliminate latency, use the [`~LoraModel.merge_and_unload`] function to merge the adapter weights with the base model. This allows you to use the newly merged model as a standalone model. The [`~LoraModel.merge_and_unload`] function doesn't keep the adapter weights in memory.
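In its simplest form this is a single call on the trained PEFT model (a brief sketch; `peft_model` stands for any LoRA-wrapped model such as the ones created above):

```python
# merge the LoRA weights into the base weights and get back a plain transformers model
merged_model = peft_model.merge_and_unload()
merged_model.save_pretrained("path/to/merged-model")  # optionally persist the merged model
```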
Note that the order does not matter here, i.e. the samples in the batch don't need to be grouped by adapter as in the example above. We just need to ensure that the `adapter_names` argument is aligned correctly with the samples.
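As a sketch of what that alignment means in practice (the adapter names and inputs below are illustrative; the special name `"__base__"` selects the base model without any adapter):

```python
# one adapter name per sample, in the same order as the samples in the batch
output = peft_model(**inputs, adapter_names=["adapter_a", "adapter_a", "__base__", "adapter_b"])
```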
Additionally, the same approach also works with the `modules_to_save` feature, which allows for saving and reusing specific neural network layers, such as custom heads for classification tasks, across different LoRA adapters.
The method only targets specific tokens and selectively trains the token indices you specify. Consequently, the required RAM will be lower, and the disk footprint is also significantly smaller than storing the fully fine-tuned embedding matrix.
Some preliminary benchmarks acquired with [this script](https://github.com/huggingface/peft/blob/main/scripts/train_memory.py) suggest that for `gemma-2-2b` (which has a rather large embedding matrix) you can save 4.8GiB VRAM with Trainable Tokens over fully fine-tuning the embedding matrix. While LoRA will use even less memory (-6.3GiB total over fine-tuning), it might also target tokens you don't want to be changed. With less extreme embedding matrices the difference may also come out smaller.
Note that this method does not add tokens for you; you have to add the new tokens to the tokenizer yourself and resize the model's embedding matrix accordingly. This method will only re-train the embeddings for the tokens you specify.
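For example, a minimal standalone setup (without LoRA) might look like the following sketch; the model checkpoint and the added token are placeholders, and `TrainableTokensConfig` is assumed to be importable from the top-level `peft` namespace:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import TrainableTokensConfig, get_peft_model

model_id = "Qwen/Qwen2.5-0.5B"  # placeholder
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id)

# you have to add the tokens and resize the embedding matrix yourself
new_tokens = ["<|copyright|>"]
tokenizer.add_tokens(new_tokens)
model.resize_token_embeddings(len(tokenizer))

peft_config = TrainableTokensConfig(
    target_modules=["embed_tokens"],
    token_indices=tokenizer.convert_tokens_to_ids(new_tokens),
)
peft_model = get_peft_model(model, peft_config)
```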
This method can also be used in conjunction with LoRA layers! See [the LoRA developer guide](../developer_guides/lora#efficiently-train-tokens-alongside-lora).