LoRA and Transformers TP #3079

Open
michaelbenayoun wants to merge 28 commits into huggingface:main from michaelbenayoun:lora_and_tp

Conversation

@michaelbenayoun
Member

@michaelbenayoun michaelbenayoun commented Mar 3, 2026

The goal of this PR is to integrate Transformers' API for Tensor Parallelism into PEFT, starting with LoRA.

As #3044 pointed out, there are issues.

First, the code used there fails. This is because, to create adapters, PEFT looks at the parent module's attributes rather than the actual weights.

For instance, for LoRA it checks the in_features and out_features attributes of torch.nn.Linear instead of the weight's shape, which fails under TP because the two no longer match. I addressed this issue here: huggingface/transformers#44421.

On top of that, we need to handle a few things to make it work. There are two cases:

  • Column linears: the output is sharded, so our adapters should also produce sharded outputs. lora_A should be a regular non-sharded linear, and lora_B should be a column linear, just like the base layer.
  • Row linears: the input comes in sharded, and we should produce an un-sharded output. lora_A should be a row linear, and lora_B a regular non-sharded linear.
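
The column-linear case can be sanity-checked with plain tensors. In this minimal sketch (the shapes and names are illustrative, not the PR's actual code), the per-rank shards are simulated by slicing: sharding lora_B along its output dimension while keeping lora_A replicated yields exactly the output shard a column linear produces on each rank.

```python
import torch

torch.manual_seed(0)
in_f, out_f, r, world = 8, 6, 2, 2

x = torch.randn(4, in_f)
W = torch.randn(out_f, in_f)  # base column linear: weight sharded along dim 0
A = torch.randn(r, in_f)      # lora_A: regular linear, replicated on every rank
B = torch.randn(out_f, r)     # lora_B: column linear, sharded along dim 0

reference = x @ (W + B @ A).T  # un-sharded output of base + LoRA

# simulate each rank holding a row-shard of W and B
shards = []
for rank in range(world):
    rows = slice(rank * out_f // world, (rank + 1) * out_f // world)
    shards.append(x @ (W[rows] + B[rows] @ A).T)

# concatenating the per-rank outputs recovers the full result
assert torch.allclose(torch.cat(shards, dim=-1), reference, atol=1e-5)
```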

To do that, we need to do two things:

  • If we create the adapter from scratch (not loaded from an existing checkpoint), we only need to add the Transformers TensorParallelLayer hooks.
  • If we load the adapter from an existing checkpoint, we also need to shard the loaded weights accordingly.

This PR provides such features and a test file to check that everything works as expected.

Next: add similar support for the Embedding layer.

@BenjaminBossan
Member

Thanks for taking care of this @michaelbenayoun. LMK if I can help with anything. If you have a minimal example to test this, that would be great.

A different approach that may work is to detect if a TP plan is being used and then initialize the corresponding PEFT layer differently using the sharded layers, I'm not sure what approach would be more robust.

@michaelbenayoun michaelbenayoun marked this pull request as ready for review March 7, 2026 00:03
dist.destroy_process_group()


def _test_training(rank, world_size, port):
Member

Is it possible to test using the following methodology? https://github.com/huggingface/transformers/blob/main/tests/test_training_mixin.py#L387

Take a tiny random model, overfit the same sample abcdefg... for several steps until the loss and grad_norm have decreased by 70% -> save and load -> generate? (Since it has overfit the sample, it should perfectly predict the sequence.)
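
A framework-free sketch of that methodology, using a toy next-token model instead of a real tiny checkpoint (all names here are illustrative):

```python
import torch

torch.manual_seed(0)

# stand-in for a tiny LM: learn to predict token i+1 from token i
vocab, dim = 10, 16
model = torch.nn.Sequential(
    torch.nn.Embedding(vocab, dim),
    torch.nn.Linear(dim, vocab),
)
seq = torch.arange(8)  # the fixed "abcdefg..." sample, as token ids
opt = torch.optim.Adam(model.parameters(), lr=1e-2)

losses = []
for _ in range(200):  # overfit the single sample
    logits = model(seq[:-1])
    loss = torch.nn.functional.cross_entropy(logits, seq[1:])
    opt.zero_grad()
    loss.backward()
    opt.step()
    losses.append(loss.item())

assert losses[-1] < 0.3 * losses[0]  # loss dropped by more than 70%
pred = model(seq[:-1]).argmax(dim=-1)
assert torch.equal(pred, seq[1:])  # the overfit model replays the sequence
```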


Member Author

Done

device_mesh = getattr(base_layer, "_hf_device_mesh", None)
if device_mesh is not None and tp_plan in ("colwise", "rowwise"):
pg = device_mesh.get_group()
src = torch.distributed.get_global_rank(pg, 0)
Member

Styling, but better to import torch.distributed as dist and then use dist (to be consistent with the transformers side)?

Member Author

Done

@3outeille
Member

Nice PR! Several questions that pop into my mind:

  • Regarding TP row linear + LoRA, should the all_reduce happen before or after the LoRA computation?
  • Given the importance of MoE as well, it would be good to have TP + PEFT + MoE support, what do you think?

@HuggingFaceDocBuilderDev

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

@michaelbenayoun
Member Author

Nice PR! Several questions that pop into my mind:

  • Regarding TP row linear + LoRA, should the all_reduce happen before or after the LoRA computation?
  • Given the importance of MoE as well, it would be good to have TP + PEFT + MoE support, what do you think?

  • TP row linear + LoRA will do this: lora_A gets sharded inputs, computes its output, and all-reduces, just like a regular row linear. Then lora_B gets the un-sharded result.
  • I agree, and the same goes for embeddings. I suggest we postpone that to other PRs so this one doesn't grow too large.
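
That row-linear ordering can be checked with plain tensors. In this minimal sketch (not the PR's code), the per-rank shards are simulated by slicing and a plain sum stands in for dist.all_reduce:

```python
import torch

torch.manual_seed(0)
in_f, out_f, r, world = 8, 6, 2, 2

x = torch.randn(4, in_f)
A = torch.randn(r, in_f)   # lora_A: row linear, weight sharded along dim 1
B = torch.randn(out_f, r)  # lora_B: regular linear, replicated

ref = x @ A.T  # un-sharded lora_A output

# each rank only sees its input shard; partial results are all-reduced
partials = []
for rank in range(world):
    cols = slice(rank * in_f // world, (rank + 1) * in_f // world)
    partials.append(x[:, cols] @ A[:, cols].T)
reduced = sum(partials)  # stands in for dist.all_reduce

assert torch.allclose(reduced, ref, atol=1e-5)
out = reduced @ B.T  # lora_B then runs on the un-sharded activation
```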

Member

@BenjaminBossan BenjaminBossan left a comment


Thanks for adding support for TP to LoRA. I'm not too knowledgeable when it comes to TP, so I can't judge the details there, so this review focuses more on the PEFT integration itself.

Right now, this is very LoRA specific. I think the same idea should work with other PEFT methods too, but it's not quite trivial to write the code in a generic way. So I'm fine with the approach here and we can adjust once/if there is demand for TP in other PEFT methods.

One question that I had is: Do you know the minimum transformers version that would be required to run this? The whole TP module seems to be from one year ago, but I'm not sure if later changes are required for this to actually work. If you know, could you please add a small section to the docs (https://github.com/huggingface/peft/blob/main/docs/source/developer_guides/lora.md) mentioning that TP is supported and requires transformers > x.y.z?

Moreover, although we don't have CI for this, we generally try to support older transformers versions as much as possible. Code like getattr(base_layer, "_hf_tp_plan", None) should always be fine, as this would just return None for older versions. But importing from transformers.integrations.tensor_parallel would fail. So how about importing it locally, only when needed?

Regarding the CI, it currently fails, most likely because the TP tests are being run on CPU runners. I made some suggestions that should hopefully resolve this. However, even if I run the tests locally on a machine with 2 GPUs, I get an error:

E           tp_base = AutoModelForCausalLM.from_pretrained(MODEL_ID, tp_plan="auto")
E         File "/home/name/work/forks/transformers/src/transformers/models/auto/auto_factory.py", line 381, in from_pretrained
E           return model_class.from_pretrained(
E                  ~~~~~~~~~~~~~~~~~~~~~~~~~~~^
E               pretrained_model_name_or_path, *model_args, config=config, **hub_kwargs, **kwargs
E               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
E           )
E           ^
E         File "/home/name/work/forks/transformers/src/transformers/modeling_utils.py", line 3989, in from_pretrained
E           device_map, device_mesh, tp_size = initialize_tensor_parallelism(
E                                              ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~^
E               tp_plan, tp_size=tp_size, device_mesh=device_mesh, device_map=device_map
E               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
E           )
E           ^
E         File "/home/name/work/forks/transformers/src/transformers/integrations/tensor_parallel.py", line 81, in initialize_tensor_parallelism
E           current_device.set_device(int(os.environ["LOCAL_RANK"]))
E                                         ~~~~~~~~~~^^^^^^^^^^^^^^
E         File "<frozen os>", line 717, in __getitem__
E       KeyError: 'LOCAL_RANK'


# We create and initialize the TensorParallelLayer on the fly,
# and we set the `empty_param` attribute depending on the proper
# state dict key to shard
Member

Could you please extend this comment as to why this is needed?

Member Author

Done.

Comment on lines +611 to +622
if isinstance(tp_layer, ColwiseParallel):
key = f"{name}.lora_B.{adapter_name}.weight"
tp_layer.empty_param = peft_model_state_dict[key]
peft_model_state_dict[key] = tp_layer.shard_tensor(
peft_model_state_dict[key], device=device, dtype=dtype
)
elif isinstance(tp_layer, RowwiseParallel):
key = f"{name}.lora_A.{adapter_name}.weight"
tp_layer.empty_param = peft_model_state_dict[key]
peft_model_state_dict[key] = tp_layer.shard_tensor(
peft_model_state_dict[key], device=device, dtype=dtype
)
Member

Should do the same, right?

Suggested change
if isinstance(tp_layer, ColwiseParallel):
key = f"{name}.lora_B.{adapter_name}.weight"
tp_layer.empty_param = peft_model_state_dict[key]
peft_model_state_dict[key] = tp_layer.shard_tensor(
peft_model_state_dict[key], device=device, dtype=dtype
)
elif isinstance(tp_layer, RowwiseParallel):
key = f"{name}.lora_A.{adapter_name}.weight"
tp_layer.empty_param = peft_model_state_dict[key]
peft_model_state_dict[key] = tp_layer.shard_tensor(
peft_model_state_dict[key], device=device, dtype=dtype
)
if isinstance(tp_layer, ColwiseParallel):
key = f"{name}.lora_B.{adapter_name}.weight"
elif isinstance(tp_layer, RowwiseParallel):
key = f"{name}.lora_A.{adapter_name}.weight"
tp_layer.empty_param = peft_model_state_dict[key]
peft_model_state_dict[key] = tp_layer.shard_tensor(
peft_model_state_dict[key], device=device, dtype=dtype
)

Member Author

Done

Comment on lines +286 to +303
if tp_plan == "colwise":
add_tensor_parallel_hooks_to_module(
self.model,
lora_module.lora_B[adapter_name],
tp_plan,
f"{current_key}.lora_B.{adapter_name}",
tp_plan,
device_mesh,
)
elif tp_plan == "rowwise":
add_tensor_parallel_hooks_to_module(
self.model,
lora_module.lora_A[adapter_name],
tp_plan,
f"{current_key}.lora_A.{adapter_name}",
tp_plan,
device_mesh,
)
Member

Should do the same but with less repetition, right?

Suggested change
if tp_plan == "colwise":
add_tensor_parallel_hooks_to_module(
self.model,
lora_module.lora_B[adapter_name],
tp_plan,
f"{current_key}.lora_B.{adapter_name}",
tp_plan,
device_mesh,
)
elif tp_plan == "rowwise":
add_tensor_parallel_hooks_to_module(
self.model,
lora_module.lora_A[adapter_name],
tp_plan,
f"{current_key}.lora_A.{adapter_name}",
tp_plan,
device_mesh,
)
if tp_plan == "colwise":
    tp_module = lora_module.lora_B[adapter_name]
    tp_layer_name = f"{current_key}.lora_B.{adapter_name}"
else:
    tp_module = lora_module.lora_A[adapter_name]
    tp_layer_name = f"{current_key}.lora_A.{adapter_name}"
add_tensor_parallel_hooks_to_module(
    model=self.model,
    module=tp_module,
    tp_plan=tp_plan,
    layer_name=tp_layer_name,
    current_module_plan=tp_plan,
    device_mesh=device_mesh,
)

Also, it looks like the tp_plan argument to add_tensor_parallel_hooks_to_module is not used at all, why is it needed?

Member Author

You are right about the tp_plan parameter. I opened a PR here: huggingface/transformers#44768. It should be merged before the release, so I will make sure to include the changes here if it happens.

Member Author

Other refactor mentioned, done.

@@ -0,0 +1,393 @@
# Copyright 2025-present the HuggingFace Inc. team.
Member

Suggested change
# Copyright 2025-present the HuggingFace Inc. team.
# Copyright 2026-present the HuggingFace Inc. team.

Member Author

Done

Comment on lines +35 to +36
MODEL_ID = "Qwen/Qwen3-0.6B"
TINY_MODEL_ID = "amazingvince/zephyr-smol_llama-100m-sft-full"
Member

Is there a specific reason to use these models? Otherwise, I'd like to move to models we're already using as they are already cached (the CI is already close to triggering rate limits from the Hub so we have to be careful).

Member Author

Alright, I get you.
For MODEL_ID, the idea is just to use a small Qwen3 model. We can use any LLM tbh.

For TINY_MODEL_ID, it is a bit more nuanced. Because we are running training steps and checking that overfitting happens, we cannot simply take a tiny randomly initialized model: there is a glass ceiling on what the LoRA adapters can learn when the base model is just full of garbage. So we need an actually trained model, small enough to run fast in the CI, and I managed to find this small 100m fine-tuned model.

_teardown_dist()


def _test_training_overfit(rank, world_size, port):
Member

This looks more like an integration test to me. We could put a separate script for this into tests/training/ and then invoke it from the tests_training target in the Makefile (peft/Makefile, line 65 at 2513f57).

This ensures it's only run in the correct context and when it's needed (not as part of the regular CI, which runs on CPU).

logger.info(f"{Colors.GREEN}✓ Generated sequence matches training sequence{Colors.RESET}")


def _test_lora_weight_synchronization(rank, world_size, port):
Member

This test and the rest below should be put into tests/test_gpu_examples.py. Let's put them all in the same test class to make it clear they belong together. Then decorate the class with @pytest.mark.multi_gpu_tests. That way, we know that the tests only run on the multi GPU runner.

Member Author

It also works on any runner with multiple CPU cores. But I did exactly as you suggested.

Comment on lines +371 to +372
@unittest.skipUnless(_is_tp_available(), "transformers TP integration not available")
class TestLoraTP(unittest.TestCase):
Member

Suggested change
@unittest.skipUnless(_is_tp_available(), "transformers TP integration not available")
class TestLoraTP(unittest.TestCase):
@pytest.mark.skipif(not _is_tp_available(), reason="transformers TP integration not available")
class TestLoraTensorParallel:

As we're away from unittest.

Member Author

Done.
