
PeftModel.from_pretrained adapter not applied when using device="cuda:1" #2787

@Xxxxsir

Description


When I load a base model with AutoModelForCausalLM and then apply a LoRA adapter with PeftModel.from_pretrained, the adapter works correctly on GPU 0, but on GPU 1 the model behaves like the original base model (no adapter effect).

No error is raised; inference runs normally, but the outputs match the base model rather than the adapted model.

Here is my code:

    from transformers import AutoTokenizer, AutoModelForCausalLM
    from peft import PeftModel

    print("Using PEFT model for inference.")
    tokenizer = AutoTokenizer.from_pretrained(
        adapter_model_path, token=hf_token, trust_remote_code=True, use_fast=True
    )
    # device is either "cuda:0" (adapter works) or "cuda:1" (adapter has no effect)
    model = AutoModelForCausalLM.from_pretrained(
        model_name, token=hf_token, dtype=dtype, device_map=device
    )
    model.resize_token_embeddings(len(tokenizer))

    adapter_path = adapter_model_path + "/adapter_model"
    print(f"Loading adapter model from {adapter_path}")
    model = PeftModel.from_pretrained(model, adapter_path)
    model.eval()

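One way to narrow this down (a minimal diagnostic sketch, not an official PEFT API): collect parameter names and devices from `model.named_parameters()` into a plain dict and check (a) that any `lora_` parameters exist at all, and (b) that they sit on the same device as the base weights. The helper below is a hypothetical illustration that operates on such a dict, so it can be sanity-checked without a GPU; the parameter names in the example are made up for demonstration.

```python
def check_lora_placement(param_devices):
    """param_devices: dict mapping parameter name -> device string,
    e.g. built via {n: str(p.device) for n, p in model.named_parameters()}.

    Returns (has_lora, mismatched): has_lora is True if any LoRA
    parameters were found; mismatched lists LoRA parameters whose
    device is not among the base model's devices."""
    lora = {n: d for n, d in param_devices.items() if "lora_" in n}
    base = {n: d for n, d in param_devices.items() if "lora_" not in n}
    base_devices = set(base.values())
    mismatched = [n for n, d in lora.items() if d not in base_devices]
    return bool(lora), mismatched


# Hypothetical example: the base weight landed on cuda:1,
# but the adapter tensors stayed on cuda:0.
params = {
    "model.layers.0.self_attn.q_proj.weight": "cuda:1",
    "model.layers.0.self_attn.q_proj.lora_A.default.weight": "cuda:0",
    "model.layers.0.self_attn.q_proj.lora_B.default.weight": "cuda:0",
}
has_lora, mismatched = check_lora_placement(params)
print(has_lora, mismatched)
```

If `has_lora` is False, the adapter was never attached to the model at all; if `mismatched` is non-empty, the adapter weights ended up on a different device than the base weights, which would be one plausible cause of base-model-only outputs on cuda:1.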
Environment

nvidia-smi (Thu Sep 18 09:02:23 2025): Driver Version 535.247.01, CUDA Version 12.2

GPU 0: NVIDIA A100 80GB PCIe, 67975MiB / 81920MiB, 100% util, MIG disabled
GPU 1: NVIDIA A100 80GB PCIe, 18560MiB / 81920MiB, 100% util, MIG disabled

Expected behavior
When specifying device="cuda:1", the adapter should be applied and inference results should match the adapted model, just as they do on GPU 0.
