Description
When I load a base model with AutoModelForCausalLM and then apply a LoRA adapter using PeftModel.from_pretrained, the adapter works correctly on GPU0, but on GPU1 the model behaves like the original base model (no adapter effect).
No error is raised — inference runs normally, but outputs match the base model instead of the adapted model.
Here is my code:
```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel

# model_name, adapter_model_path, hf_token, dtype, and device are defined
# elsewhere in my script; device is "cuda:0" (works) or "cuda:1" (fails).
print("Using PEFT model for inference.")
tokenizer = AutoTokenizer.from_pretrained(adapter_model_path, token=hf_token, trust_remote_code=True, use_fast=True)
model = AutoModelForCausalLM.from_pretrained(model_name, token=hf_token, dtype=dtype, device_map=device)
model.resize_token_embeddings(len(tokenizer))
adapter_path = adapter_model_path + "/adapter_model"
print(f"Loading adapter model from {adapter_path}")
model = PeftModel.from_pretrained(model, adapter_path)
model.eval()
```
Environment
```
Thu Sep 18 09:02:23 2025
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.247.01             Driver Version: 535.247.01   CUDA Version: 12.2     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  NVIDIA A100 80GB PCIe          Off | 00000000:27:00.0 Off |                    0 |
| N/A   62C    P0             292W / 300W |  67975MiB / 81920MiB |    100%      Default |
|                                         |                      |             Disabled |
+-----------------------------------------+----------------------+----------------------+
|   1  NVIDIA A100 80GB PCIe          Off | 00000000:38:00.0 Off |                    0 |
| N/A   65C    P0             291W / 300W |  18560MiB / 81920MiB |    100%      Default |
|                                         |                      |             Disabled |
+-----------------------------------------+----------------------+----------------------+
```
Expected behavior
When specifying device="cuda:1", the adapter should be applied and inference outputs should match the adapted model, just as they do on GPU0.