
[Feature] offload inference for Big Model parameters out of npu memory #2238

@Mr-Xiao2021

Description

When using MindNLP, if the model parameters cannot fully fit into the NPU memory, it seems there is currently no mechanism to offload parameters to the CPU or disk. This causes memory overflow issues when loading large models.

I would like MindNLP to support parameter offloading and on-demand loading, similar to the “device_map” and “offload_folder” features in Hugging Face Transformers, so that parts of the model can stay on CPU or disk and be moved to the NPU dynamically during inference or training.
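For reference, the equivalent feature in Hugging Face Transformers looks roughly like this (the model id is only a placeholder, and device_map/offload_folder require the accelerate package):

```python
# Reference: big-model inference with offloading in Hugging Face Transformers.
# Requires the `accelerate` package; the model id below is only a placeholder.
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained(
    "org/some-large-model",
    device_map="auto",         # split layers automatically across accelerator, CPU, and disk
    offload_folder="offload",  # directory used for weights that are offloaded to disk
)
```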

I have tried manually splitting the model and transferring parameters between CPU and NPU layer by layer, but this approach is inefficient and difficult to manage for large-scale models.
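For illustration, the manual workaround is roughly the sketch below; the load_to_npu/free_from_npu helpers are hypothetical placeholders for the actual device-transfer code, not MindNLP functions:

```python
# Sketch of the manual layer-by-layer workaround described above.
# `load_to_npu` / `free_from_npu` are hypothetical helpers standing in for
# the real CPU<->NPU parameter-transfer code; they are not MindNLP APIs.
def run_layer_by_layer(layers, hidden_states, load_to_npu, free_from_npu):
    for layer in layers:
        load_to_npu(layer)                    # copy this layer's parameters CPU -> NPU
        hidden_states = layer(hidden_states)  # run the layer on the NPU
        free_from_npu(layer)                  # release NPU memory before the next layer
    return hidden_states
```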

Additional context
It would be helpful if MindNLP could provide an automatic or semi-automatic offload mechanism for large models that cannot fully fit into NPU memory, as shown in the highlighted code snippet below.

[Image: highlighted code snippet]
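For concreteness, a minimal sketch of what such an API could look like, assuming it mirrors the Transformers interface; the device_map and offload_folder arguments are the proposed additions, not an existing MindNLP API:

```python
# Hypothetical sketch of the requested MindNLP API (proposed, not implemented):
# device_map / offload_folder would mirror the Hugging Face Transformers behavior.
from mindnlp.transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained(
    "model-id-or-local-path",    # placeholder for any model larger than NPU memory
    device_map="auto",           # proposed: split layers across NPU, CPU, and disk
    offload_folder="./offload",  # proposed: directory for weights spilled to disk
)
```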
