Bug summary
When I ran DPA2 or DPA3 training with the PT backend, I observed abnormal, continuous heavy read traffic from the BeeGFS file system hosting the working directory (over 2 Gbps for a single-GPU training job). Just a few dozen single-GPU training jobs are enough to saturate the 100 Gbps bandwidth of the storage nodes, at which point the training speed drops significantly.
These read operations never reach the physical (hardware) disks, which indicates that the range of data blocks being read is very small and the reads are served from the RAM cache.
This phenomenon does not occur on an ordinary NFSoRDMA file system.
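To quantify the read traffic at the process level, here is a minimal monitoring sketch (not part of the original report). It assumes the PID of the "dp --pt train" process is passed as the first argument, and it samples rchar from /proc/<pid>/io, which counts all bytes requested through read()-style syscalls regardless of whether they are served from a cache. The script name and sampling interval are arbitrary choices.

#!/bin/bash
# Sketch only: log the per-second read volume of one training process.
# Usage: ./monitor_reads.sh <PID of the "dp --pt train" process>
PID=$1
prev=$(awk '/^rchar/ {print $2}' "/proc/$PID/io")
while kill -0 "$PID" 2>/dev/null; do
    sleep 1
    cur=$(awk '/^rchar/ {print $2}' "/proc/$PID/io" 2>/dev/null)
    [ -n "$cur" ] || break   # process exited between checks
    # rchar includes reads satisfied from a cache, so it reflects the
    # request volume seen by the BeeGFS client, not only disk I/O.
    echo "$(date +%T)  read rate: $(( (cur - prev) / 1024 / 1024 )) MiB/s"
    prev=$cur
done

If the issue reproduces, the logged rate should be on the order of the 2 Gbps (roughly 250 MiB/s) per single-GPU job reported above.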
Platform
- The Open Source Supercomputing Center of S-A-I;
- LiuLab-HPC
File System
- BeeGFS 8.1.0 (RoCEv2);
- BeeGFS 7.4.5 (IB)
Netdata Monitor of File System I/O on Compute Nodes
DeePMD-kit Version
3.0.0 ~ 3.1.1
Backend and its version
PyTorch (PT), as bundled with the offline packages
How did you download the software?
Offline packages
Input Files, Running Commands, Error Log, etc.
Slurm sbatch script:
#!/bin/bash
#SBATCH --job-name=DP-Train
#SBATCH --partition=4V100
#SBATCH --nodes=1
#SBATCH --ntasks=1 # Nodes * GPUs-per-node * Ranks-per-GPU
#SBATCH --gpus-per-node=1 # Specify the GPUs-per-node
#SBATCH --qos=improper-gpu # Depending on your needs [Priority: rush-4gpu = rush-8gpu > improper-gpu > huge-gpu]
export OMP_NUM_THREADS=2
nvidia-smi dmon -s pucvmte -o T > nvdmon_job-$SLURM_JOB_ID.log &
source /opt/envs/deepmd3.1.1.env
export DP_INTERFACE_PREC=low
dp --pt train input.json
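As a possible mitigation while the root cause is investigated, the dataset could be staged to node-local storage before launching the training so that the repeated small reads stay off BeeGFS. This is a speculative sketch, not a verified fix and not part of the original report; the data/ directory name, the SLURM_TMPDIR fallback, and the assumption that input.json uses relative paths are all hypothetical and must be adapted to the actual job layout.

# Speculative staging step, to be placed before "dp --pt train input.json":
LOCAL_DIR=${SLURM_TMPDIR:-/tmp/dp_$SLURM_JOB_ID}   # SLURM_TMPDIR is not defined on every cluster
mkdir -p "$LOCAL_DIR"
cp -r "$SLURM_SUBMIT_DIR"/data "$LOCAL_DIR"/       # hypothetical training-data directory
cp "$SLURM_SUBMIT_DIR"/input.json "$LOCAL_DIR"/
cd "$LOCAL_DIR"
dp --pt train input.json

If staging the data eliminates the heavy BeeGFS traffic, that would point to the training data loader as the source of the repeated small reads.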
Steps to Reproduce
I can provide a supercomputer account for reproducing the problem.
Further Information, Files, and Links
No response