Bug summary
When I ran DPA2 or DPA3 training with the PT backend, I observed abnormal, continuous heavy read traffic from the BeeGFS file system hosting the working directory (over 2 Gbps for a single-GPU training job). Just a few dozen single-GPU training jobs are enough to saturate the 100 Gbps bandwidth of the storage nodes, at which point the training speed drops significantly.
These read operations never reach the physical (hardware) disks, which indicates that the range of data blocks being read is very small and the reads are served from the RAM cache.
This phenomenon does not occur on an ordinary NFSoRDMA file system.
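To quantify the read traffic at the process level, here is a minimal monitoring sketch (not part of the original report). It assumes the PID of the "dp --pt train" process is passed as the first argument, and it samples rchar from /proc/<pid>/io, which counts all bytes requested through read()-style syscalls regardless of whether they are served from a cache. The script name and sampling interval are arbitrary choices.

#!/bin/bash
# Sketch only: log the per-second read volume of one training process.
# Usage: ./monitor_reads.sh <PID of the "dp --pt train" process>
PID=$1
prev=$(awk '/^rchar/ {print $2}' "/proc/$PID/io")
while kill -0 "$PID" 2>/dev/null; do
    sleep 1
    cur=$(awk '/^rchar/ {print $2}' "/proc/$PID/io" 2>/dev/null)
    [ -n "$cur" ] || break   # process exited between checks
    # rchar includes reads satisfied from a cache, so it reflects the
    # request volume seen by the BeeGFS client, not only disk I/O.
    echo "$(date +%T)  read rate: $(( (cur - prev) / 1024 / 1024 )) MiB/s"
    prev=$cur
done

If the issue reproduces, the logged rate should be on the order of the 2 Gbps (roughly 250 MiB/s) per single-GPU job reported above.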
Platform
- The Open Source Supercomputing Center of S-A-I;
- LiuLab-HPC
File System
- BeeGFS 8.1.0 (RoCEv2);
- BeeGFS 7.4.5 (IB)
Netdata Monitor of File System I/O on Compute Nodes
DeePMD-kit Version
3.0.0 ~ 3.1.1
Backend and its version
PyTorch (PT), as bundled with the offline packages
How did you download the software?
Offline packages
Input Files, Running Commands, Error Log, etc.
Slurm sbatch script:
#!/bin/bash
#SBATCH --job-name=DP-Train
#SBATCH --partition=4V100
#SBATCH --nodes=1
#SBATCH --ntasks=1 # Nodes * GPUs-per-node * Ranks-per-GPU
#SBATCH --gpus-per-node=1 # Specify the GPUs-per-node
#SBATCH --qos=improper-gpu # Depending on your needs [Priority: rush-4gpu = rush-8gpu > improper-gpu > huge-gpu]
export OMP_NUM_THREADS=2
nvidia-smi dmon -s pucvmte -o T > nvdmon_job-$SLURM_JOB_ID.log &
source /opt/envs/deepmd3.1.1.env
export DP_INTERFACE_PREC=low
dp --pt train input.json
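As a possible mitigation while the root cause is investigated, the dataset could be staged to node-local storage before launching the training so that the repeated small reads stay off BeeGFS. This is a speculative sketch, not a verified fix and not part of the original report; the data/ directory name, the SLURM_TMPDIR fallback, and the assumption that input.json uses relative paths are all hypothetical and must be adapted to the actual job layout.

# Speculative staging step, to be placed before "dp --pt train input.json":
LOCAL_DIR=${SLURM_TMPDIR:-/tmp/dp_$SLURM_JOB_ID}   # SLURM_TMPDIR is not defined on every cluster
mkdir -p "$LOCAL_DIR"
cp -r "$SLURM_SUBMIT_DIR"/data "$LOCAL_DIR"/       # hypothetical training-data directory
cp "$SLURM_SUBMIT_DIR"/input.json "$LOCAL_DIR"/
cd "$LOCAL_DIR"
dp --pt train input.json

If staging the data eliminates the heavy BeeGFS traffic, that would point to the training data loader as the source of the repeated small reads.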
Steps to Reproduce
I can provide a supercomputer account for reproducing the problem.
Further Information, Files, and Links
No response