[BUG] Abnormal file system I/O during PT backend training #5008

@Entropy-Enthalpy

Bug summary

When running DPA2 or DPA3 training with the PT backend, I observed abnormal, continuous heavy reads from the working directory on the BeeGFS file system (more than 2 Gbps for a single-GPU training job). Just a few dozen single-GPU training jobs are enough to saturate the 100 Gbps bandwidth of the storage node, and the training speed drops significantly.

These reads did not reach the physical disks, which suggests that the range of data blocks being read is very small and that the reads are served from the RAM cache.

This behavior does not occur on an ordinary NFSoRDMA file system.
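To attribute the read traffic to a specific process rather than to the node as a whole, one option is to sample the `rchar` counter in `/proc/<pid>/io` on the compute node. The sketch below is only an illustration and is not part of DeePMD-kit: the function names `read_rchar` and `sample_read_rate` are hypothetical, and it assumes a Linux node where the job owner can read `/proc/<pid>/io`. Note that `rchar` counts bytes passed to `read`-like system calls, whether or not they were served from a cache, and that it does not include memory-mapped reads or child processes.

```shell
#!/bin/bash
# Hypothetical helpers (not part of DeePMD-kit) for estimating how fast
# a process is reading, based on the rchar counter in /proc/<pid>/io.

# Print the cumulative number of bytes the process has read via syscalls.
read_rchar() {
    awk '/^rchar:/ {print $2}' "/proc/$1/io"
}

# Print the average read rate in bytes per second over an interval.
# Usage: sample_read_rate PID [INTERVAL_SECONDS]
sample_read_rate() {
    local pid=$1 interval=${2:-5}
    local before after
    before=$(read_rchar "$pid") || return 1
    sleep "$interval"
    after=$(read_rchar "$pid") || return 1
    echo $(( (after - before) / interval ))
}
```

For example, `sample_read_rate "$(pgrep -n -f 'dp --pt train')" 10` would report the average read rate of the most recently started matching process over ten seconds. If data loading happens in worker processes, each worker PID would need to be sampled separately.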

Platform

  1. The Open Source Supercomputing Center of S-A-I;
  2. LiuLab-HPC

File System

  1. BeeGFS 8.1.0 (RoCEv2);
  2. BeeGFS 7.4.5 (IB)

Netdata Monitor of File System I/O on Compute Nodes
[Image: Netdata monitor screenshot]

DeePMD-kit Version

3.0.0 ~ 3.1.1

Backend and its version

bundled with all offline packages

How did you download the software?

Offline packages

Input Files, Running Commands, Error Log, etc.

Slurm sbatch script:

#!/bin/bash
#SBATCH --job-name=DP-Train
#SBATCH --partition=4V100
#SBATCH --nodes=1
#SBATCH --ntasks=1          # Nodes * GPUs-per-node * Ranks-per-GPU
#SBATCH --gpus-per-node=1   # Specify the GPUs-per-node
#SBATCH --qos=improper-gpu  # Depending on your needs [Priority: rush-4gpu = rush-8gpu > improper-gpu > huge-gpu]

export OMP_NUM_THREADS=2

nvidia-smi dmon -s pucvmte -o T > nvdmon_job-$SLURM_JOB_ID.log &

source /opt/envs/deepmd3.1.1.env

export DP_INTERFACE_PREC=low

dp --pt train input.json
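While a job submitted with a script like the one above is running, it may help to see which files under the working directory the training process keeps open, since those are the likely targets of the repeated reads. The following is a rough sketch under stated assumptions: `list_open_files` is a hypothetical helper, it relies on the Linux `/proc/<pid>/fd` layout, and it only shows files that are open at the moment it runs (files that are opened, read, and closed quickly will be missed).

```shell
#!/bin/bash
# Hypothetical helper (not part of DeePMD-kit): list the files a process
# currently holds open whose paths start with a given prefix, with a
# count of how many descriptors point at each file.
# Usage: list_open_files PID [PATH_PREFIX]
list_open_files() {
    local pid=$1 prefix=${2:-/}
    local fd target
    for fd in /proc/"$pid"/fd/*; do
        # Resolve the descriptor to its target; skip ones that vanish.
        target=$(readlink "$fd" 2>/dev/null) || continue
        if [[ $target == "$prefix"* ]]; then
            printf '%s\n' "$target"
        fi
    done | sort | uniq -c | sort -rn
}
```

A call such as `list_open_files "$(pgrep -n -f 'dp --pt train')" "$PWD"`, run from the job's working directory on the compute node, would list the open files located on the BeeGFS mount.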

Steps to Reproduce

I can provide the supercomputer account for reproducing the problem.

Further Information, Files, and Links

No response
