LAMMPS aborted unexpectedly #4658
Unanswered
walk-for-me asked this question in Q&A
See tensorflow/tensorflow#41532. It seems you have to limit the number of threads to what your system allows. Also, threads may compete with each other if there are too many.
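One way to limit the threads, as suggested above, is to set the DeePMD-kit thread variables named in the warnings explicitly instead of leaving them unset. The following is only a sketch under assumptions: `CORES_PER_NODE`, `MPI_RANKS`, and `THREADS_PER_RANK` are hypothetical helper names, the value 64 mirrors `--ntasks-per-node=64` from the Slurm script, and the right split for a given cluster may differ.

```shell
#!/bin/bash
# Hypothetical helper variables (not part of DeePMD-kit or LAMMPS).
CORES_PER_NODE=64   # assumed: one core per Slurm task on the node
MPI_RANKS=64        # matches "mpirun -n 64" in the original script

# Threads each rank can use without oversubscribing the node (at least 1).
THREADS_PER_RANK=$(( CORES_PER_NODE / MPI_RANKS ))
if [ "$THREADS_PER_RANK" -lt 1 ]; then
  THREADS_PER_RANK=1
fi

# Set the thread counts explicitly so every rank stays within its budget.
export OMP_NUM_THREADS="$THREADS_PER_RANK"
export DP_INTRA_OP_PARALLELISM_THREADS="$THREADS_PER_RANK"
export DP_INTER_OP_PARALLELISM_THREADS=1

echo "ranks=$MPI_RANKS threads_per_rank=$THREADS_PER_RANK"
# The launch line would follow the exports, for example:
#   mpirun -n "$MPI_RANKS" lmp_mpi -in in.lammps > out.log
```

With 64 ranks this gives one thread per rank. Lowering `MPI_RANKS` leaves more threads for each rank, which may suit a 54-atom system better, but that is a tuning question to check against https://deepmd.rtfd.io/parallelism/.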
I used DeePMD-kit v3.0.0 to train a DP model. When I run MD with the LAMMPS build bundled with DeePMD-kit, the job is terminated before it completes a single step. I do not know the cause and would appreciate advice from anyone who has seen this before. Thank you.
The following is a detailed description of the problem:
1. Output files:
a. This is the error log of the Slurm submission script, from the test.slurm.e file:
DeePMD-kit WARNING: Environmental variable DP_INTER_OP_PARALLELISM_THREADS is not set. Tune DP_INTER_OP_PARALLELISM_THREADS for the best performance. See https://deepmd.rtfd.io/parallelism/ for more information.
DeePMD-kit WARNING: Environmental variable DP_INTRA_OP_PARALLELISM_THREADS is not set. DeePMD-kit WARNING: Environmental variable Tune DP_INTRA_OP_PARALLELISM_THREADS for the best performance. See https://deepmd.rtfd.io/parallelism/ for more information.
DeePMD-kit WARNING: Environmental variable DP_INTER_OP_PARALLELISM_THREADS is not set. Tune DP_INTER_OP_PARALLELISM_THREADS for the best performance. See https://deepmd.rtfd.io/parallelism/ for more information.
To enable the following instructions: SSE4.1 SSE4.2 AVX AVX2 AVX512F AVX512_VNNI FMA, in other operations, rebuild TensorFlow with the appropriate compiler flags.
2025-03-14 19:00:33.528985: I tensorflow/core/platform/cpu_feature_guard.cc:210] This TensorFlow binary is optimized to use available CPU instructions in performance-critical operations.
To enable the following instructions: SSE4.1 SSE4.2 AVX AVX2 AVX512F AVX512_VNNI FMA, in other operations, rebuild TensorFlow with the appropriate compiler flags.
2025-03-14 19:00:33.528911: I tensorflow/core/platform/cpu_feature_guard.cc:210] This TensorFlow binary is optimized to use available CPU instructions in performance-critical operations.
To enable the following instructions: SSE4.1 SSE4.2 AVX AVX2 AVX512F AVX512_VNNI FMA, in other operations, rebuild TensorFlow with the appropriate compiler flags.
2025-03-14 19:00:33.578104: F external/local_tsl/tsl/platform/default/env.cc:74] Check failed: ret == 0 (11 vs. 0)Thread tf_numa_-1_Eigen creation via pthread_create() failed.
2025-03-14 19:00:33.578108: F external/local_tsl/tsl/platform/default/env.cc:74] Check failed: ret == 0 (11 vs. 0)Thread tf_numa_-1_Eigen creation via pthread_create() failed.
2025-03-14 19:00:33.578109: F external/local_tsl/tsl/platform/default/env.cc:74] Check failed: ret == 0 (11 vs. 0)Thread tf_numa_-1_Eigen creation via pthread_create() failed.
b. This is the content of out.log, the LAMMPS output log:
WARNING: Triclinic box skew is large. LAMMPS will run inefficiently. (src/domain.cpp:221)
4 by 4 by 4 MPI processor grid
reading atoms ...
54 atoms
read_data CPU = 0.218 seconds
Summary of lammps deepmd module ...
c. In addition, a large number of core.* files were generated. These files show up as garbled characters when opened, and each one is over 600 MB in size.
2. Input files:
a. This is my LAMMPS input file and Slurm script:
units metal
dimension 3
boundary p p p
atom_style atomic
timestep 0.001
neighbor 1.0 bin
neigh_modify every 1 delay 0 check yes
thermo 10000
read_data 123
pair_style deepmd graph-compress.pb
pair_coeff * *
variable N equal step
variable pote equal pe
variable kin equal ke
variable T equal temp
variable Press equal press
variable V equal vol
fix 1 all box/relax iso 100
min_style cg
minimize 1.0e-12 1.0e-12 10000 100000
unfix 1
reset_timestep 0
velocity all create 300 4928459 dist gaussian
#relax
restart 20000 restart.*.prepare
fix extra all print 20000 "${N} ${T} ${V} ${pote} ${kin} ${Press}" file data.prepare
dump 1 all custom 20000 dump.relax id element x y z
dump_modify 1 element Ti H
fix 1 all nvt temp 300 300 4
run 2000
unfix extra
undump 1
unfix 1
#!/bin/bash
#SBATCH --job-name=test
#SBATCH --partition=cpu-1
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=64
#SBATCH --error=%j.err
#SBATCH --output=%j.out
#SBATCH --comment=ddli_p1
#source /opt/intel/bin/compilervars.sh intel64
mpirun -n 64 lmp_mpi -in in.lammps > out.log
b. There are only 54 atoms in my structure file "123". Here are my parallelism-related settings:
export TF_ENABLE_ONEDNN_OPTS=0
export OMP_NUM_THREADS=1
#export DP_INTRA_OP_PARALLELISM_THREADS=8
#export DP_INTER_OP_PARALLELISM_THREADS=2
test.zip
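A note on the fatal lines in the error log: `pthread_create()` returns an error number directly, and 11 is `EAGAIN` on Linux, which means the system lacked the resources to create another thread or a limit on the number of threads was reached. A minimal sketch for inspecting the relevant limits on a compute node follows; the variable names are illustrative, and it should be run inside a job on the same partition to see the limits that actually apply.

```shell
#!/bin/bash
# Per-user limit on processes/threads; may print "unlimited".
max_user_procs=$(ulimit -u 2>/dev/null || echo unknown)

# Number of CPUs currently online on this node.
online_cpus=$(getconf _NPROCESSORS_ONLN)

echo "max user processes/threads: $max_user_procs"
echo "online CPUs: $online_cpus"

# Linux only: system-wide ceiling on the number of threads, if readable.
if [ -r /proc/sys/kernel/threads-max ]; then
  echo "kernel threads-max: $(cat /proc/sys/kernel/threads-max)"
fi
```

If the per-user limit is small compared with the number of MPI ranks multiplied by the threads each rank starts, thread creation can fail in the way shown in the log.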