LAMMPS aborted unexpectedly #4658
walk-for-me asked this question in Q&A (Unanswered)
-
See tensorflow/tensorflow#41532. You likely need to limit the number of threads to what your system allows; when there are too many, the threads also compete with one another.
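A minimal sketch of what that could look like in the Slurm script from the question below, assuming the job keeps its 64 MPI ranks and each rank is restricted to a single thread; the specific values are illustrative, not a verified fix:
# Check the per-user process/thread limit on the compute node;
# pthread_create() failing with errno 11 (EAGAIN) usually means this limit was hit.
ulimit -u
# Keep every rank single-threaded so 64 ranks do not oversubscribe the node.
export OMP_NUM_THREADS=1
export DP_INTRA_OP_PARALLELISM_THREADS=1
export DP_INTER_OP_PARALLELISM_THREADS=1
mpirun -n 64 lmp_mpi -in in.lammps > out.log
If the limit reported by ulimit -u is low, launching fewer MPI ranks (for example mpirun -n 16) is another way to stay under it; with only 54 atoms in the system, 64 ranks is more decomposition than the run needs in any case.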
-
I used DeePMD-kit v3.0.0 to train a DP model, and when I run MD with the LAMMPS bundled with DeePMD-kit, the job is terminated before it completes a single step. I do not know the cause and would appreciate advice from anyone who has seen this before. Thank you.
Here is a detailed description of the problem:
1. Output files:
a. This is the error log from the Slurm submission script, in the test.slurm.e file:
DeePMD-kit WARNING: Environmental variable DP_INTER_OP_PARALLELISM_THREADS is not set. Tune DP_INTER_OP_PARALLELISM_THREADS for the best performance. See https://deepmd.rtfd.io/parallelism/ for more information.
DeePMD-kit WARNING: Environmental variable DP_INTRA_OP_PARALLELISM_THREADS is not set. Tune DP_INTRA_OP_PARALLELISM_THREADS for the best performance. See https://deepmd.rtfd.io/parallelism/ for more information.
DeePMD-kit WARNING: Environmental variable DP_INTER_OP_PARALLELISM_THREADS is not set. Tune DP_INTER_OP_PARALLELISM_THREADS for the best performance. See https://deepmd.rtfd.io/parallelism/ for more information.
To enable the following instructions: SSE4.1 SSE4.2 AVX AVX2 AVX512F AVX512_VNNI FMA, in other operations, rebuild TensorFlow with the appropriate compiler flags.
2025-03-14 19:00:33.528985: I tensorflow/core/platform/cpu_feature_guard.cc:210] This TensorFlow binary is optimized to use available CPU instructions in performance-critical operations.
To enable the following instructions: SSE4.1 SSE4.2 AVX AVX2 AVX512F AVX512_VNNI FMA, in other operations, rebuild TensorFlow with the appropriate compiler flags.
2025-03-14 19:00:33.528911: I tensorflow/core/platform/cpu_feature_guard.cc:210] This TensorFlow binary is optimized to use available CPU instructions in performance-critical operations.
To enable the following instructions: SSE4.1 SSE4.2 AVX AVX2 AVX512F AVX512_VNNI FMA, in other operations, rebuild TensorFlow with the appropriate compiler flags.
2025-03-14 19:00:33.578104: F external/local_tsl/tsl/platform/default/env.cc:74] Check failed: ret == 0 (11 vs. 0)Thread tf_numa_-1_Eigen creation via pthread_create() failed.
2025-03-14 19:00:33.578108: F external/local_tsl/tsl/platform/default/env.cc:74] Check failed: ret == 0 (11 vs. 0)Thread tf_numa_-1_Eigen creation via pthread_create() failed.
2025-03-14 19:00:33.578109: F external/local_tsl/tsl/platform/default/env.cc:74] Check failed: ret == 0 (11 vs. 0)Thread tf_numa_-1_Eigen creation via pthread_create() failed.
b. This is the error message in out.log, the LAMMPS output log:
WARNING: Triclinic box skew is large. LAMMPS will run inefficiently. (src/domain.cpp:221)
4 by 4 by 4 MPI processor grid
reading atoms ...
54 atoms
read_data CPU = 0.218 seconds
Summary of lammps deepmd module ...
c. In addition, a large number of core.* files were generated. They contain unreadable binary data, and each is over 600 MB in size.
2. Input files:
a. This is my LAMMPS input file and Slurm script:
units metal
dimension 3
boundary p p p
atom_style atomic
timestep 0.001
neighbor 1.0 bin
neigh_modify every 1 delay 0 check yes
thermo 10000
read_data 123
pair_style deepmd graph-compress.pb
pair_coeff * *
variable N equal step
variable pote equal pe
variable kin equal ke
variable T equal temp
variable Press equal press
variable V equal vol
fix 1 all box/relax iso 100
min_style cg
minimize 1.0e-12 1.0e-12 10000 100000
unfix 1
reset_timestep 0
velocity all create 300 4928459 dist gaussian
#relax
restart 20000 restart.*.prepare
fix extra all print 20000 "${N} ${T} ${V} ${pote} ${kin} ${Press}" file data.prepare
dump 1 all custom 20000 dump.relax id element x y z
dump_modify 1 element Ti H
fix 1 all nvt temp 300 300 4
run 2000
unfix extra
undump 1
unfix 1
#!/bin/bash
#SBATCH --job-name=test
#SBATCH --partition=cpu-1
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=64
#SBATCH --error=%j.err
#SBATCH --output=%j.out
#SBATCH --comment=ddli_p1
#source /opt/intel/bin/compilervars.sh intel64
mpirun -n 64 lmp_mpi -in in.lammps > out.log
b. There are only 54 atoms in my structure file "123". Here are my parallel-related settings:
export TF_ENABLE_ONEDNN_OPTS=0
export OMP_NUM_THREADS=1
#export DP_INTRA_OP_PARALLELISM_THREADS=8
#export DP_INTER_OP_PARALLELISM_THREADS=2
Attachment: test.zip