LAMMPS aborted unexpectedly #4658

walk-for-me · 2025-03-15T09:21:50Z

walk-for-me
Mar 15, 2025

I used DeePMD-kit v3.0.0 to train a DP model. When I run MD with the LAMMPS build bundled with DeePMD-kit, the job is terminated before it completes a single step. I do not know the cause and would appreciate advice from anyone who has seen this before. Thank you.
A detailed description of the problem follows:

1. Output files:
a. This is the error log from the Slurm submission script, in the test.slurm.e file:

DeePMD-kit WARNING: Environmental variable DP_INTER_OP_PARALLELISM_THREADS is not set. Tune DP_INTER_OP_PARALLELISM_THREADS for the best performance. See https://deepmd.rtfd.io/parallelism/ for more information.
DeePMD-kit WARNING: Environmental variable DP_INTRA_OP_PARALLELISM_THREADS is not set. DeePMD-kit WARNING: Environmental variable Tune DP_INTRA_OP_PARALLELISM_THREADS for the best performance. See https://deepmd.rtfd.io/parallelism/ for more information.
DeePMD-kit WARNING: Environmental variable DP_INTER_OP_PARALLELISM_THREADS is not set. Tune DP_INTER_OP_PARALLELISM_THREADS for the best performance. See https://deepmd.rtfd.io/parallelism/ for more information.
To enable the following instructions: SSE4.1 SSE4.2 AVX AVX2 AVX512F AVX512_VNNI FMA, in other operations, rebuild TensorFlow with the appropriate compiler flags.
2025-03-14 19:00:33.528985: I tensorflow/core/platform/cpu_feature_guard.cc:210] This TensorFlow binary is optimized to use available CPU instructions in performance-critical operations.
To enable the following instructions: SSE4.1 SSE4.2 AVX AVX2 AVX512F AVX512_VNNI FMA, in other operations, rebuild TensorFlow with the appropriate compiler flags.
2025-03-14 19:00:33.528911: I tensorflow/core/platform/cpu_feature_guard.cc:210] This TensorFlow binary is optimized to use available CPU instructions in performance-critical operations.
To enable the following instructions: SSE4.1 SSE4.2 AVX AVX2 AVX512F AVX512_VNNI FMA, in other operations, rebuild TensorFlow with the appropriate compiler flags.
2025-03-14 19:00:33.578104: F external/local_tsl/tsl/platform/default/env.cc:74] Check failed: ret == 0 (11 vs. 0)Thread tf_numa_-1_Eigen creation via pthread_create() failed.
2025-03-14 19:00:33.578108: F external/local_tsl/tsl/platform/default/env.cc:74] Check failed: ret == 0 (11 vs. 0)Thread tf_numa_-1_Eigen creation via pthread_create() failed.
2025-03-14 19:00:33.578109: F external/local_tsl/tsl/platform/default/env.cc:74] Check failed: ret == 0 (11 vs. 0)Thread tf_numa_-1_Eigen creation via pthread_create() failed.

b. This is the error message in out.log, the LAMMPS output log:

WARNING: Triclinic box skew is large. LAMMPS will run inefficiently. (src/domain.cpp:221)
4 by 4 by 4 MPI processor grid
reading atoms ...
54 atoms
read_data CPU = 0.218 seconds
Summary of lammps deepmd module ...

Info of deepmd-kit:
installed to: /home/u2022170376/deepmd-kit
source:
source branch: HEAD
source commit: b1be266
source commit at: 2024-11-23 01:37:55 -0800
support model ver.: 1.1
build variant: cpu
build with tf inc: /home/u2022170376/deepmd-kit/lib/python3.12/site-packages/tensorflow/include;/home/u2022170376/deepmd-kit/lib/python3.12/site-packages/tensorflow/../../../../include
build with tf lib: /home/u2022170376/deepmd-kit/lib/python3.12/site-packages/tensorflow/libtensorflow_cc.so.2
build with pt lib: torch;torch_library;/home/u2022170376/deepmd-kit/lib/python3.12/site-packages/torch/lib/libc10.so
set tf intra_op_parallelism_threads: 0
set tf inter_op_parallelism_threads: 0
Info of lammps module:
use deepmd-kit at: /home/u2022170376/deepmd-kit
===================================================================================
= BAD TERMINATION OF ONE OF YOUR APPLICATION PROCESSES
= PID 28908 RUNNING AT cpu1
= EXIT CODE: 9
= CLEANING UP REMAINING PROCESSES
= YOU CAN IGNORE THE BELOW CLEANUP MESSAGES
===================================================================================
YOUR APPLICATION TERMINATED WITH THE EXIT STRING: Terminated (signal 15)
This typically refers to a problem with your application.
Please see the FAQ page for debugging suggestions

c. In addition, a large number of core. files were generated. They show only garbled characters when opened, and each one is over 600 MB in size.
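
The large files described in item c look like core dumps, which are binary memory images written when a process aborts, so they would not be readable as text. A minimal sketch, assuming a bash shell on Linux, of checking the core-dump size limit and disabling dumps for processes started from the same shell (whether this is desirable depends on whether the dumps are needed for debugging):

```shell
# Show the current core-dump size limit for this shell ("unlimited" or a block count).
ulimit -c

# Disable core dumps for this shell and anything it launches afterwards.
ulimit -c 0

# Confirm the new limit.
ulimit -c
```

Placed in the job script before the launch line, the second command would apply to the processes started from that script on the same node.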

2. Input files
a. This is my LAMMPS input file and Slurm script:

units metal
dimension 3
boundary p p p
atom_style atomic
timestep 0.001
neighbor 1.0 bin
neigh_modify every 1 delay 0 check yes
thermo 10000

read_data 123

pair_style deepmd graph-compress.pb
pair_coeff * *

variable N equal step
variable pote equal pe
variable kin equal ke
variable T equal temp
variable Press equal press
variable V equal vol

fix 1 all box/relax iso 100
min_style cg
minimize 1.0e-12 1.0e-12 10000 100000
unfix 1
reset_timestep 0

velocity all create 300 4928459 dist gaussian

#relax
restart 20000 restart.*.prepare
fix extra all print 20000 "${N} ${T} ${V} ${pote} ${kin} ${Press}" file data.prepare
dump 1 all custom 20000 dump.relax id element x y z
dump_modify 1 element Ti H

fix 1 all nvt temp 300 300 4
run 2000

unfix extra
undump 1
unfix 1

#!/bin/bash
#SBATCH --job-name=test
#SBATCH --partition=cpu-1
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=64
#SBATCH --error=%j.err
#SBATCH --output=%j.out
#SBATCH --comment=ddli_p1

#source /opt/intel/bin/compilervars.sh intel64

mpirun -n 64 lmp_mpi -in in.lammps > out.log

b. My structure file "123" contains only 54 atoms. Here are my parallelism-related settings:

export TF_ENABLE_ONEDNN_OPTS=0
export OMP_NUM_THREADS=1
#export DP_INTRA_OP_PARALLELISM_THREADS=8
#export DP_INTER_OP_PARALLELISM_THREADS=2

test.zip

njzjz · 2025-03-16T13:24:05Z

njzjz
Mar 16, 2025
Maintainer

See tensorflow/tensorflow#41532. It looks like you need to limit the number of threads to what your system allows. Also, if there are too many threads, they may compete with each other.
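
For context, the `(11 vs. 0)` in the `pthread_create()` failure is an error number; on Linux, 11 is `EAGAIN`, which `pthread_create()` returns when a thread or resource limit has been reached. Below is a minimal sketch for comparing a job's estimated thread demand with what the node offers, assuming bash on Linux. The names `RANKS`, `INTRA`, and `TOTAL` are illustrative, and the idea that each rank sizes its thread pool from the core count when `DP_INTRA_OP_PARALLELISM_THREADS` is unset or 0 is an assumption about the default behaviour, not something confirmed in this thread.

```shell
#!/bin/bash
# Hypothetical sanity check; not part of DeePMD-kit, LAMMPS, or TensorFlow.
RANKS=${RANKS:-64}        # MPI ranks, as in "mpirun -n 64"
CORES=$(nproc)            # logical cores visible to this shell

# Assumption: an unset or zero DP_INTRA_OP_PARALLELISM_THREADS lets each
# rank size its thread pool from the core count.
INTRA=${DP_INTRA_OP_PARALLELISM_THREADS:-0}
if [ "$INTRA" -eq 0 ]; then
    INTRA=$CORES
fi

TOTAL=$((RANKS * INTRA))

echo "logical cores:              $CORES"
echo "max user processes/threads: $(ulimit -u)"
echo "estimated worker threads:   $TOTAL"

if [ "$TOTAL" -gt "$CORES" ]; then
    echo "oversubscribed: about $TOTAL threads for $CORES cores"
fi
```

If the estimate approaches or exceeds the `ulimit -u` value, thread creation can fail outright; if it only exceeds the core count, the job may run but the threads compete for CPU time.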

2 replies

walk-for-me Mar 17, 2025
Author

Thank you. I checked carefully, and the test passed after I set these three environment variables: "export OMP_NUM_THREADS=1", "export DP_INTRA_OP_PARALLELISM_THREADS=8", and "export DP_INTER_OP_PARALLELISM_THREADS=2". The problem is solved for now.
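
A sketch of how the original Slurm script might look with these settings applied, assuming the variables are exported in the script before the launch line so that the ranks started on the node inherit them. The guard around `mpirun` is added here for illustration only, so the script exits cleanly when the binaries or the input file are missing; it is not part of the original script.

```shell
#!/bin/bash
#SBATCH --job-name=test
#SBATCH --partition=cpu-1
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=64
#SBATCH --error=%j.err
#SBATCH --output=%j.out
#SBATCH --comment=ddli_p1

# Thread settings reported to work in this thread, exported before the launch.
export TF_ENABLE_ONEDNN_OPTS=0
export OMP_NUM_THREADS=1
export DP_INTRA_OP_PARALLELISM_THREADS=8
export DP_INTER_OP_PARALLELISM_THREADS=2

# Illustrative guard: launch only when the tools and the input file exist.
if command -v mpirun >/dev/null 2>&1 \
    && command -v lmp_mpi >/dev/null 2>&1 \
    && [ -f in.lammps ]; then
    mpirun -n 64 lmp_mpi -in in.lammps > out.log
else
    echo "mpirun, lmp_mpi, or in.lammps not found; skipping launch"
fi
```

With 64 ranks, 8 intra-op threads per rank may still add up to more threads than the node has cores, depending on the hardware. Since the system has only 54 atoms, a smaller `-n` or smaller thread values may be worth testing.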

walk-for-me Mar 17, 2025
Author

Of course, if you think anything else should be improved, please tell me. Thank you!
