
NaN error and integer overflow in number of atoms #12


Description

@minitu

Hello,
I'd like to report two errors that I observed when running the MPI + Kokkos version of miniMD (miniMD/kokkos).

The first error is that the T (temperature) and P (pressure) values show up as NaN, which causes some kernels to run abnormally fast.
The configuration is as follows, executed on 32 nodes of OLCF Summit:

$ jsrun -n192 -a1 -c1 -g1 -K3 -r6 -M -gpu ./miniMD -i in.lj.miniMD -gn 0 -nx 768 -ny 768 -nz 384 -n 100
# Create System:
# Done ....
# miniMD-Reference 1.2 (MPI+OpenMP) output ...
# Run Settings:
        # MPI processes: 192
        # Host Threads: 1
        # Inputfile: ../inputs/in.lj.miniMD
        # Datafile: None
# Physics Settings:
        # ForceStyle: LJ
        # Force Parameters: 1.00 1.00
        # Units: LJ
        # Atoms: 905969664
        # Atom types: 8
        # System size: 1289.93 1289.93 644.96 (unit cells: 768 768 384)
        # Density: 0.844200
        # Force cutoff: 2.500000
        # Timestep size: 0.005000
# Technical Settings:
        # Neigh cutoff: 2.800000
        # Half neighborlists: 1
        # Team neighborlist construction: 0
        # Neighbor bins: 460 460 230
        # Neighbor frequency: 1000
        # Sorting frequency: 1000
        # Thermo frequency: 100
        # Ghost Newton: 0 
        # Use intrinsics: 0
        # Do safe exchange: 0
        # Size of float: 8

# Starting dynamics ...   
# Timestep T U P Time
0 nan -6.773368e+00 nan  0.000
100 nan 0.000000e+00 nan  1.138


# Performance Summary:
# MPI_proc OMP_threads nsteps natoms t_total t_force t_neigh t_comm t_other performance perf/thread grep_string t_extra
192 1 100 905969664 1.137955 0.050640 0.000000 0.671161 0.416153 79613833194.819092 414655381.223016 PERF_SUMMARY 0.000000
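
For reference: the printed atom count 905969664 (= 4 × 768 × 768 × 384) still fits in a signed 32-bit int, but 3 × natoms (the degrees-of-freedom count that typically normalizes the temperature in LJ units) is 2717908992, which exceeds INT32_MAX (2147483647) and wraps negative in 32-bit arithmetic; taking a square root of the resulting negative scale factor would come out NaN. Below is a minimal sketch of that suspected failure mode, with hypothetical variable names; I have not confirmed this is the exact expression miniMD uses.

#include <cstdint>
#include <cstdio>
#include <cmath>

int main() {
  int natoms = 905969664;        // fits in int32 (INT32_MAX = 2147483647)

  // Hypothetical 32-bit degrees-of-freedom count: 3 * natoms exceeds
  // INT32_MAX; signed overflow is UB, but in practice wraps negative.
  int dof32 = 3 * natoms;
  double scale32 = 1.0 / dof32;                    // negative scale factor
  printf("32-bit dof: %d, sqrt(scale) = %f\n",
         dof32, std::sqrt(scale32));               // prints nan

  // Widening to 64 bits before multiplying avoids the wraparound.
  int64_t dof64 = 3LL * natoms;                    // 2717908992
  printf("64-bit dof: %lld, scale = %e\n",
         (long long)dof64, 1.0 / (double)dof64);
  return 0;
}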

The second error is an integer overflow in the total number of atoms with large problem sizes (see the arithmetic sketch after the log below):

$ jsrun -n1536 -a1 -c1 -g1 -K3 -r6 -M -gpu ./miniMD -i in.lj.miniMD -gn 0 -nx 1536 -ny 1536 -nz 768 -n 100
# Create System:
# Done ....
# miniMD-Reference 1.2 (MPI+OpenMP) output ...
# Run Settings:
        # MPI processes: 1536
        # Host Threads: 1
        # Inputfile: ../inputs/in.lj.miniMD
        # Datafile: None
# Physics Settings:
        # ForceStyle: LJ
        # Force Parameters: 1.00 1.00
        # Units: LJ
        # Atoms: -1342177280
        # Atom types: 8
        # System size: 2579.86 2579.86 1289.93 (unit cells: 1536 1536 768)
        # Density: 0.844200
        # Force cutoff: 2.500000
        # Timestep size: 0.005000
# Technical Settings:
        # Neigh cutoff: 2.800000
        # Half neighborlists: 1
        # Team neighborlist construction: 0
        # Neighbor bins: 921 921 460
        # Neighbor frequency: 1000
        # Sorting frequency: 1000
        # Thermo frequency: 100
        # Ghost Newton: 0
        # Use intrinsics: 0
        # Do safe exchange: 0
        # Size of float: 8

# Starting dynamics ...
# Timestep T U P Time
0 1.440000e+00 3.657619e+01 -6.220309e+00  0.000
100 1.435069e+00 3.657569e+01 -6.219723e+00  2.041


# Performance Summary:
# MPI_proc OMP_threads nsteps natoms t_total t_force t_neigh t_comm t_other performance perf/thread grep_string t_extra
1536 1 100 -1342177280 2.040788 0.056916 0.000000 0.852597 1.131275 -65767589680.726250 -42817441.198389 PERF_SUMMARY 0.000000
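
Here the expected count is 4 × 1536 × 1536 × 768 = 7247757312 atoms, and 7247757312 mod 2^32 interpreted as a signed 32-bit value is exactly the printed -1342177280, which is consistent with the product being accumulated in a signed 32-bit int. Below is a minimal sketch of the wraparound and a 64-bit fix, assuming the count is formed as 4 * nx * ny * nz (hypothetical names, not necessarily the ones in the miniMD source).

#include <cstdint>
#include <cstdio>

int main() {
  int nx = 1536, ny = 1536, nz = 768;

  // 32-bit product: 4 * 1536 * 1536 * 768 = 7247757312 exceeds INT32_MAX.
  // Signed overflow is UB, but in practice it wraps to -1342177280,
  // matching the "# Atoms:" line and the natoms column in the summary.
  int natoms32 = 4 * nx * ny * nz;
  printf("32-bit natoms: %d\n", natoms32);

  // Promoting the first factor to 64 bits widens the whole product.
  int64_t natoms64 = 4LL * nx * ny * nz;
  printf("64-bit natoms: %lld\n", (long long)natoms64);  // 7247757312
  return 0;
}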
