Increasing performance in molecular dynamics calculation #1156
-
Hello. My system has around 1200 atoms. I set up a molecular dynamics simulation using xtb.
This job was run on my notebook, which has 20 cores. After launching the script, I changed the job to the highest priority. The screenshot below shows that all the processes are running. Is there a way to increase the job's performance so that the cores are used more efficiently? Best,
-
@icamps can you share your input coordinates, parameter file, and command line? Also, what does lscpu (or CPU-Z) report? Basically, what is the processor type?
-
@icamps thanks! What are the CPU specs from lscpu (Linux) or CPU-Z (Windows)? Maybe you are oversubscribing threads, which can lower overall performance: more threads assigned to xtb than actual CPU cores available. You show 20 threads or CPUs in the screenshot, but are they true CPU cores (AMD, Intel)? So what is the specific type of CPU? Plus, for scaling, benchmarking with smaller molecules (200 Da, 400 Da, 800 Da) while checking CPU utilization usually gives better insight. There could also be I/O overhead, unless you use an SSD or an SSD RAID (multiple SSDs set up as RAID).
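To check for oversubscription before launching, one can compare the physical core count to the logical CPU count and size the thread pool accordingly. A minimal sketch for Linux (the xtb command line at the end is a placeholder for the actual input and flags):

```shell
# Count physical cores vs. logical CPUs on Linux; hyperthreading
# inflates the logical count that tools like top display.
physical=$(lscpu -p=CORE | grep -v '^#' | sort -u | wc -l)
logical=$(nproc)
echo "physical=$physical logical=$logical"

# Pin xtb to the physical-core count to avoid oversubscription.
# xtb reads OMP_NUM_THREADS; OMP_STACKSIZE avoids stack overflows
# in larger systems (per the xtb documentation).
export OMP_NUM_THREADS="$physical,1"
export OMP_STACKSIZE=4G
# xtb coord.xyz --omd --input md.inp   # placeholder command line
```

On hybrid Intel CPUs (P-cores with hyperthreading plus E-cores without), the physical and logical counts will differ, which is exactly the situation where oversubscription is easy to trigger.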
-
Hey @tobigithub, the output of lscpu is below. Even though the notebook has only 14 physical cores, it can run 20 threads. In this case, I ran with 18 threads.
-
@icamps thanks for providing the input data. Multiple aspects come to my mind:
From the results with 96 true CPU cores available, it seems that the algorithm does not really benefit from more CPUs (one could, for example, run multiple molecules at the same time instead). The peak optimum is at about 1/4 of all CPUs used (possibly due to boost GHz), and when the defined NUM_THREADS exceeds the true number of CPU cores, there is a sudden drop in performance.
Yeah, final thought: this is a mini benchmark and there could be multiple errors in it, but figuring out the fastest overall way to run the program would be the most practical approach, I guess. Also, older AMD and Intel CPUs on an HPC compute cluster perform very well, even if they run at 2 GHz or so, simply because you can use hundreds of cores and just keep them running, or even run multiple experiments! So for such large molecules I would rather use a cluster.
The Intel i7-13650HX has 14 true CPU cores; see the Intel SKU [Link]
Using more than the provided true CPU cores does not always increase performance; let's say 1 core / 2 threads would not automatically double the speed. In many cases, when oversubscribing CPU cores, the algorithm can even become slower. There are exceptions with hybrid algorithms that use multiple different processes; there, hyperthreading can massively speed up sof…
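The oversubscription effect can be reproduced outside of xtb. A sketch that times a CPU-bound workload on a process pool below, at, and above the core count (the workload and sizes are arbitrary, chosen just to make the trend visible):

```python
import multiprocessing as mp
import time

def burn(n):
    # Small CPU-bound kernel: a tight arithmetic loop.
    s = 0
    for i in range(n):
        s += i * i
    return s

def wall_time(n_workers, tasks=32, work=200_000):
    """Time `tasks` CPU-bound jobs spread over `n_workers` processes."""
    t0 = time.perf_counter()
    with mp.Pool(n_workers) as pool:
        pool.map(burn, [work] * tasks)
    return time.perf_counter() - t0

if __name__ == "__main__":
    cores = mp.cpu_count()  # logical CPUs, not physical cores
    for n in (max(1, cores // 2), cores, 2 * cores):
        print(f"{n:3d} workers: {wall_time(n):6.2f} s")
```

Once every physical core is busy, adding workers only adds scheduling overhead, which is the same "sudden drop" seen in the xtb benchmark.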