Issues saving large number of equations #60

ajhoffman1229 · 2024-03-31T15:55:49Z

I've been running SISSO (v3.2 and v3.3) on my university's HPC. I've noticed that the program tends to crash with the errors below if I try to save large numbers of models. In particular, for desc_dim = 1 in SISSO.in, I have been trying to use nf_sis = 50000 and nmodels = 50000, but I often get these errors:
slurmstepd: error: poll(): Bad address
in the SLURM error file and

===================================================================================
=   BAD TERMINATION OF ONE OF YOUR APPLICATION PROCESSES
=   RANK 31 PID 400794 RUNNING AT node1126
=   KILLED BY SIGNAL: 9 (Killed)
===================================================================================

in the output of the run (there are many of these for different ranks and PIDs because I am running the code in parallel, and they all seem to fail simultaneously). Increasing the number of cores that I request (I have gone up to 128 cores, which is 4 full nodes on the cluster) does not seem to fix the problem, so I don't think it's a memory issue.

However, if I run the code with nf_sis = 1000, it finishes without any issues and in about 5 minutes. I am trying to follow the suggested approach in this paper, where the authors saved 100,000 equations and fit decision trees to those equations (although I have a larger set of training data for my SISSO run). I have attached both my SISSO.in and SLURM script (job.sh) below. Is there a way to reliably save many of these equations without the program crashing? Thank you!

SISSO.in:

!>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
! Below are the list of keywords for SISSO. Use exclamation mark,!,to comment out a line.
! The (R), (C) and (R&C) denotes the keyword to be used by regression, classification and both, respectively.
! More explanations on these keywords can be found in the SISSO_guide.pdf
! Users need to change the setting below according to your data and job.
!>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
ptype=2                 !Property type 1: regression, 2:classification.
ntask=1                 !(R&C) Multi-task learning (MTL) is invoked if >1.
!task_weighting=1        !(R) MTL 1: no weighting (tasks treated equally), 2: weighted by the # of samples.
!scmt=.false.            !(R) Sign-Constrained MTL is invoked if .true.
desc_dim=1              !(R&C) Dimension of the descriptor, a hyperparameter.
!nsample=5               !(R) Number of samples in train.dat. For MTL, set nsample=N1,N2,... for each task.
nsample=(5000,500)    !(C) Number of samples. For MTL, set nsample=(n1,n2,...),(m1,m2,...),... for each task.
restart=0               !(R&C) 0: starts from scratch, 1: continues the job(progress in the file CONTINUE)

!>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
! Feature construction (FC) and sure independence screening (SIS)
! Implemented operators:(+)(-)(*)(/)(exp)(exp-)(^-1)(^2)(^3)(sqrt)(cbrt)(log)(|-|)(scd)(^6)(sin)(cos)
! scd: standard Cauchy distribution
!>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
nsf=12                 !(R&C) Number of scalar features provided in the file train.dat
ops='(+)(-)(*)(/)(exp)(exp-)(^-1)(^2)(log)'     !(R&C) Please customize the operators from the list shown above.
fcomplexity=4          !(R&C) Maximal feature complexity (# of operators in a feature), integer usually 0 to 7.
funit=(1:3)(4:5)       !(R&C) (n1:n2): features from n1 to n2 in the train.dat have same units
fmax_min=1e-3          !(R&C) The feature will be discarded if the max. abs. value in it is < fmax_min.
fmax_max=1e8           !(R&C) The feature will be discarded if the max. abs. value in it is > fmax_max.
nf_sis=25000              !(R&C) Number of features in each of the SIS-selected subspace.

!>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
! Descriptor identification (DI) via sparse regression
!>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
method_so='L0'         !(R&C) 'L0' or 'L1L0'(LASSO+L0). The 'L0' is recommended for both ptype=1 and 2.
!nl1l0= 1               !(R) Only useful if method_so = 'L1L0', number of LASSO-selected features for the L0.
!fit_intercept=.true.   !(R) Fit to a nonzero (.true.) or zero (.false.) intercept for the linear model.
!metric='RMSE'          !(R) The metric for model selection in regression: RMSE or MaxAE (max absolute error)
nmodels=50000            !(R&C) Number of the top-ranked models to output (see the folder 'models')
isconvex=(1,1)         !(C) Each data group constrained to be convex domain, 1: YES; 0: NO
bwidth=0.001           !(C) Boundary tolerance for classification

job.sh:

#!/bin/bash
#SBATCH -n 64
#SBATCH -N 2
#SBATCH -t 36:00
#SBATCH --no-requeue
#SBATCH -o output_%j.log
#SBATCH -e error_%j.log
#SBATCH --exclusive
#SBATCH -J sisso
#SBATCH --mem-per-cpu=3800
#SBATCH --exclusive

source ~/.bashrc

echo "Loading intel module" >> log
module load intel

echo "Loading impi module" >> log
module load impi

echo "nodes: $SLURM_JOB_NUM_NODES hosts: $SLURM_JOB_NODELIST" >> log

echo "Start time: $( date )" >> log
mpirun -np 64 SISSO >> log
echo "End time: $( date )" >> log


rouyang2017 · 2024-04-01T01:30:48Z

Hi, that looks like an error due to insufficient memory when a large number of features (large nf_sis) are to be saved. In the current implementation, each feature is stored as a column of data. While this keeps the calculation fast, it can easily demand a huge amount of memory when the training data or the feature space is large. We are considering removing this difficulty in future versions by storing features as expression trees instead of data. For now, if a large nf_sis is needed, I suggest using fewer cores per node, e.g. mpirun -np 32 SISSO > log.
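To get a feel for the memory demand described above, here is a rough back-of-envelope sketch in bash. It assumes each feature is a column of 8-byte reals with one value per training sample, and it counts only a single copy of the SIS-selected features. The 8-byte size, the helper name feature_matrix_mib, and the idea that one MPI process may hold a full copy are assumptions for illustration, not details confirmed for SISSO.

```shell
#!/bin/bash
# Rough size of one dense feature matrix in MiB (integer division, rounds down).
# Assumption: every feature is a column of nsample 8-byte reals.
# feature_matrix_mib is a hypothetical helper, not part of SISSO.
feature_matrix_mib() {
    local nsample=$1 nf_sis=$2 bytes_per_value=8
    echo $(( nsample * nf_sis * bytes_per_value / 1024 / 1024 ))
}

# nsample=(5000,500) in the SISSO.in above gives 5500 samples in total.
nsample=$(( 5000 + 500 ))

# Compare the nf_sis values mentioned in this thread.
for nf_sis in 1000 25000 50000; do
    echo "nf_sis=$nf_sis -> about $(feature_matrix_mib "$nsample" "$nf_sis") MiB per copy"
done
```

Under these assumptions, a single copy at nf_sis = 50000 is roughly 2 GiB, while job.sh allows 3800 MB per CPU, so a process that needed a second working copy would already exceed its share. Running fewer MPI processes on the same allocation leaves more memory available to each one.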

ajhoffman1229 · 2024-04-01T11:58:04Z

Thank you for the quick reply! ~~When you suggest using mpirun -np 32 SISSO, you mean without otherwise altering the total number of cores?~~

Update/Edit: This change seemed to work. I kept the total number of cores identical (-N 2 and -n 64 in the SLURM script) and changed the number of parallel processes in the bash script to mpirun -np 16 SISSO > log. The job ran without issues and saved 30,000 equations. Thank you for your help! I will close this issue now.
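For readers who want to reproduce the workaround, a minimal sketch of the adjusted job.sh might look like the following. The allocation lines are copied from the original script, and only the number of MPI processes changes. NP and LAUNCH are hypothetical variables introduced here, and the fallback branch that only prints the command is an addition so the sketch can be tried outside the cluster. How the processes are spread across the two nodes depends on the MPI launcher and is not specified in this thread.

```shell
#!/bin/bash
#SBATCH -N 2                  # same allocation as the original job.sh
#SBATCH -n 64
#SBATCH --mem-per-cpu=3800

# NP is a hypothetical variable: the number of MPI processes to launch.
# It is deliberately smaller than the 64 allocated cores so that each
# process has more memory available; 16 is the value reported to work here.
NP="${NP:-16}"
LAUNCH="mpirun -np $NP SISSO"

if command -v mpirun >/dev/null 2>&1 && command -v SISSO >/dev/null 2>&1; then
    # On the cluster: run SISSO and capture its output in the file "log".
    $LAUNCH > log
else
    # Elsewhere: only show the command that would be run.
    echo "would run: $LAUNCH"
fi
```

A different process count can be tried without editing the script by setting NP in the environment at submission time, e.g. NP=32 sbatch job.sh, assuming the cluster's sbatch exports the submitting environment to the job (the SLURM default).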

ajhoffman1229 closed this as completed Apr 1, 2024
