Issues saving large number of equations #60

Closed
ajhoffman1229 opened this issue Mar 31, 2024 · 2 comments

@ajhoffman1229

I've been running SISSO (v3.2 and v3.3) on my university's HPC. I've noticed that the program tends to crash with the errors below if I try to save large numbers of models. In particular, for desc_dim = 1 in SISSO.in, I have been trying to use nf_sis = 50000 and nmodels = 50000, but I often get these errors:
slurmstepd: error: poll(): Bad address
in the SLURM error file and

===================================================================================
=   BAD TERMINATION OF ONE OF YOUR APPLICATION PROCESSES
=   RANK 31 PID 400794 RUNNING AT node1126
=   KILLED BY SIGNAL: 9 (Killed)
===================================================================================

in the output of the run (there are many of these for different ranks and PIDs because I am running the code in parallel, and they all seem to fail simultaneously). Increasing the number of cores that I request (I have gone up to 128 cores, which is 4 full nodes on the cluster) does not fix the problem, so I don't think it's a memory issue.

However, if I run the code with nf_sis = 1000, it finishes without any issues and in about 5 minutes. I am trying to follow the suggested approach in this paper, where the authors saved 100,000 equations and fit decision trees to those equations (although I have a larger set of training data for my SISSO run). I have attached both my SISSO.in and SLURM script (job.sh) below. Is there a way to reliably save many of these equations without the program crashing? Thank you!

SISSO.in:

!>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
! Below are the list of keywords for SISSO. Use exclamation mark,!,to comment out a line.
! The (R), (C) and (R&C) denotes the keyword to be used by regression, classification and both, respectively.
! More explanations on these keywords can be found in the SISSO_guide.pdf
! Users need to change the setting below according to your data and job.
!>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
ptype=2                 !Property type 1: regression, 2:classification.
ntask=1                 !(R&C) Multi-task learning (MTL) is invoked if >1.
!task_weighting=1        !(R) MTL 1: no weighting (tasks treated equally), 2: weighted by the # of samples.
!scmt=.false.            !(R) Sign-Constrained MTL is invoked if .true.
desc_dim=1              !(R&C) Dimension of the descriptor, a hyperparameter.
!nsample=5               !(R) Number of samples in train.dat. For MTL, set nsample=N1,N2,... for each task.
nsample=(5000,500)    !(C) Number of samples. For MTL, set nsample=(n1,n2,...),(m1,m2,...),... for each task.
restart=0               !(R&C) 0: starts from scratch, 1: continues the job(progress in the file CONTINUE)

!>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
! Feature construction (FC) and sure independence screening (SIS)
! Implemented operators:(+)(-)(*)(/)(exp)(exp-)(^-1)(^2)(^3)(sqrt)(cbrt)(log)(|-|)(scd)(^6)(sin)(cos)
! scd: standard Cauchy distribution
!>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
nsf=12                 !(R&C) Number of scalar features provided in the file train.dat
ops='(+)(-)(*)(/)(exp)(exp-)(^-1)(^2)(log)'     !(R&C) Please customize the operators from the list shown above.
fcomplexity=4          !(R&C) Maximal feature complexity (# of operators in a feature), integer usually 0 to 7.
funit=(1:3)(4:5)       !(R&C) (n1:n2): features from n1 to n2 in the train.dat have same units
fmax_min=1e-3          !(R&C) The feature will be discarded if the max. abs. value in it is < fmax_min.
fmax_max=1e8           !(R&C) The feature will be discarded if the max. abs. value in it is > fmax_max.
nf_sis=25000              !(R&C) Number of features in each of the SIS-selected subspace.

!>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
! Descriptor identification (DI) via sparse regression
!>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
method_so='L0'         !(R&C) 'L0' or 'L1L0'(LASSO+L0). The 'L0' is recommended for both ptype=1 and 2.
!nl1l0= 1               !(R) Only useful if method_so = 'L1L0', number of LASSO-selected features for the L0.
!fit_intercept=.true.   !(R) Fit to a nonzero (.true.) or zero (.false.) intercept for the linear model.
!metric='RMSE'          !(R) The metric for model selection in regression: RMSE or MaxAE (max absolute error)
nmodels=50000            !(R&C) Number of the top-ranked models to output (see the folder 'models')
isconvex=(1,1)         !(C) Each data group constrained to be convex domain, 1: YES; 0: NO
bwidth=0.001           !(C) Boundary tolerance for classification

job.sh

#!/bin/bash
#SBATCH -n 64
#SBATCH -N 2
#SBATCH -t 36:00
#SBATCH --no-requeue
#SBATCH -o output_%j.log
#SBATCH -e error_%j.log
#SBATCH --exclusive
#SBATCH -J sisso
#SBATCH --mem-per-cpu=3800

source ~/.bashrc

echo "Loading intel module" >> log
module load intel

echo "Loading impi module" >> log
module load impi

echo "nodes: $SLURM_JOB_NUM_NODES hosts: $SLURM_JOB_NODELIST" >> log

echo "Start time: $( date )" >> log
mpirun -np 64 SISSO >> log
echo "End time: $( date )" >> log
@rouyang2017
Owner

rouyang2017 commented Apr 1, 2024

Hi, that looks like an error due to insufficient memory when a large number of features (large nf_sis) are to be saved. In the current implementation, each feature is stored as a column of data. While this ensures fast calculation, it can easily demand a huge amount of memory when the training data or the feature space is large. We are considering removing this difficulty in future versions by storing features as expression trees instead of data. For now, I suggest using fewer cores per node, e.g. mpirun -np 32 SISSO > log, if a large nf_sis is needed.
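
To get a feel for why column storage becomes expensive, here is a rough back-of-the-envelope sketch in bash. It assumes each stored value takes 8 bytes (double precision) and that one dense copy of the SIS subspace holds nsample x nf_sis values. Neither assumption is confirmed in this thread, and the function name `feature_matrix_mib` is made up for illustration, so treat the numbers as an order-of-magnitude guide rather than SISSO's actual footprint.

```shell
#!/bin/bash
# Rough size, in MiB, of one dense feature matrix (nsample rows x nf_sis columns).
# ASSUMPTION: 8 bytes per value (double precision); not confirmed in this thread.
# feature_matrix_mib is a hypothetical helper, not part of SISSO.
feature_matrix_mib() {
    local nsample=$1 nf_sis=$2 bytes_per_value=${3:-8}
    echo $(( nsample * nf_sis * bytes_per_value / 1024 / 1024 ))
}

# nsample=(5000,500) in SISSO.in gives 5500 samples in total.
feature_matrix_mib 5500 50000   # nf_sis = 50000, the setting that crashed
feature_matrix_mib 5500 1000    # nf_sis = 1000, the setting that finished
```

Under these assumptions a single copy at nf_sis = 50000 is already about 2 GiB, which is more than half of the 3800 MB that `--mem-per-cpu=3800` gives each process, before counting any working arrays. At nf_sis = 1000 the same copy is only a few tens of MiB.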

@ajhoffman1229
Author

ajhoffman1229 commented Apr 1, 2024

Thank you for the quick reply! When you suggest using mpirun -np 32 SISSO, you mean without otherwise altering the total number of cores?

Update/Edit: This change seemed to work. I kept the total number of cores identical (-N 2 and -n 64 in the SLURM script) and changed the number of parallel processes in the bash script with mpirun -np 16 SISSO > log. The job ran without issues and saved 30,000 equations. Thank you for your help! I will close this issue now.
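
A plausible reading of why this works: the SLURM reservation stays the same size, but fewer processes share it, so each one has more memory. The sketch below illustrates that arithmetic. It assumes the memory reserved on a node is `cpus_per_node * mem_per_cpu` and that the ranks on that node share it evenly. How `mpirun` actually places the ranks across the two nodes is not stated in this thread, and `mem_per_rank_mb` is a hypothetical helper name.

```shell
#!/bin/bash
# Memory available to each MPI rank on one node, in MB.
# ASSUMPTION: Slurm reserves cpus_per_node * mem_per_cpu on the node and the
# ranks running there share it evenly. Actual rank placement depends on the
# MPI launcher and is not stated in this thread.
# mem_per_rank_mb is a hypothetical helper, not a Slurm or SISSO command.
mem_per_rank_mb() {
    local cpus_per_node=$1 mem_per_cpu_mb=$2 ranks_on_node=$3
    echo $(( cpus_per_node * mem_per_cpu_mb / ranks_on_node ))
}

# job.sh requests -n 64 over -N 2 (32 CPUs per node) with --mem-per-cpu=3800.
mem_per_rank_mb 32 3800 32   # original run: one rank per reserved CPU
mem_per_rank_mb 32 3800 16   # 16 ranks on a node, half the CPUs idle
mem_per_rank_mb 32 3800 8    # 16 ranks spread evenly over both nodes
```

The trade-off is that idle cores are still charged to the job, so this buys memory headroom at the cost of parallel efficiency.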
