Dear genproduction maintainers,
we have been trying for some months to generate gridpacks for two VBS processes with EFT contributions.
The processes are: VBS OSWW in the fully leptonic final state [1] and VBS WV in the semileptonic final state [2].
As support to this thread we prepared some slides with all the tests we did, which can be found here.
We are using cmsconnect and we modified the central scripts only to download the correct UFO models here and to avoid memory-related compilation issues here.
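For reference, generation goes through the standard genproductions entry point (a minimal sketch; the card directory below is just a placeholder, and on cmsconnect we go through the corresponding submit wrapper with the modifications linked above):
./gridpack_generation.sh WmVjj_ewk_dim6 cards/WmVjj_ewk_dim6 local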
While submitting the integrate steps on condor we found that multiple jobs go into the hold state with hold code 13 and subcode 2 (condor_starter or shadow failed to send job) and eventually the computation stops:
INFO: ClusterId 12046135 was held with code 13, subcode 2. Releasing it.
INFO: ClusterId 12046139 was held with code 13, subcode 2. Releasing it.
INFO: ClusterId 12046143 was held with code 13, subcode 2. Releasing it.
INFO: ClusterId 12046151 was held with code 13, subcode 2. Releasing it.
INFO: ClusterId 12046040 was held with code 13, subcode 2. Releasing it.
INFO: ClusterId 12034792 was held with code 13, subcode 2. Releasing it.
INFO: ClusterId 12034924 was held with code 13, subcode 2. Releasing it.
WARNING: ClusterId 12034926 with HoldReason: Error from [email protected]: Failed to execute '/data/02/tmp/execute/dir_9531/glide_epkWgb/condor_job_wrapper.sh' with arguments /data/02/tmp/execute/dir_9531/glide_epkWgb/execute/dir_36287/condor_exec.exe 0 2223.239 2223.241: (errno=2: 'No such file or directory')
....
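For reference, this is roughly how the held jobs can be inspected on the cmsconnect schedd (sketch only; the attributes are standard HTCondor job ClassAds, and the release is what the central script already does automatically, as shown in the log above):
condor_q -hold -af ClusterId HoldReasonCode HoldReasonSubCode HoldReason
condor_release <ClusterId>   # re-queue a held cluster once the cause is understood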
Running the same cards locally, we see that generation also fails there, as survey.sh ends with non-zero status 137:
Generating gridpack with run name pilotrun
survey pilotrun --accuracy=0.01 --points=2000 --iterations=8 --gridpack=.true.
INFO: compile directory
INFO: Using LHAPDF v6.2.1 interface for PDFs
compile Source Directory
Using random number seed offset = 21
INFO: Running Survey
Creating Jobs
Working on SubProcesses
INFO: P1_qq_lvlqqqq
INFO: Idle: 2080, Running: 48, Completed: 0 [ current time: 19h14 ]
rm: cannot remove 'results.dat': No such file or directory
ERROR DETECTED
WARNING: program /scratch/gboldrin/gp3/bin/MadGraph5_aMCatNLO/WmVjj_ewk_dim6/WmVjj_ewk_dim6_gridpack/work/processtmp/SubProcesses/survey.sh 0 89 90 launch ends with non zero status: 137. Stop all computation
From a quick search online we found that this could point to a memory issue, i.e. the survey.sh job requires too much RAM and a SIGKILL is issued by the OS (status 137 = 128 + 9).
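One way we could check this hypothesis (sketch only; re-invoking survey.sh standalone with the arguments shown in the warning above assumes the same SubProcesses working directory):
echo $((137 - 128))                                      # -> 9, i.e. SIGKILL
/usr/bin/time -v ./survey.sh 0 89 90 2> survey_mem.log   # GNU time, records peak RSS
grep 'Maximum resident set size' survey_mem.log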
Is this behaviour known? Maybe we are doing something wrong.
Do you have any tests we can carry out in order to better understand this issue?
Thank you for your time.
Best,
Giacomo
[1]
import model SMEFTsim_U35_MwScheme_UFO-cW_cHWB_cHDD_cHbox_cHW_cHl1_cHl3_cHq1_cHq3_cqq1_cqq11_cqq31_cqq3_cll_cll1_massless
define p = g u c d s b u~ c~ d~ s~ b~
define j = g u c d s b u~ c~ d~ s~ b~
define l+ = e+ mu+ ta+
define l- = e- mu- ta-
define vl = ve vm vt
define vl~ = ve~ vm~ vt~
generate p p > l+ vl l- vl~ j j SMHLOOP=0 QCD=0 NP=1
output WWjjTolnulnu_OS_ewk_dim6
[2]
import model SMEFTsim_U35_MwScheme_UFO-cW_cHWB_cqq1_cqq3_cqq11_cqq31_massless
define p = g u d s c b u~ d~ s~ c~ b~
define j = p
define l+ = e+ mu+ ta+
define l- = e- mu- ta-
define vl = ve vm vt
define vl~ = ve~ vm~ vt~
generate p p > l- vl~ j j j j NP=1 SMHLOOP=0 QCD=0
output WmVjj_ewk_dim6 -nojpeg
Hi @sansan9401, well, the condor problem seems to come and go: sometimes we are able to run integrate perfectly, sometimes it crashes as above. We did not understand where the problem comes from, but it is probably related to which Tier site the jobs land on (does that make sense? I refer to the [email protected] part of the hold message, where the institution might change from run to run).
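One thing we can try in order to confirm this (sketch only; the attributes are standard HTCondor ones) is to correlate failures with the execute host recorded for past jobs:
condor_history -limit 100 -af ClusterId LastRemoteHost ExitCode HoldReasonCode HoldReasonSubCode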
For the local submission we have no solution; however, we found that even when the integrate step is successful, the computation of the second reweighting matrix element is killed by the OS due to memory pressure. We reported that issue here (where, by the way, you can see the integrate step running successfully). So the local run of the integrate step is probably subject to the same problem as the reweighting.
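As a cross-check (sketch only; the process name pattern is an assumption and may differ in your MadGraph version), one can watch the resident-memory high-water mark of the offending step and then look for the kernel OOM record after it dies:
PID=$(pgrep -f madevent | head -n 1)             # assumed process name
while kill -0 "$PID" 2>/dev/null; do
    grep VmHWM /proc/$PID/status                 # peak resident memory so far
    sleep 30
done
dmesg | grep -i -E 'killed process|out of memory'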