retrieveData() revision collisions, incomplete data #189

0UmfHxcvx5J7JoaOhFSs5mncnisTJJ6q · 2023-10-27T15:24:30Z

Well, it was bound to happen.

Yesterday, @JakobBD generated new REMIND input data using revision 6.58. That run finished around half past five.

SLURM Job Information
=======================
JobId=28328875 UserId=jakobdu(4731) GroupId=users(100) Name=rem-preprocessing 
JobState=COMPLETED Partition=priority TimeLimit=1440 StartTime=2023-10-26T09:58:56 
EndTime=2023-10-26T17:24:17 NodeList=cs-f14c01b08 NodeCnt=1 ProcCnt=1 
WorkDir=/p/tmp/jakobdu/pre-processing ReservationName= Gres= Account=rdev 
QOS=priority WcKey= Cluster=hlrs2015 SubmitTime=2023-10-26T09:58:44 
EligibleTime=2023-10-26T09:58:44 DerivedExitCode=0:0 ExitCode=0:0
=======================

At some point, another REMIND user (we do not yet know who) also set out to generate new input data with revision 6.58. Luckily, since it allowed us to detect the collision, that run apparently failed around nine in the evening, leaving an incomplete input data revision in its wake.

$ stat -c "%n %y" /p/projects/rd3mod/inputdata/output/rev6.58_*.tgz | column -t
/p/projects/rd3mod/inputdata/output/rev6.58_2b1450bc-980575d2_validationremind.tgz  2023-10-26  17:24:17.078817187  +0200
/p/projects/rd3mod/inputdata/output/rev6.58_2b1450bc_remind.tgz                     2023-10-26  17:16:24.872150471  +0200
/p/projects/rd3mod/inputdata/output/rev6.58_62eff8f7_remind.tgz                     2023-10-26  20:57:58.956467411  +0200
/p/projects/rd3mod/inputdata/output/rev6.58_62eff8f7_validationremind.tgz           2023-10-26  16:59:36.346072550  +0200

Several issues here are worth addressing:

1. Social mechanisms for blocking revision numbers obviously do not suffice. We have users run lastrev to check the next free revision number, but concurrently started jobs will still collide.
   The easiest solution, it seems to me, is to create empty files at the start of the run to block the revision.
2. retrieveData() will happily overwrite existing output. In terms of scientific reproducibility, that should not happen.
   Stopping if the files that would be written already exist would solve this (and contribute to revision blocking).
3. Somehow retrieveData() failed, but still wrote a legitimate-looking archive.
   I downloaded the original rev6.58_62eff8f7_remind.tgz file for testing before it was corrupted. The new archive is missing several files, rendering it unfit for use with REMIND:
```
$ comm -3 <( tar -tf ~/PIK/swap/inputdata/output/rev6.58_62eff8f7_remind.tgz | sort ) <( tar -tf ~/PIK/Remind_input/rev6.58_62eff8f7_remind.tgz | sort )
  ./config.rds
  ./f21_tau_fe_sub.cs4r
  ./f21_tau_fe_tax.cs4r
  ./f21_tax_convergence.cs4r
  ./f_gdp.cs3r
  ./f_lab.cs3r
  ./f_pop.cs3r
  ./p01_boundInvMacro.cs4r
  ./pm_shPPPMER.cs4r
  ./regionmappingH12.csv
```
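One way to keep an incomplete archive from ever appearing under its final name is to pack into a temporary file, check the member list, and only then publish it. The following is a minimal sketch of that idea in shell, not how retrieveData() works: the function name `publish_archive` and the manifest file (one expected member path per line, e.g. `./config.rds`) are assumptions made for illustration.

```shell
# Hypothetical helper, not part of madrat: pack SRC_DIR into FINAL_TGZ, but only
# publish the archive if every path listed in MANIFEST is present in it.
# usage: publish_archive SRC_DIR MANIFEST FINAL_TGZ
publish_archive() {
  src=$1; manifest=$2; final=$3
  # Build the archive under a temporary name next to the final location,
  # so nobody sees a half-written file under the final name.
  tmp=$(mktemp "${final}.partial.XXXXXX") || return 1
  list=$(mktemp) || { rm -f "$tmp"; return 1; }
  if ! tar -czf "$tmp" -C "$src" . ; then
    rm -f "$tmp" "$list"; return 1
  fi
  # Same comparison idea as the comm call above: collect manifest entries
  # that are absent from the archive listing.
  tar -tzf "$tmp" | sort > "$list"
  missing=$(sort "$manifest" | comm -23 - "$list")
  rm -f "$list"
  if [ -n "$missing" ]; then
    echo "archive incomplete, missing:" >&2
    echo "$missing" >&2
    rm -f "$tmp"; return 1
  fi
  chmod 644 "$tmp"   # mktemp creates the file readable by the owner only
  # ln creates the final name atomically and fails if it already exists,
  # so an existing revision is never overwritten.
  if ! ln "$tmp" "$final"; then
    rm -f "$tmp"; return 1
  fi
  rm -f "$tmp"
}
```

A call could look like `publish_archive ./staging manifest.txt /some/output/dir/rev6.58_62eff8f7_remind.tgz` (paths hypothetical). The hard link requires the temporary file and the final file to be on the same filesystem, which is why the temporary file is created next to the target.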
I have no idea how retrieveData() wrote an archive with missing files, but I think it should not do that. (It does not appear to have been a race condition where the first run deleted the files while the second was packing up its archive. All files from the second run show creation dates later than 19:00.)
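The first two points (blocking a revision with a file created at the start of the run, and stopping when output already exists) could be sketched in shell as follows. This is only an illustration under stated assumptions: the function names and the `rev<REV>.reserved` marker naming are hypothetical, and the exclusive-create guarantee of `noclobber` should be verified on the cluster's shared filesystem before relying on it.

```shell
# Hypothetical helpers, not part of madrat or the pre-processing scripts.

# reserve_revision DIR REV: returns 0 if this job reserved REV, 1 if it was taken.
reserve_revision() {
  marker="$1/rev$2.reserved"   # hypothetical marker naming scheme
  # With noclobber set, '>' refuses to write to an existing file, so of two
  # jobs started at the same time at most one gets the reservation.
  if ( set -o noclobber
       printf '%s %s\n' "${USER:-unknown}" "$(date -u +%FT%TZ)" > "$marker"
     ) 2>/dev/null; then
    return 0
  fi
  echo "revision $2 is already reserved: $(cat "$marker")" >&2
  return 1
}

# ensure_no_archives DIR REV: returns 1 if any archive for REV already exists.
ensure_no_archives() {
  for f in "$1"/rev"$2"_*.tgz; do
    [ -e "$f" ] || continue    # an unmatched glob stays literal; skip it
    echo "refusing to overwrite existing $f" >&2
    return 1
  done
  return 0
}
```

A run would call both before doing any work, for example `ensure_no_archives "$outdir" 6.58 && reserve_revision "$outdir" 6.58 || exit 1`, where `$outdir` stands for the output directory. Writing the user name and a timestamp into the marker also answers the "we do not know who" question after the fact.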


0UmfHxcvx5J7JoaOhFSs5mncnisTJJ6q · 2023-11-01T14:30:14Z

Log files of the colliding runs: /p/tmp/jakobdu/pre-processing/log-2023-10-26-28328875.out and /p/tmp/chengong/pre-processing/log-2023-10-26-28329766.out.

0UmfHxcvx5J7JoaOhFSs5mncnisTJJ6q added bug enhancement labels Oct 27, 2023

0UmfHxcvx5J7JoaOhFSs5mncnisTJJ6q assigned tscheypidi Oct 27, 2023
