retrieveData() revision collisions, incomplete data #189

Open
3 tasks
0UmfHxcvx5J7JoaOhFSs5mncnisTJJ6q opened this issue Oct 27, 2023 · 1 comment

Well, it was bound to happen.

Yesterday, @JakobBD generated new REMIND input data using revision 6.58. That run finished around half past five in the afternoon.

SLURM Job Information
=======================
JobId=28328875 UserId=jakobdu(4731) GroupId=users(100) Name=rem-preprocessing 
JobState=COMPLETED Partition=priority TimeLimit=1440 StartTime=2023-10-26T09:58:56 
EndTime=2023-10-26T17:24:17 NodeList=cs-f14c01b08 NodeCnt=1 ProcCnt=1 
WorkDir=/p/tmp/jakobdu/pre-processing ReservationName= Gres= Account=rdev 
QOS=priority WcKey= Cluster=hlrs2015 SubmitTime=2023-10-26T09:58:44 
EligibleTime=2023-10-26T09:58:44 DerivedExitCode=0:0 ExitCode=0:0
=======================

At some point, another REMIND user (we do not know who yet) also set out to generate new input data with revision 6.58. Luckily, since that is how we were able to detect it, that run apparently failed around nine in the evening, leaving an incomplete input data revision in its wake.

$ stat -c "%n %y" /p/projects/rd3mod/inputdata/output/rev6.58_*.tgz | column -t
/p/projects/rd3mod/inputdata/output/rev6.58_2b1450bc-980575d2_validationremind.tgz  2023-10-26  17:24:17.078817187  +0200
/p/projects/rd3mod/inputdata/output/rev6.58_2b1450bc_remind.tgz                     2023-10-26  17:16:24.872150471  +0200
/p/projects/rd3mod/inputdata/output/rev6.58_62eff8f7_remind.tgz                     2023-10-26  20:57:58.956467411  +0200
/p/projects/rd3mod/inputdata/output/rev6.58_62eff8f7_validationremind.tgz           2023-10-26  16:59:36.346072550  +0200

Several issues here are worth addressing:

  • Social mechanisms for blocking revision numbers obviously do not suffice. We have users run lastrev to find the next free revision number, but jobs started concurrently will still collide.
    The easiest solution, it seems to me, is to create empty files at the start of the run to block the revision number.
  • retrieveData() will happily overwrite existing output. In terms of scientific reproducibility, that should not happen.
    Stopping if the files that would be written already exist would solve this (and contribute to revision blocking).
  • Somehow retrieveData() failed, but still wrote a legitimate-looking archive.
    I downloaded the original rev6.58_62eff8f7_remind.tgz file for testing before it was corrupted. The new archive is missing several files, rendering it unfit for use with REMIND:
    $ comm -3 <( tar -tf ~/PIK/swap/inputdata/output/rev6.58_62eff8f7_remind.tgz | sort ) <( tar -tf ~/PIK/Remind_input/rev6.58_62eff8f7_remind.tgz | sort )
      ./config.rds
      ./f21_tau_fe_sub.cs4r
      ./f21_tau_fe_tax.cs4r
      ./f21_tax_convergence.cs4r
      ./f_gdp.cs3r
      ./f_lab.cs3r
      ./f_pop.cs3r
      ./p01_boundInvMacro.cs4r
      ./pm_shPPPMER.cs4r
      ./regionmappingH12.csv
    
    I have no idea how retrieveData() wrote an archive with missing files, but I think it should not do that. (It does not appear to have been a race condition where the first run deleted the files while the second was packing up its archive. All files from the second run show creation dates later than 19:00.)
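
To make the first two points concrete, here is a minimal sketch of the "create an empty file to block the name" idea, plus a check for archive members that should be present. It is written in Python purely for illustration; it is not how retrieveData() works, and the function names `claim_output` and `missing_members` are hypothetical. It assumes the file system holding the output directory honours exclusive file creation.

```python
import os
import tarfile


def claim_output(path):
    """Create an empty placeholder at `path`, failing if it already exists.

    Hypothetical helper: O_CREAT | O_EXCL turns "check, then create" into a
    single step, so of several concurrently started jobs only one can claim
    a given output name. The others get a FileExistsError.
    """
    fd = os.open(path, os.O_CREAT | os.O_EXCL | os.O_WRONLY, 0o644)
    os.close(fd)


def missing_members(archive, expected):
    """Return the names in `expected` that the tar archive does not contain.

    Hypothetical helper: `expected` has to use the same spelling as the
    archive (for example with or without a leading "./").
    """
    with tarfile.open(archive) as tar:
        present = set(tar.getnames())
    return sorted(set(expected) - present)
```

A run would call `claim_output()` for every file it intends to write before doing any work and stop on `FileExistsError`. Before an archive is put in place, `missing_members()` could compare it against a list of required files (where that list would come from is left open here) and abort if the result is not empty.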

0UmfHxcvx5J7JoaOhFSs5mncnisTJJ6q commented Nov 1, 2023

Log files of the colliding runs: /p/tmp/jakobdu/pre-processing/log-2023-10-26-28328875.out and /p/tmp/chengong/pre-processing/log-2023-10-26-28329766.out.
