rt.sh job dependency #2400
-
I am testing new code that is going to change results for all or most of the jobs. https://github.com/jiandewang/ufs-weather-model/blob/develop/tests/rt.conf#L28-L32 In the lines above, the first 3 jobs were launched (and failed to match the current BL), but the last one was not. Does this mean that when there are BL changes, a job that is not marked with "baseline" or a dependency here will not be launched at all?
-
@jiandewang Tests with something in the final column (i.e., tests that depend on another test) should not run if that test fails. For example, the restart test (ufs-weather-model/tests/tests/cpld_restart_gfsv17, lines 7 to 9 in 6d0454c) compares against the control, so if the control fails, the restart test should not even run. It does seem that the dependency for the mpi gfsv17 test is not set correctly (ufs-weather-model/tests/rt.conf, line 32 in 6d0454c). It should be dependent on the control test; see ufs-weather-model/tests/tests/cpld_mpi_gfsv17, lines 7 to 9 in 6d0454c.
-
@DeniseWorthen I repeated last night's run twice today without changing anything, and this time the mpi job was launched. I don't know why it was not launched in my first try last night. Maybe a computer issue? I think the mpi run should depend on the control's result for comparison. By that I mean it should be compared with the new result of my fresh control run, not with the existing BL. What happens now is that the mpi results are compared with the current BL. This is fine if my testing has no answer changes, but the code I am testing now is supposed to change answers. Compared this way (against the existing BL), the mpi job will always fail.
-
@jiandewang If you want to test that the mpi test passes against control you'll need to create a new baseline and run against that.
-
@DeniseWorthen Yes, that's the way (-c) I am going to do it.
-
@jiandewang When you run rt.sh in baseline create mode ( The label in the last column ('dependency') has nothing to do with whether or not a test will be run. It only determines when the test is going to be run. A test with a dependency is executed only after the test it depends on has run successfully. When you run rt in 'verify mode' i.e. without In your specific example, assuming you have a .conf file, let's say named 'my.conf':
When you run:
Only the compile job and two tests (cpld_control_gfsv17 and cpld_control_gfsv17_iau) will be executed. Then if you run:
the compile job and all 4 tests will be executed. cpld_control_gfsv17 and cpld_mpi_gfsv17 can be executed in any order; they do not depend on any other tests. cpld_control_gfsv17_iau and cpld_restart_gfsv17 will be executed after cpld_control_gfsv17. Yes, if cpld_control_gfsv17 fails to verify against its own (updated) baselines, cpld_mpi_gfsv17 will most probably fail too, because they both use the same baselines.
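The selection rule described above can be sketched in a few lines of shell. This is only an illustration, not how rt.sh is implemented: the conf content is hypothetical, the column layout (RUN | test name | machines | baseline flag | dependency) is an assumption, and the helper name `select_tests` is made up for this example. It never calls rt.sh itself.

```shell
# Hypothetical conf modeled on the four tests discussed in this thread.
# Assumed columns: RUN | test name | machines | baseline flag | dependency
conf='RUN | cpld_control_gfsv17     | | baseline |
RUN | cpld_control_gfsv17_iau | | baseline | cpld_control_gfsv17
RUN | cpld_restart_gfsv17     | |          | cpld_control_gfsv17
RUN | cpld_mpi_gfsv17         | |          |'

# select_tests MODE
#   MODE=create : keep only RUN rows whose 4th column is "baseline"
#   MODE=verify : keep every RUN row
# The dependency column is printed for information only; it never
# decides whether a row is selected, only what it has to wait for.
select_tests() {
  printf '%s\n' "$conf" | awk -F'|' -v mode="$1" '
    {
      # trim leading/trailing blanks from every field
      for (i = 1; i <= NF; i++) gsub(/^[ \t]+|[ \t]+$/, "", $i)
    }
    $1 == "RUN" && (mode == "verify" || $4 == "baseline") {
      dep = ($5 == "") ? "none" : $5
      print $2 " (runs after: " dep ")"
    }'
}

echo "create mode:"; select_tests create
echo "verify mode:"; select_tests verify
```

In this sketch, create mode lists only the two rows marked "baseline", while verify mode lists all four, and cpld_mpi_gfsv17 appears with no test to wait for. That matches the distinction drawn in the comment: the last column affects ordering, the "baseline" column affects what is run in create mode.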
-
@DusanJovic-NOAA Since cpld_mpi_gfsv17 compares against cpld_control_gfsv17, it should be dependent on that test. The dependency was mistakenly removed at 69df411
-
No.
Run this command and you will see that only The term "dependency" or "... test depends on ..." in the context of rt.sh and rt.conf only determines the order of execution of tests, and has nothing to do with the concept of baselines. Which baselines are used for test verification is a separate property. See my comments here #2278 (comment) and here #2278 (comment) where I tried, albeit unsuccessfully, to explain the difference between the 'baseline' and 'dependency' concepts and the importance of the distinction between the two.
-
@DusanJovic-NOAA see /lfs/h2/emc/couple/noscrub/Jiande.Wang/MOM6-update/2024-FMA/ufs-weather-model/test/rt.conf-S2S. I used rt.sh -e -k -l. In my first try I found that the three "mpi" jobs were not launched, but in my 2nd and 3rd tries (with no changes to the script) they were launched. So I guess something was wrong with the machine during my 1st try. BTW, what does -m mean here? Another question: after I ran rt.sh, I realized that something was wrong in my test and I needed to kill it. What's the right way to do that? I used "ps" to find the job ID and then ran kill xxx, but with this method I had to run "kill" many times before the job finally terminated.
-
(1) For -m, there is no need for me to specify the path of the newly generated BL directory in rt.sh (it's at ........FV3_RT/REGRESSION_TEST), is that right? (2) After I used scancel to cancel all run and compile jobs, I found that the rt.sh job was still in the system (using "ps"), and it took a while for it to disappear.
(1) Yes. The location of the new baselines is always "NEW_BASELINE=${STMP}/${USER}/FV3_RT/REGRESSION_TEST", so that ./rt.sh -c .... knows where to save new baselines and ./rt.sh -m ... knows where to find them.
(2) If you want to cancel all jobs, just cancel rt.sh; it should cancel all submitted jobs.
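To see where that location resolves for a given user, the path expression quoted above can be evaluated on its own. This is a minimal sketch: the helper name `new_baseline_dir` and the example values passed to it are hypothetical, and on a real system STMP is set per machine rather than by hand.

```shell
# new_baseline_dir STMP USER
# Builds the path using the same pattern as the NEW_BASELINE expression
# quoted above. The function name and arguments are illustrative only.
new_baseline_dir() {
  stmp=$1
  user=$2
  printf '%s\n' "${stmp}/${user}/FV3_RT/REGRESSION_TEST"
}

# Example with made-up values; substitute the STMP of your machine.
new_baseline_dir /scratch/stmp jdoe
```

Because both create and verify-against-new modes derive the directory from the same two variables, no path needs to be passed between the two runs as long as STMP and USER are unchanged.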