rt.sh job dependency #2400

jiandewang · 2024-08-18T03:53:06Z

jiandewang
Aug 18, 2024

I am testing new code that is going to change the results of all or most of the jobs.

https://github.com/jiandewang/ufs-weather-model/blob/develop/tests/rt.conf#L28-L32

In the above, the first 3 jobs were launched (and failed to match the current BL), but the last one wasn't. Does this mean that when there are BL changes, a job that is not marked with "baseline" or a dependency here will not be launched at all?

DeniseWorthen · 2024-08-18T20:27:22Z

DeniseWorthen
Aug 18, 2024
Maintainer

@jiandewang Tests with something in the final column (i.e., they are dependent on another test) should not run if that test fails. So, for example, the restart test

ufs-weather-model/tests/tests/cpld_restart_gfsv17

Lines 7 to 9 in 6d0454c


	export CNTL_DIR=cpld_control_gfsv17

shows that the restart test is comparing against the control, so if control fails, the restart test should not even run.

It does seem the dependency for the mpi gfsv17 test is not set correctly

ufs-weather-model/tests/rt.conf

Line 32 in 6d0454c

    
           RUN | cpld_mpi_gfsv17                                   | - noaacloud                          |          |

It should be dependent on the cpld_control_gfsv17 test, as shown in the actual test

ufs-weather-model/tests/tests/cpld_mpi_gfsv17

Lines 7 to 9 in 6d0454c


	export CNTL_DIR=cpld_control_gfsv17

jiandewang · 2024-08-18T20:43:05Z

jiandewang
Aug 18, 2024
Author

@DeniseWorthen I repeated last night's run twice today without changing anything, and now the mpi job got launched. I don't know why it was not launched in my first try last night. Maybe a computer issue?

I think the mpi run should depend on the control's result for comparison; here I mean it should be compared with the new result of my fresh control run, not the existing BL. What it does now is compare the mpi results with the current BL. This is fine if my testing has no answer changes. But the code I am testing now is supposed to have answer changes. This way (comparing with the existing BL), the mpi job will always fail.

DeniseWorthen · 2024-08-18T20:46:53Z

DeniseWorthen
Aug 18, 2024
Maintainer

@jiandewang If you want to test that the mpi test passes against control you'll need to create a new baseline and run against that.

jiandewang · 2024-08-18T21:38:30Z

jiandewang
Aug 18, 2024
Author

@DeniseWorthen yes, that's the way (-c) I am going to do it.
I think all 3 mpi jobs have the same dependency setting issue.

DusanJovic-NOAA · 2024-08-19T11:43:20Z

DusanJovic-NOAA
Aug 19, 2024
Maintainer

@jiandewang When you run rt.sh in baseline create mode (./rt.sh -c ....), only tests labeled with 'baseline' will be executed. If you see a test not labeled with 'baseline' running, it's a bug. If you see a test labeled with 'baseline' not running, it's a bug. Please give us the exact .conf file you are using and the exact rt.sh command line you are executing, and I'll try to reproduce the bug.

The label in the last column ('dependency') has nothing to do with whether or not a test will be run. It only determines when the test is going to be run. A test with a dependency will be executed only after a successful run of the test it depends on.

When you run rt in 'verify mode', i.e. without the -c option, all tests will be executed, unless a test is turned off on a given machine, but that's not an issue here.

In your specific example, assuming you have a .conf file, let's say named 'my.conf':

COMPILE | s2swa_32bit_pdlib  | intel | -DAPP=S2SWA -D32BIT=ON -DCCPP_SUITES=FV3_GFS_v17_coupled_p8_ugwpv1 -DPDLIB=ON | - noaacloud | fv3 |
RUN | cpld_control_gfsv17                       | - noaacloud                 | baseline |
RUN | cpld_control_gfsv17_iau                   | - noaacloud                 | baseline | cpld_control_gfsv17
RUN | cpld_restart_gfsv17                       | - noaacloud                 |          | cpld_control_gfsv17
RUN | cpld_mpi_gfsv17                           | - noaacloud                 |          |

When you run:

./rt.sh -l my.conf -c -e ...

Only the compile job and two tests (cpld_control_gfsv17 and cpld_control_gfsv17_iau) will be executed. Then if you run:

./rt.sh -l my.conf -m -e ...

The compile job and all 4 tests will be executed. cpld_control_gfsv17 and cpld_mpi_gfsv17 can be executed in any order; they do not depend on any other tests. cpld_control_gfsv17_iau and cpld_restart_gfsv17 will be executed after cpld_control_gfsv17. Yes, if cpld_control_gfsv17 fails to verify against its own (updated) baselines, cpld_mpi_gfsv17 will most probably fail too, because they both have the same baselines.
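The selection rule described above (create mode runs only tests labeled 'baseline', verify mode runs everything) can be illustrated with a small Python model. This is only a sketch and not how rt.sh is implemented: the names `RunLine`, `parse_run_lines` and `selected_tests` are hypothetical, and the parser assumes the five-column RUN layout shown in 'my.conf'.

```python
from dataclasses import dataclass


@dataclass
class RunLine:
    name: str
    is_baseline: bool   # 'baseline' label in the fourth column
    dependency: str     # fifth column, '' when empty


def parse_run_lines(conf_text: str) -> list[RunLine]:
    """Parse the RUN lines of an rt.conf-style text; COMPILE lines are skipped."""
    tests = []
    for line in conf_text.splitlines():
        fields = [f.strip() for f in line.split("|")]
        if fields[0] != "RUN":
            continue
        # Pad so that a missing trailing column reads as empty.
        fields += [""] * (5 - len(fields))
        tests.append(RunLine(fields[1], fields[3] == "baseline", fields[4]))
    return tests


def selected_tests(tests: list[RunLine], create_mode: bool) -> list[str]:
    """Create mode (-c): only 'baseline' tests. Verify mode: all tests.

    Assumption: no test is switched off for the current machine.
    """
    return [t.name for t in tests if t.is_baseline or not create_mode]


# The example conf from the post above.
MY_CONF = """\
COMPILE | s2swa_32bit_pdlib  | intel | -DAPP=S2SWA -D32BIT=ON -DCCPP_SUITES=FV3_GFS_v17_coupled_p8_ugwpv1 -DPDLIB=ON | - noaacloud | fv3 |
RUN | cpld_control_gfsv17                       | - noaacloud                 | baseline |
RUN | cpld_control_gfsv17_iau                   | - noaacloud                 | baseline | cpld_control_gfsv17
RUN | cpld_restart_gfsv17                       | - noaacloud                 |          | cpld_control_gfsv17
RUN | cpld_mpi_gfsv17                           | - noaacloud                 |          |
"""

if __name__ == "__main__":
    tests = parse_run_lines(MY_CONF)
    print("create mode:", selected_tests(tests, create_mode=True))
    print("verify mode:", selected_tests(tests, create_mode=False))
```

Note that the dependency column is parsed but plays no part in the selection, which matches the point that it only affects when a test runs.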

DeniseWorthen · 2024-08-19T11:58:24Z

DeniseWorthen
Aug 19, 2024
Maintainer

@DusanJovic-NOAA Since cpld_mpi_gfsv17 compares against cpld_control_gfsv17, it should be dependent on that test. The dependency was mistakenly removed at 69df411

DusanJovic-NOAA · 2024-08-19T12:26:56Z

DusanJovic-NOAA
Aug 19, 2024
Maintainer

No. cpld_mpi_gfsv17 does NOT depend on cpld_control_gfsv17, and it should not depend on it; it can be executed before cpld_control_gfsv17. Its outputs are verified (compared) with the outputs created by cpld_control_gfsv17. That's the whole purpose of this test. You can even run cpld_mpi_gfsv17 without running cpld_control_gfsv17 at all. For example:

./rt.sh -n 'cpld_mpi_gfsv17 intel'

Run this command and you will see that only cpld_mpi_gfsv17 test runs, cpld_control_gfsv17 test does not run at all. Which means it does not depend on it.

The term "dependency" or "... test depends on ..." in the context of rt.sh and rt.conf only determines the order of execution of tests, and has nothing to do with the concept of baselines. Which baselines are used for test verification is a separate property.

See my comments here #2278 (comment) and here #2278 (comment) where I tried, albeit unsuccessfully, to explain the differences in 'baseline' and 'dependency' concepts and importance of the distinction between the two.
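To make the distinction concrete, here is a minimal Python sketch that stores the two properties separately: the rt.conf dependency (execution order) and CNTL_DIR (which baselines a test is verified against). It is an illustration only, not rt.sh code. The names `Spec`, `execution_order` and `can_start_immediately` are hypothetical, and the control test's own CNTL_DIR is inferred from the remark that it shares baselines with the mpi test.

```python
from graphlib import TopologicalSorter
from typing import NamedTuple


class Spec(NamedTuple):
    dependency: str  # rt.conf last column: test that must finish first ('' if none)
    cntl_dir: str    # CNTL_DIR: whose baselines the outputs are compared with


TESTS = {
    "cpld_control_gfsv17": Spec("", "cpld_control_gfsv17"),
    "cpld_restart_gfsv17": Spec("cpld_control_gfsv17", "cpld_control_gfsv17"),
    "cpld_mpi_gfsv17": Spec("", "cpld_control_gfsv17"),
}


def execution_order(tests: dict[str, Spec]) -> list[str]:
    """Return one valid order in which every test follows its dependency."""
    graph = {
        name: ({spec.dependency} if spec.dependency else set())
        for name, spec in tests.items()
    }
    return list(TopologicalSorter(graph).static_order())


def can_start_immediately(tests: dict[str, Spec]) -> list[str]:
    """Tests with an empty dependency column do not wait for any other test."""
    return sorted(name for name, spec in tests.items() if not spec.dependency)


if __name__ == "__main__":
    print(execution_order(TESTS))
    print(can_start_immediately(TESTS))
```

In this model the mpi and restart tests have the same `cntl_dir` but different `dependency` values: both are compared with the control's baselines, yet only the restart test has to wait for the control run.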

5 replies

DeniseWorthen Aug 19, 2024
Maintainer

Hmm....Say you're testing (running) the rt.conf in non-baseline mode. You know that your PR will change baselines. It so happens that when the jobs run, cpld_mpi runs first. It will fail, as expected. But haven't you just wasted resources running that test, when it would be expected to fail anyway because the control case is changing baselines?

jiandewang Aug 19, 2024
Author

> Hmm....Say you're testing (running) the rt.conf in non-baseline mode. You know that your PR will change baselines. It so happens that when the jobs run, cpld_mpi runs first. It will fail, as expected. But haven't you just wasted resources running that test, when it would be expected to fail anyway because the control case is changing baselines?

What if I am not sure whether my new code is going to change the BL or not (that's why I am testing)? If the control job matches the BL but the mpi job fails, then I know something is wrong in my new code, as it can't produce b4b results with a different PE #. But if both the control and mpi jobs fail, then I can't draw a clear conclusion about whether my new code will or will not generate b4b results with a different PE #, since here the mpi job is compared with the pre-existing BL. So the reliable way is to run rt.sh twice, with -c and then -m, to verify b4b for the mpi jobs.
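The reasoning above amounts to a small decision table for a verify run against the pre-existing BL. The following Python sketch spells it out; the function name `interpret_verify_run` and the wording of the returned messages are hypothetical, and the case where the control fails but the mpi test passes is not discussed in the post, so it is reported as such.

```python
def interpret_verify_run(control_matches_bl: bool, mpi_matches_bl: bool) -> str:
    """Interpret control/mpi outcomes when both are compared with the existing BL."""
    if control_matches_bl and mpi_matches_bl:
        # Implied by the post: no answer change, and the mpi run agrees with control.
        return "no answer change; mpi run reproduces the control baselines"
    if control_matches_bl and not mpi_matches_bl:
        return "not b4b across PE counts: something is wrong in the new code"
    if not control_matches_bl and not mpi_matches_bl:
        # Answers changed, so the old BL cannot tell us about PE reproducibility.
        return "inconclusive: answers changed; rerun with -c, then -m"
    return "case not covered in the post"


if __name__ == "__main__":
    for control in (True, False):
        for mpi in (True, False):
            print(control, mpi, "->", interpret_verify_run(control, mpi))
```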

DusanJovic-NOAA Aug 19, 2024
Maintainer

Maybe it will fail (and in this case it will probably fail), but maybe it will not. Do I understand correctly that you suggest we make all tests that use baselines from their control run depend on the control run, even though they do not strictly depend on it, unlike, for example, the restart test, which must wait for its control to finish in order to get the restart files from that run's run directory? Or do you suggest that we make only cpld_mpi_gfsv17 depend on its control?

Because in the general case the control can fail while another test that uses its baselines might not fail, for example:

COMPILE | s2swa | intel | -DAPP=S2SWA -DCCPP_SUITES=FV3_GFS_v17_coupled_p8 | | fv3 |
RUN | cpld_control_p8                              | - noaacloud                          | baseline |
RUN | cpld_control_qr_p8                           | - noaacloud                          |          |

Here cpld_control_p8 can fail but cpld_control_qr_p8 might succeed, for example if an inadvertent change is made in the part of the code where non-quilt restarts are created. If we make cpld_control_qr_p8 depend on cpld_control_p8, rt will not even submit cpld_control_qr_p8, even though it would pass, which is valuable information.

My point is, we should not use 'dependency' to control whether a test should run based on whether its control passes, only to control which test must run after its control has finished. Yes, that will cost us in some cases, but it's safer and potentially more useful.

And making all tests that share baselines with their control depend on it would be very confusing, because we would not know whether the dependency is there because we need to run that test after its control, or because we want to skip the execution if the control fails.

DeniseWorthen Aug 19, 2024
Maintainer

Thanks Dusan. I understand now what you are saying! Dependency controls only when a test runs.

> My point is, we should not use 'dependency' to control which test should run or not based on whether its control passes, only to control which test must run after its control finished. Yes, that will cost us in some cases but it's more safe and potentially more useful.

DeniseWorthen Aug 21, 2024
Maintainer

Just a final note on this discussion, in case others come here looking for explanation. In my initial comment above, I linked the concept of 'dependency' with whether the test compares against the baseline of another test (using the mpi test as an example).

In fact, that is not the use of 'dependency' within the context of the rt.conf. The mpi test can be run independently of the control test. It is not dependent. The restart test must retrieve the warmstart files from the control test. It is dependent.

jiandewang · 2024-08-19T12:31:48Z

jiandewang
Aug 19, 2024
Author

@DusanJovic-NOAA see /lfs/h2/emc/couple/noscrub/Jiande.Wang/MOM6-update/2024-FMA/ufs-weather-model/test/rt.conf-S2S

I used rt.sh -e -k -l

In my first try I found that those three "mpi" jobs were not launched, but in my 2nd and 3rd tries (with nothing changed in the script) the three mpi jobs were launched. So I guess something was wrong with the machine in my 1st try.

BTW, what does -m mean here?

Another question: after I ran rt.sh, I realized that there was something wrong in my test and I needed to kill it. What's the right way to do that? What I did was use "ps" to find the job ID #, then kill xxx, but with this method I found I had to run "kill" many times before the job was finally terminated.

1 reply

DusanJovic-NOAA Aug 19, 2024
Maintainer

  -m  compare against new baseline results

This option allows you to run rt.sh to verify against your newly created baselines. These are the baselines saved in your directory before they are moved to the common staging area. This is how you can confirm that your code that requires new baselines is reproducible.

If you want to kill just one test, you need to find the job id of that test (on Slurm machines, with the squeue command), and once you know the job id you can just cancel the job with scancel <jobid>.

jiandewang · 2024-08-19T12:47:30Z

jiandewang
Aug 19, 2024
Author

(1) For -m, there is no need to specify the path of the newly generated BL dir (it's at ........FV3_RT/REGRESSION_TEST) in rt.sh myself, is that right?

(2) After I used scancel and cancelled all run and compile jobs, I found that the rt.sh job was still in the system (using "ps"), and it took a while for it to disappear.

1 reply

DusanJovic-NOAA Aug 19, 2024
Maintainer

(1) Yes. The location of the new baselines is always "NEW_BASELINE=${STMP}/${USER}/FV3_RT/REGRESSION_TEST", so that ./rt.sh -c .... knows where to save new baselines and ./rt.sh -m ... knows where to find them.

(2) If you want to cancel all jobs, just cancel rt.sh; it should cancel all submitted jobs.
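For reference, the NEW_BASELINE location quoted in (1) can be reproduced outside rt.sh with a few lines of Python. This is a sketch under assumptions: the helper name `new_baseline_dir` is hypothetical, the STMP and USER values in the example are made up, and on a real machine STMP is whatever rt.sh sets for that platform.

```python
import os
from typing import Mapping, Optional


def new_baseline_dir(env: Optional[Mapping[str, str]] = None) -> str:
    """Build ${STMP}/${USER}/FV3_RT/REGRESSION_TEST from the given environment."""
    env = os.environ if env is None else env
    return f"{env['STMP']}/{env['USER']}/FV3_RT/REGRESSION_TEST"


if __name__ == "__main__":
    # Hypothetical values, for illustration only.
    print(new_baseline_dir({"STMP": "/scratch/stmp", "USER": "alice"}))
```

Because both -c and -m derive the same path from the same two variables, the location never has to be passed on the command line.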

Answer selected by jiandewang
