On my system I use MPICH for mpirun, and I changed "OpenMPI-single" to "MPI". On our machine, one node has 32 cores. When I run the command "python3 run-QbC.py" directly on the login node, it works: 8 tasks (3 cores per task) run simultaneously. But when I submit the job through Slurm, only one task runs with 3 cores; the others never start. The nnp-training-stdout.err file tells us: "Primary job terminated normally, but 1 process returned a non-zero exit code. Per user-direction, the job has been aborted. mpirun noticed that process rank 2 with PID 123887 on node node3 exited on signal 6 (Aborted)." So how should I submit the job with Slurm when I use MPICH?
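For concreteness, a minimal sketch of a batch script for this setup, assuming one 32-core node and 8 concurrent trainings with 3 cores each; the job name, walltime, and module line are placeholders, not a tested recipe:

```bash
#!/bin/bash
# Sketch of a Slurm submission script for the QbC driver; resource
# numbers follow the description above (8 tasks x 3 cores on one node).
#SBATCH --job-name=qbc
#SBATCH --nodes=1
#SBATCH --ntasks=8           # 8 trainings running simultaneously
#SBATCH --cpus-per-task=3    # 3 cores per training (24 of 32 cores)
#SBATCH --time=24:00:00      # placeholder walltime

# Load the same MPICH and Python environment that works on the login
# node (the module name is site-specific).
# module load mpich

python3 run-QbC.py
```

Replies: 1 comment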
This should work in principle, as long as you can launch multiple MPI runs at once. Your errors indicate that one of the processes aborts, so it could be that something is wrong with that MPI run independent of AML. It is not uncommon for things to behave differently on compute nodes compared to login nodes. Again, you should be able to test that by isolating that MPI run and launching it outside of AML, ideally with the same inputs.
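For example, a sketch of such an isolation test, run from an interactive allocation on a compute node; the directory name train-001 is hypothetical, and nnp-train is assumed to be the underlying n2p2 trainer:

```bash
# Isolation test: rerun one of the aborted trainings by hand, outside
# of AML, with the same inputs. Get an interactive compute-node shell
# first, e.g.:  salloc --nodes=1 --ntasks=1 --cpus-per-task=3
cd train-001          # hypothetical directory with the prepared inputs
mpirun -n 3 nnp-train > isolated.out 2> isolated.err
echo "mpirun exit status: $?"
```

If the run also aborts here, the problem is in the MPI or training setup on the compute node rather than in AML.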