Hello,
we noticed that mpirun does not run correctly when an MPIJob name contains a dot.
For example, the following MPIJob
apiVersion: kubeflow.org/v1
kind: MPIJob
metadata:
  name: myjob.1
spec:
  slotsPerWorker: 1
  runPolicy:
    cleanPodPolicy: Running
  mpiReplicaSpecs:
    Launcher:
      restartPolicy: OnFailure
      replicas: 1
      template:
        spec:
          containers:
          - image: IMAGE
            name: launcher
            imagePullPolicy: Always
            command:
            - mpirun
            - --allow-run-as-root
            - -np
            - "2"
    Worker:
      replicas: 2
      template:
        spec:
          containers:
          - image: IMAGE
            name: worker
            imagePullPolicy: Always
leads to this error message when mpirun is executed:
A hostfile was provided that contains multiple definitions
of the slot count for at least one node:
hostfile: hosts
node: mpi-worker
You can either list a node multiple times, once for each slot,
or you can provide a single line that contains "slot=N". Mixing
the two methods is not supported.
Please correct the hostfile and try again.
The image has Open MPI v4 installed.
I assume this is caused by how Open MPI interprets the hostnames, which include the dot from the MPIJob name. Also see open-mpi/ompi#4732 (comment) for a related discussion.
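To illustrate the suspected mechanism, here is a minimal Python sketch. It rests on two assumptions that are not confirmed in this issue: that worker hostnames follow the pattern `<job-name>-worker-<index>`, and that Open MPI, unless told to keep fully qualified names, shortens each hostname to the part before the first dot. The helper names are hypothetical and only model the behaviour; they are not Open MPI or operator code.

```python
from collections import Counter


def short_hostname(host: str) -> str:
    """Return the part of a hostname before the first dot (assumed shortening rule)."""
    return host.split(".", 1)[0]


def worker_hostnames(job_name: str, replicas: int) -> list[str]:
    """Build worker hostnames, assuming the pattern <job-name>-worker-<index>."""
    return [f"{job_name}-worker-{i}" for i in range(replicas)]


def colliding_nodes(hosts: list[str]) -> dict[str, int]:
    """Map each shortened name that occurs more than once to its count."""
    counts = Counter(short_hostname(h) for h in hosts)
    return {name: n for name, n in counts.items() if n > 1}


if __name__ == "__main__":
    # With a dot in the job name, both workers shorten to the same node name.
    print(colliding_nodes(worker_hostnames("myjob.1", 2)))  # {'myjob': 2}
    # Without a dot, the names stay distinct.
    print(colliding_nodes(worker_hostnames("myjob-1", 2)))  # {}
```

If this model is right, two hostfile entries that resolve to the same node would explain both the "multiple definitions of the slot count" error and the case where only one worker is used.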
This can also lead to the case where the mpirun command runs successfully but only one worker is used.
Just mentioning it here as well in case someone stumbles upon this. We will probably validate the MPIJob name on creation, or look into different MPI settings such as -mca orte_keep_fqdn_hostnames t.
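A name check on creation could look roughly like the Python sketch below. It is only an illustration: `validate_mpijob_name` is a hypothetical helper, not part of the operator, and it assumes that restricting names to a single DNS label (lowercase alphanumerics and `-`, at most 63 characters, no dots) is strict enough to avoid the problem.

```python
import re

# A single DNS label: lowercase alphanumerics and '-', starting and ending
# with an alphanumeric character. Dots are not allowed.
_DNS_LABEL = re.compile(r"[a-z0-9]([-a-z0-9]*[a-z0-9])?")
_MAX_LABEL_LENGTH = 63


def validate_mpijob_name(name: str) -> list[str]:
    """Return a list of problems with the name; an empty list means it passed."""
    problems = []
    if "." in name:
        problems.append("name must not contain '.'")
    if len(name) > _MAX_LABEL_LENGTH:
        problems.append(f"name must be at most {_MAX_LABEL_LENGTH} characters")
    if not _DNS_LABEL.fullmatch(name):
        problems.append("name must be a lowercase DNS label")
    return problems


if __name__ == "__main__":
    for candidate in ("myjob.1", "myjob-1"):
        print(candidate, validate_mpijob_name(candidate))
```

Rejecting the name up front gives the user a clear error at submission time, whereas the MCA setting would keep dotted names usable but depends on the MPI implementation in the image.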