MPIJobs with dots in name will lead to wrong hostfiles #733

@hahahannes

Hello,

We noticed that mpirun does not run correctly when the MPIJob name contains a dot.

For example, this MPIJob:

apiVersion: kubeflow.org/v1
kind: MPIJob
metadata:
  name: myjob.1
spec:
  slotsPerWorker: 1
  runPolicy:
    cleanPodPolicy: Running 
  mpiReplicaSpecs:
    Launcher:
      restartPolicy: OnFailure
      replicas: 1
      template:
        spec:
          containers:
          - image: IMAGE
            name: launcher
            imagePullPolicy: Always
            command:
            - mpirun
            - --allow-run-as-root
            - -np
            - "2"
    Worker:
      replicas: 2 
      template:
        spec:
          containers:
          - image: IMAGE
            name: worker
            imagePullPolicy: Always

will lead to this error message when mpirun is executed:

A hostfile was provided that contains multiple definitions
of the slot count for at least one node:

  hostfile:  hosts
  node:      mpi-worker

You can either list a node multiple times, once for each slot,
or you can provide a single line that contains "slot=N". Mixing
the two methods is not supported.

Please correct the hostfile and try again.

The image has Open MPI v4 installed.

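I assume this is caused by how Open MPI interprets hostnames. The worker hostnames contain the dot from the MPIJob name, and by default Open MPI seems to keep only the part before the first dot, so distinct workers can end up looking like the same node. See open-mpi/ompi#4732 (comment) for a related discussion.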
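To illustrate that assumption, here is a minimal Python sketch. It is not Open MPI's or the operator's actual code: the `<job>-worker-<index>` naming pattern and the "cut at the first dot" rule are assumptions made only for illustration, and the function names are hypothetical.

```python
from typing import List


def short_hostname(hostname: str) -> str:
    """Keep only the text before the first dot (assumed default behaviour)."""
    return hostname.split(".", 1)[0]


def worker_hosts(job_name: str, replicas: int) -> List[str]:
    """Hypothetical worker naming pattern: "<job>-worker-<index>"."""
    return [f"{job_name}-worker-{i}" for i in range(replicas)]


def distinct_nodes(job_name: str, replicas: int) -> List[str]:
    """Node names that remain distinct after shortening every hostname."""
    return sorted({short_hostname(h) for h in worker_hosts(job_name, replicas)})


print(distinct_nodes("myjob.1", 2))  # ['myjob']
print(distinct_nodes("myjob-1", 2))  # ['myjob-1-worker-0', 'myjob-1-worker-1']
```

With a dot in the job name, both workers shorten to the same node name. That would match both symptoms: a hostfile that seems to define the same node twice, or a run that only uses one worker.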

It can also happen that the mpirun command succeeds but only one worker is used.

I am mentioning it here in case someone else stumbles upon this. We will probably either validate the MPIJob name on creation or look into different MPI settings such as -mca orte_keep_fqdn_hostnames t.
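A name check on creation could look roughly like the sketch below. It assumes that restricting names to a single DNS-1123 label (lowercase alphanumerics and hyphens, no dots, at most 63 characters) is acceptable. `is_safe_mpijob_name` is a hypothetical helper, not part of any existing API.

```python
import re

# A DNS-1123 label: lowercase alphanumerics and '-', starting and ending
# with an alphanumeric character. Dots are not allowed.
DNS1123_LABEL = re.compile(r"[a-z0-9]([-a-z0-9]*[a-z0-9])?")
MAX_LABEL_LENGTH = 63


def is_safe_mpijob_name(name: str) -> bool:
    """Return True if the name is a single DNS-1123 label (hypothetical check)."""
    if len(name) > MAX_LABEL_LENGTH:
        return False
    return DNS1123_LABEL.fullmatch(name) is not None


print(is_safe_mpijob_name("myjob.1"))  # False
print(is_safe_mpijob_name("myjob-1"))  # True
```

A real check would probably need a shorter length limit to leave room for suffixes that get appended to the job name when pod names are built.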
