This document provides a high-level overview of where MPI Operator will grow in future releases. See discussions in the original RFC here.
- Decouple the tight dependency on Open MPI and support other collective communication frameworks. Related issue: #12.
- Support new versions of MPI Operator in kubeflow/manifests.
- Redesign components of MPI Operator to support fault-tolerant collective communication frameworks such as caicloud/ftlib.
- Allow more flexible RBAC for `MPIJob`s so existing RBAC resources can be reused. Related issue: #20.
- Support installation of MPI Operator via Helm. Related issue: #11.
- Support Go modules.
- Consider supporting the launch of framework-specific services such as TensorBoard and Horovod Timeline. Since tf-operator already supports TensorBoard, we may want to move this functionality to kubeflow/common so it can be reused. Related issue: #138.
- Automate publishing images to Docker Hub whenever there is a new release or commit. Related issue: #93.
- Ensure new versions of `deploy/mpi-operator.yaml` are always compatible with kubeflow/manifests.
- Add end-to-end tests via Kubeflow's testing infrastructure. Related issue: #9.
- Report better statuses for launcher and worker pods. Related issue: #90.