Kubernetes Custom Resource and Operator for PyTorch jobs

⚠️ kubeflow/pytorch-operator is not maintained

This operator has been merged into the Kubeflow Training Operator. This repository has been archived and is no longer maintained.

Overview

This repository contains the specification and implementation of the PyTorchJob custom resource definition (CRD). With this custom resource, users can create and manage PyTorch jobs in the same way as built-in Kubernetes resources. See the CRD definition.

Prerequisites

Kubernetes >= 1.8
kubectl

Installing PyTorch Operator

Please refer to the installation instructions in the Kubeflow user guide. The installation deploys the PyTorchJob CRD and the pytorch-operator controller, which manages the lifecycle of PyTorch jobs.
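
One way to confirm that the installation registered the CRD is to query it by name. This is a minimal sketch: it assumes the CRD is named pytorchjobs.kubeflow.org (the resource name and API group used elsewhere in this README), and the helper name crd_installed is hypothetical.

```shell
# Hypothetical helper: returns success if the PyTorchJob CRD is registered.
# Assumption: the CRD is named pytorchjobs.kubeflow.org.
crd_installed() {
  kubectl get crd pytorchjobs.kubeflow.org >/dev/null 2>&1
}
```

For example, `crd_installed && echo "PyTorchJob CRD found"` prints the message only when the lookup succeeds.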

Creating a PyTorch Job

You can create a PyTorch job by defining a PyTorchJob config file. See the manifests for the distributed MNIST example. You may change the config file to suit your requirements.

cat examples/mnist/v1/pytorch_job_mnist_gloo.yaml

Deploy the PyTorchJob resource to start training:

kubectl create -f examples/mnist/v1/pytorch_job_mnist_gloo.yaml

You should now be able to see the created pods matching the specified number of replicas.

kubectl get pods -l pytorch-job-name=pytorch-dist-mnist-gloo
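
To tell the master pod from the worker pods at a glance, the replica type label can be shown as an extra column. The sketch below assumes the pods carry the pytorch-job-name and pytorch-replica-type labels used in this README; the helper name job_pods is hypothetical.

```shell
# Hypothetical helper: list the pods of a job, with the value of the
# pytorch-replica-type label shown as an additional column (-L).
job_pods() {
  kubectl get pods -l "pytorch-job-name=$1" -L pytorch-replica-type
}
```

Call it with the job name, for example `job_pods pytorch-dist-mnist-gloo`.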

Training should run for about 10 epochs and take 5-10 minutes on a CPU cluster. You can inspect the logs to follow the training progress.

PODNAME=$(kubectl get pods -l pytorch-job-name=pytorch-dist-mnist-gloo,pytorch-replica-type=master -o name)
kubectl logs -f ${PODNAME}
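
Instead of following the logs, a script can block until the job finishes. This is a sketch under two assumptions: the kubectl in use provides the wait subcommand (older releases do not), and the job reports a Succeeded condition in its status, as in the sample output in the monitoring section. The helper name wait_for_job and the 15m default timeout are illustrative choices.

```shell
# Hypothetical helper: block until the job reports the Succeeded condition,
# or fail once the timeout expires (default 15m, an arbitrary choice).
# Assumption: this kubectl version provides the `wait` subcommand.
wait_for_job() {
  job="$1"
  timeout="${2:-15m}"
  kubectl wait --for=condition=Succeeded "pytorchjob/${job}" --timeout="${timeout}"
}
```

For example, `wait_for_job pytorch-dist-mnist-gloo 20m` waits up to 20 minutes. A job that fails never reaches the Succeeded condition, so the call only returns an error once the timeout is reached.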

Monitoring a PyTorch Job

kubectl get -o yaml pytorchjobs pytorch-dist-mnist-gloo

See the status section of the output to monitor the job status. Here is sample output for a job that has completed successfully.

apiVersion: v1
items:
- apiVersion: kubeflow.org/v1
  kind: PyTorchJob
  metadata:
    creationTimestamp: 2019-01-11T00:51:48Z
    generation: 1
    name: pytorch-dist-mnist-gloo
    namespace: default
    resourceVersion: "2146573"
    selfLink: /apis/kubeflow.org/v1/namespaces/kubeflow/pytorchjobs/pytorch-dist-mnist-gloo
    uid: 13ad0e7f-153b-11e9-b5c1-42010a80001e
  spec:
    pytorchReplicaSpecs:
      Master:
        replicas: 1
        restartPolicy: OnFailure
        template:
          spec:
            containers:
            - args:
              - --backend
              - gloo
              image: gcr.io/kubeflow-ci/pytorch-dist-mnist-test:v1.0
              name: pytorch
              resources:
                limits:
                  nvidia.com/gpu: "1"
      Worker:
        replicas: 1
        restartPolicy: OnFailure
        template:
          spec:
            containers:
            - args:
              - --backend
              - gloo
              image: gcr.io/kubeflow-ci/pytorch-dist-mnist-test:v1.0
              name: pytorch
              resources:
                limits:
                  nvidia.com/gpu: "1"
  status:
    completionTime: 2019-01-11T01:03:15Z
    conditions:
    - lastTransitionTime: 2019-01-11T00:51:48Z
      lastUpdateTime: 2019-01-11T00:51:48Z
      message: PyTorchJob pytorch-dist-mnist-gloo is created.
      reason: PyTorchJobCreated
      status: "True"
      type: Created
    - lastTransitionTime: 2019-01-11T00:57:22Z
      lastUpdateTime: 2019-01-11T00:57:22Z
      message: PyTorchJob pytorch-dist-mnist-gloo is running.
      reason: PyTorchJobRunning
      status: "False"
      type: Running
    - lastTransitionTime: 2019-01-11T01:03:15Z
      lastUpdateTime: 2019-01-11T01:03:15Z
      message: PyTorchJob pytorch-dist-mnist-gloo is successfully completed.
      reason: PyTorchJobSucceeded
      status: "True"
      type: Succeeded
    replicaStatuses:
      Master:
        succeeded: 1
      Worker:
        succeeded: 1
    startTime: 2019-01-11T00:57:22Z
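
When only one condition matters, a JSONPath filter can pull its status out of the full YAML. The sketch below assumes the status.conditions layout shown in the sample output; the helper name job_condition is hypothetical.

```shell
# Hypothetical helper: print the status field ("True" or "False") of a
# single condition type, e.g. Created, Running or Succeeded.
job_condition() {
  job="$1"
  cond_type="$2"
  kubectl get pytorchjob "$job" \
    -o jsonpath="{.status.conditions[?(@.type==\"${cond_type}\")].status}"
}
```

For example, `job_condition pytorch-dist-mnist-gloo Succeeded` queries the Succeeded condition, whose status is "True" in the sample output above.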

Contributing

Please refer to the developer_guide.
