1322: Modified manifests to use all-in-one training-operator (#1346)
* 1322: Modified manifests to use all-in-one training-operator WIP

Actions taken:
    - replaced tf-job-operator => training-operator
    - replaced kubeflow-tfjobs- => kubeflow-training-
    - moved the CRDs for mxjobs, tfjobs, pytorchjobs and xgboostjobs from
      config/crd/bases to manifests/base/ and prefixed them with crd_
Ref: #1322
Testing steps: To be added
Work in Progress
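
The two renames listed under "Actions taken" amount to literal substring substitutions in the manifests. A minimal Python sketch, assuming the substitutions are applied as plain text replacements; the helper name `rename_identifiers` is hypothetical and not part of the repository:

```python
# Old => new identifiers, taken from the commit message.
RENAMES = {
    "tf-job-operator": "training-operator",
    "kubeflow-tfjobs-": "kubeflow-training-",
}


def rename_identifiers(text):
    """Apply each old => new substitution to a manifest's text (hypothetical helper)."""
    for old, new in RENAMES.items():
        text = text.replace(old, new)
    return text


if __name__ == "__main__":
    print(rename_identifiers("name: tf-job-operator"))
    print(rename_identifiers("name: kubeflow-tfjobs-admin"))
```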

* 1322: synced up config/manager with manifests

Verified that the training operator starts and launches all four controllers:
<pre>
kubectl -n kubeflow logs -f training-operator-694766989-pp2j4
I0812 21:43:24.739862       1 request.go:645] Throttling request took 1.048945631s, request: GET:https://172.19.0.1:443/apis/networking.k8s.io/v1?timeout=32s
2021-08-12T21:43:25.694Z	INFO	controller-runtime.metrics	metrics server is starting to listen	{"addr": ":8080"}
2021-08-12T21:43:25.790Z	INFO	setup	starting manager
2021-08-12T21:43:25.790Z	INFO	controller-runtime.manager	starting metrics server	{"path": "/metrics"}
2021-08-12T21:43:25.790Z	INFO	controller-runtime.manager.controller.tf-operator	Starting EventSource	{"source": "kind source: /, Kind="}
2021-08-12T21:43:25.790Z	INFO	controller-runtime.manager.controller.mxnet-operator	Starting EventSource	{"source": "kind source: /, Kind="}
2021-08-12T21:43:25.791Z	INFO	controller-runtime.manager.controller.pytorchjob-operator	Starting EventSource	{"source": "kind source: /, Kind="}
2021-08-12T21:43:25.791Z	INFO	controller-runtime.manager.controller.xgboostjob-operator	Starting EventSource	{"source": "kind source: /, Kind="}
2021-08-12T21:43:26.289Z	INFO	controller-runtime.manager.controller.xgboostjob-operator	Starting EventSource	{"source": "kind source: /, Kind="}
2021-08-12T21:43:26.294Z	INFO	controller-runtime.manager.controller.pytorchjob-operator	Starting EventSource	{"source": "kind source: /, Kind="}
2021-08-12T21:43:26.589Z	INFO	controller-runtime.manager.controller.mxnet-operator	Starting EventSource	{"source": "kind source: /, Kind="}
2021-08-12T21:43:26.688Z	INFO	controller-runtime.manager.controller.tf-operator	Starting EventSource	{"source": "kind source: /, Kind="}
2021-08-12T21:43:26.889Z	INFO	controller-runtime.manager.controller.tf-operator	Starting EventSource	{"source": "kind source: /, Kind="}
2021-08-12T21:43:26.889Z	INFO	controller-runtime.manager.controller.pytorchjob-operator	Starting EventSource	{"source": "kind source: /, Kind="}
2021-08-12T21:43:26.890Z	INFO	controller-runtime.manager.controller.xgboostjob-operator	Starting EventSource	{"source": "kind source: /, Kind="}
2021-08-12T21:43:26.890Z	INFO	controller-runtime.manager.controller.mxnet-operator	Starting EventSource	{"source": "kind source: /, Kind="}
2021-08-12T21:43:26.990Z	INFO	controller-runtime.manager.controller.xgboostjob-operator	Starting Controller
2021-08-12T21:43:26.990Z	INFO	controller-runtime.manager.controller.tf-operator	Starting Controller
2021-08-12T21:43:26.990Z	INFO	controller-runtime.manager.controller.tf-operator	Starting workers	{"worker count": 1}
2021-08-12T21:43:26.990Z	INFO	controller-runtime.manager.controller.pytorchjob-operator	Starting Controller
2021-08-12T21:43:26.991Z	INFO	controller-runtime.manager.controller.xgboostjob-operator	Starting workers	{"worker count": 1}
2021-08-12T21:43:26.991Z	INFO	controller-runtime.manager.controller.pytorchjob-operator	Starting workers	{"worker count": 1}
2021-08-12T21:43:26.991Z	INFO	controller-runtime.manager.controller.mxnet-operator	Starting Controller
2021-08-12T21:43:26.991Z	INFO	controller-runtime.manager.controller.mxnet-operator	Starting workers	{"worker count": 1}
</pre>
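
A check like the one done by eye on this log can be scripted. The following is a sketch only, assuming the log format shown above (logger name followed by the message); the function names and the `EXPECTED` set are illustrative and do not come from the repository:

```python
import re

# Controller names as they appear in the logger field of the log above.
EXPECTED = {
    "tf-operator",
    "mxnet-operator",
    "pytorchjob-operator",
    "xgboostjob-operator",
}

# Captures the controller name from lines whose message is "Starting workers".
WORKERS_RE = re.compile(
    r"controller-runtime\.manager\.controller\.([\w-]+)\s+Starting workers"
)


def controllers_with_workers(log_text):
    """Return the set of controllers that logged 'Starting workers'."""
    return set(WORKERS_RE.findall(log_text))


def missing_controllers(log_text):
    """Return the expected controllers that never reached 'Starting workers'."""
    return EXPECTED - controllers_with_workers(log_text)
```

Applied to the log above, `missing_controllers` would be expected to return an empty set, since each of the four controllers has a "Starting workers" line.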

* 1322: incorporated review comments - added all resources in ClusterRole

* 1322: incorporated review comments

- controller-gen now generates the CRDs directly in manifests/base
  instead of config/crd/bases
- updated setup-training-operator.sh to use manifests/overlays/standalone

* 1322: removed config/crd/bases, as the CRDs are now generated in manifests/base

* 1322: incorporated review comments related to using separate role files

* 1322: removed image name replacement
deepak-muley authored Aug 13, 2021
1 parent 3e11cde commit 46c5864
Showing 46 changed files with 180 additions and 331 deletions.
2 changes: 1 addition & 1 deletion Makefile
@@ -38,7 +38,7 @@ help: ## Display this help.
 ##@ Development
 
 manifests: controller-gen ## Generate WebhookConfiguration, ClusterRole and CustomResourceDefinition objects.
-	$(CONTROLLER_GEN) $(CRD_OPTIONS) rbac:roleName=manager-role webhook paths="./pkg/apis/..." output:crd:artifacts:config=config/crd/bases
+	$(CONTROLLER_GEN) $(CRD_OPTIONS) rbac:roleName=manager-role webhook paths="./pkg/apis/..." output:crd:artifacts:config=manifests/base
 
 generate: controller-gen ## Generate code containing DeepCopy, DeepCopyInto, and DeepCopyObject method implementations.
 	$(CONTROLLER_GEN) object:headerFile="hack/boilerplate.go.txt" paths="./pkg/apis/..."
13 changes: 0 additions & 13 deletions config/crd/kustomization.yaml

This file was deleted.

19 changes: 0 additions & 19 deletions config/crd/kustomizeconfig.yaml

This file was deleted.

57 changes: 0 additions & 57 deletions config/manager/manager.yaml

This file was deleted.

8 changes: 4 additions & 4 deletions manifests/base/cluster-role-binding.yaml
@@ -3,12 +3,12 @@ apiVersion: rbac.authorization.k8s.io/v1beta1
 kind: ClusterRoleBinding
 metadata:
   labels:
-    app: tf-job-operator
-  name: tf-job-operator
+    app: training-operator
+  name: training-operator
 roleRef:
   apiGroup: rbac.authorization.k8s.io
   kind: ClusterRole
-  name: tf-job-operator
+  name: training-operator
 subjects:
 - kind: ServiceAccount
-  name: tf-job-operator
+  name: training-operator
139 changes: 43 additions & 96 deletions manifests/base/cluster-role.yaml
@@ -3,100 +3,47 @@ apiVersion: rbac.authorization.k8s.io/v1beta1
 kind: ClusterRole
 metadata:
   labels:
-    app: tf-job-operator
-  name: tf-job-operator
+    app: training-operator
+  name: training-operator
 rules:
-- apiGroups:
-  - kubeflow.org
-  resources:
-  - tfjobs
-  - tfjobs/status
-  - tfjobs/finalizers
-  verbs:
-  - '*'
-- apiGroups:
-  - apiextensions.k8s.io
-  resources:
-  - customresourcedefinitions
-  verbs:
-  - '*'
-- apiGroups:
-  - ""
-  resources:
-  - pods
-  - services
-  - endpoints
-  - events
-  verbs:
-  - '*'
-- apiGroups:
-  - apps
-  - extensions
-  resources:
-  - deployments
-  verbs:
-  - '*'
-- apiGroups:
-  - scheduling.volcano.sh
-  resources:
-  - podgroups
-  verbs:
-  - '*'
-
----
-
-apiVersion: rbac.authorization.k8s.io/v1
-kind: ClusterRole
-metadata:
-  name: kubeflow-tfjobs-admin
-  labels:
-    rbac.authorization.kubeflow.org/aggregate-to-kubeflow-admin: "true"
-aggregationRule:
-  clusterRoleSelectors:
-  - matchLabels:
-      rbac.authorization.kubeflow.org/aggregate-to-kubeflow-tfjobs-admin: "true"
-rules: []
-
----
-
-apiVersion: rbac.authorization.k8s.io/v1
-kind: ClusterRole
-metadata:
-  name: kubeflow-tfjobs-edit
-  labels:
-    rbac.authorization.kubeflow.org/aggregate-to-kubeflow-edit: "true"
-    rbac.authorization.kubeflow.org/aggregate-to-kubeflow-tfjobs-admin: "true"
-rules:
-- apiGroups:
-  - kubeflow.org
-  resources:
-  - tfjobs
-  - tfjobs/status
-  verbs:
-  - get
-  - list
-  - watch
-  - create
-  - delete
-  - deletecollection
-  - patch
-  - update
-
----
-
-apiVersion: rbac.authorization.k8s.io/v1
-kind: ClusterRole
-metadata:
-  name: kubeflow-tfjobs-view
-  labels:
-    rbac.authorization.kubeflow.org/aggregate-to-kubeflow-view: "true"
-rules:
-- apiGroups:
-  - kubeflow.org
-  resources:
-  - tfjobs
-  - tfjobs/status
-  verbs:
-  - get
-  - list
-  - watch
+  - apiGroups:
+      - kubeflow.org
+    resources:
+      - tfjobs
+      - mxjobs
+      - pytorchjobs
+      - xgboostjobs
+      - tfjobs/status
+      - pytorchjobs/status
+      - mxjobs/status
+      - xgboostjobs/status
+    verbs:
+      - "*"
+  - apiGroups:
+      - apiextensions.k8s.io
+    resources:
+      - customresourcedefinitions
+    verbs:
+      - "*"
+  - apiGroups:
+      - ""
+    resources:
+      - pods
+      - services
+      - endpoints
+      - events
+    verbs:
+      - "*"
+  - apiGroups:
+      - apps
+      - extensions
+    resources:
+      - deployments
+    verbs:
+      - "*"
+  - apiGroups:
+      - scheduling.volcano.sh
+    resources:
+      - podgroups
+    verbs:
+      - "*"
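
The new kubeflow.org rule in this ClusterRole lists each job kind plus its `/status` subresource. The old rule also granted `tfjobs/finalizers`; the new rule lists no finalizers subresource, and the commit does not say whether that matters. A small Python sketch, with a hypothetical helper `missing_resources`, shows how such coverage could be checked against the resource list above:

```python
JOB_KINDS = ["tfjobs", "mxjobs", "pytorchjobs", "xgboostjobs"]

# The kubeflow.org rule as it appears in manifests/base/cluster-role.yaml after this commit.
KUBEFLOW_RULE = {
    "apiGroups": ["kubeflow.org"],
    "resources": [
        "tfjobs",
        "mxjobs",
        "pytorchjobs",
        "xgboostjobs",
        "tfjobs/status",
        "pytorchjobs/status",
        "mxjobs/status",
        "xgboostjobs/status",
    ],
    "verbs": ["*"],
}


def missing_resources(rule, kinds, subresources=("status",)):
    """List resources and subresources for `kinds` that the rule does not name (hypothetical helper)."""
    granted = set(rule["resources"])
    wanted = set(kinds) | {f"{kind}/{sub}" for kind in kinds for sub in subresources}
    return sorted(wanted - granted)
```

Calling `missing_resources(KUBEFLOW_RULE, JOB_KINDS, ("status", "finalizers"))` reports the four `*/finalizers` entries, which is one way to spot the difference from the old rule.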
52 changes: 0 additions & 52 deletions manifests/base/crd.yaml

This file was deleted.

29 changes: 0 additions & 29 deletions manifests/base/deployment.yaml

This file was deleted.

File renamed without changes.
File renamed without changes.
19 changes: 8 additions & 11 deletions manifests/base/kustomization.yaml
@@ -2,14 +2,11 @@ apiVersion: kustomize.config.k8s.io/v1beta1
 kind: Kustomization
 namespace: kubeflow
 resources:
-- crd.yaml
-- cluster-role-binding.yaml
-- cluster-role.yaml
-- deployment.yaml
-- service-account.yaml
-- service.yaml
-commonLabels:
-  app: tf-job-operator
-  kustomize.component: tf-job-operator
-  app.kubernetes.io/component: tfjob
-  app.kubernetes.io/name: tf-job-operator
+  - kubeflow.org_tfjobs.yaml
+  - kubeflow.org_mxjobs.yaml
+  - kubeflow.org_pytorchjobs.yaml
+  - kubeflow.org_xgboostjobs.yaml
+  - cluster-role-binding.yaml
+  - cluster-role.yaml
+  - service-account.yaml
+  - service.yaml
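
The four CRD entries in this kustomization follow controller-gen's `<group>_<plural>.yaml` output naming, which is why pointing `output:crd:artifacts:config` at manifests/base lets kustomize pick the generated files up directly. A minimal sketch of that naming; the helper `crd_filename` is hypothetical:

```python
GROUP = "kubeflow.org"
PLURALS = ["tfjobs", "mxjobs", "pytorchjobs", "xgboostjobs"]


def crd_filename(group, plural):
    """Build the CRD file name in the <group>_<plural>.yaml form (hypothetical helper)."""
    return f"{group}_{plural}.yaml"


# File names a kustomization would need to list under `resources`.
CRD_RESOURCES = [crd_filename(GROUP, plural) for plural in PLURALS]
```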
4 changes: 2 additions & 2 deletions manifests/base/service-account.yaml
@@ -2,5 +2,5 @@ apiVersion: v1
 kind: ServiceAccount
 metadata:
   labels:
-    app: tf-job-operator
-  name: tf-job-operator
+    app: training-operator
+  name: training-operator
6 changes: 3 additions & 3 deletions manifests/base/service.yaml
@@ -7,13 +7,13 @@ metadata:
     prometheus.io/scrape: "true"
     prometheus.io/port: "8443"
   labels:
-    app: tf-job-operator
-  name: tf-job-operator
+    app: training-operator
+  name: training-operator
 spec:
   ports:
   - name: monitoring-port
     port: 8443
     targetPort: 8443
   selector:
-    name: tf-job-operator
+    name: training-operator
   type: ClusterIP
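
Because the Service selector changes to `name: training-operator`, the Service only routes to pods that carry that exact label. The Deployment's pod labels are not shown in this diff, so the pod labels below are hypothetical; the sketch just illustrates equality-based selector matching:

```python
def selector_matches(selector, pod_labels):
    """Equality-based match: every selector pair must be present in the pod's labels."""
    return all(pod_labels.get(key) == value for key, value in selector.items())


# Selector taken from manifests/base/service.yaml after this commit.
NEW_SELECTOR = {"name": "training-operator"}

# Hypothetical pod labels, for illustration only.
renamed_pod = {"name": "training-operator", "app": "training-operator"}
stale_pod = {"name": "tf-job-operator", "app": "tf-job-operator"}
```

Under these assumed labels, only `renamed_pod` would be selected; a pod still labeled `tf-job-operator` would be left out.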
File renamed without changes.
File renamed without changes.