Prepare kustomization files for new operator #1322

Closed
Jeffwan opened this issue Aug 2, 2021 · 7 comments

Comments

@Jeffwan
Member

Jeffwan commented Aug 2, 2021

Umbrella issue: #1318

We will need a new folder to host manifests for new operators. https://github.com/kubeflow/tf-operator/tree/master/manifests

This will also be used for integration tests.

/help

@google-oss-robot

@Jeffwan:
This request has been marked as needing help from a contributor.

Please ensure the request meets the requirements listed here.

If this request no longer meets these requirements, the label can be removed
by commenting with the /remove-help command.


@Jeffwan
Member Author

Jeffwan commented Aug 6, 2021

Currently, we use /config, which is generated by kubebuilder, to run the CI tests. Ideally, we should sync the manifests to the /manifests folder, which will also be used by kubeflow/manifests.
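
As a rough illustration of the layout being discussed, a /manifests tree could pair a base kustomization with a thin standalone overlay. This is only a sketch under stated assumptions: the resource file names are hypothetical, the `kubeflow` namespace is assumed, and the files actually used in the repository may differ.

```yaml
# File 1 (hypothetical): manifests/base/kustomization.yaml
# Lists the operator's resources; every file name below is an assumption.
apiVersion: kustomize.config.k8s.io/v1beta1
kind: Kustomization
namespace: kubeflow            # assumed target namespace
resources:
  - cluster-role.yaml          # hypothetical file name
  - cluster-role-binding.yaml  # hypothetical file name
  - service-account.yaml       # hypothetical file name
  - deployment.yaml            # hypothetical file name
---
# File 2 (hypothetical): manifests/overlays/standalone/kustomization.yaml
# Reuses the base unchanged; an overlay like this is what CI or
# a downstream consumer would build.
apiVersion: kustomize.config.k8s.io/v1beta1
kind: Kustomization
namespace: kubeflow            # assumed target namespace
resources:
  - ../../base
```

The two documents are separate files and are shown in one block only for brevity. Keeping the overlay thin means integration tests and downstream consumers can build the same base without duplicating resources.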

@Jeffwan
Member Author

Jeffwan commented Aug 6, 2021

/good-first-issue

@google-oss-robot

@Jeffwan:
This request has been marked as suitable for new contributors.

Please ensure the request meets the requirements listed here.

If this request no longer meets these requirements, the label can be removed
by commenting with the /remove-good-first-issue command.


@deepak-muley
Contributor

I will be posting a fix by end of the day or tomorrow.

deepak-muley added a commit to deepak-muley/tf-operator that referenced this issue Aug 12, 2021
Actions taken:
    - replaced tf-job-operator => training-operator
    - replaced kubeflow-tfjobs- => kubeflow-training-
    - moved crds for mxjobs, tfjobs, pytorchjobs and xgboostjobs from
      config/crd/bases to manifests/base/ and prefixed them with crd_
Ref: kubeflow#1322
Testing steps: To be added
Work in Progress
google-oss-robot pushed a commit that referenced this issue Aug 13, 2021
* 1322: Modified manifests to use all-in-one training-operator WIP

Actions taken:
    - replaced tf-job-operator => training-operator
    - replaced kubeflow-tfjobs- => kubeflow-training-
    - moved crds for mxjobs, tfjobs, pytorchjobs and xgboostjobs from
      config/crd/bases to manifests/base/ and prefixed them with crd_
Ref: #1322
Testing steps: To be added
Work in Progress

* 1322: synced up config/manager with manifests

Training operator was found to be working
<pre>
kubectl -n kubeflow logs -f training-operator-694766989-pp2j4
I0812 21:43:24.739862       1 request.go:645] Throttling request took 1.048945631s, request: GET:https://172.19.0.1:443/apis/networking.k8s.io/v1?timeout=32s
2021-08-12T21:43:25.694Z	INFO	controller-runtime.metrics	metrics server is starting to listen	{"addr": ":8080"}
2021-08-12T21:43:25.790Z	INFO	setup	starting manager
2021-08-12T21:43:25.790Z	INFO	controller-runtime.manager	starting metrics server	{"path": "/metrics"}
2021-08-12T21:43:25.790Z	INFO	controller-runtime.manager.controller.tf-operator	Starting EventSource	{"source": "kind source: /, Kind="}
2021-08-12T21:43:25.790Z	INFO	controller-runtime.manager.controller.mxnet-operator	Starting EventSource	{"source": "kind source: /, Kind="}
2021-08-12T21:43:25.791Z	INFO	controller-runtime.manager.controller.pytorchjob-operator	Starting EventSource	{"source": "kind source: /, Kind="}
2021-08-12T21:43:25.791Z	INFO	controller-runtime.manager.controller.xgboostjob-operator	Starting EventSource	{"source": "kind source: /, Kind="}
2021-08-12T21:43:26.289Z	INFO	controller-runtime.manager.controller.xgboostjob-operator	Starting EventSource	{"source": "kind source: /, Kind="}
2021-08-12T21:43:26.294Z	INFO	controller-runtime.manager.controller.pytorchjob-operator	Starting EventSource	{"source": "kind source: /, Kind="}
2021-08-12T21:43:26.589Z	INFO	controller-runtime.manager.controller.mxnet-operator	Starting EventSource	{"source": "kind source: /, Kind="}
2021-08-12T21:43:26.688Z	INFO	controller-runtime.manager.controller.tf-operator	Starting EventSource	{"source": "kind source: /, Kind="}
2021-08-12T21:43:26.889Z	INFO	controller-runtime.manager.controller.tf-operator	Starting EventSource	{"source": "kind source: /, Kind="}
2021-08-12T21:43:26.889Z	INFO	controller-runtime.manager.controller.pytorchjob-operator	Starting EventSource	{"source": "kind source: /, Kind="}
2021-08-12T21:43:26.890Z	INFO	controller-runtime.manager.controller.xgboostjob-operator	Starting EventSource	{"source": "kind source: /, Kind="}
2021-08-12T21:43:26.890Z	INFO	controller-runtime.manager.controller.mxnet-operator	Starting EventSource	{"source": "kind source: /, Kind="}
2021-08-12T21:43:26.990Z	INFO	controller-runtime.manager.controller.xgboostjob-operator	Starting Controller
2021-08-12T21:43:26.990Z	INFO	controller-runtime.manager.controller.tf-operator	Starting Controller
2021-08-12T21:43:26.990Z	INFO	controller-runtime.manager.controller.tf-operator	Starting workers	{"worker count": 1}
2021-08-12T21:43:26.990Z	INFO	controller-runtime.manager.controller.pytorchjob-operator	Starting Controller
2021-08-12T21:43:26.991Z	INFO	controller-runtime.manager.controller.xgboostjob-operator	Starting workers	{"worker count": 1}
2021-08-12T21:43:26.991Z	INFO	controller-runtime.manager.controller.pytorchjob-operator	Starting workers	{"worker count": 1}
2021-08-12T21:43:26.991Z	INFO	controller-runtime.manager.controller.mxnet-operator	Starting Controller
2021-08-12T21:43:26.991Z	INFO	controller-runtime.manager.controller.mxnet-operator	Starting workers	{"worker count": 1}
</pre>

* 1322: incorporated review comments - added all resources in ClusterRole

* 1322: incorporated review comments

- now controller-gen generates the crds directly in manifests/base
  instead of config/crd/bases
- updated setup-training-operator.sh to use manifests/overlays/standalone

* 1322: removed config/crd/bases as it is now generated in manifests

* 1322: incorporated review comments related to using separate role files

* 1322: removed image name replacement
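
To make the ClusterRole item in the commit list above more concrete, a single role covering all four job kinds might look roughly like the following. This is a sketch, not the file from the repository: the role name, the verbs, and the core resources listed are assumptions, while the `kubeflow.org` API group and the job resource names follow the CRDs named in the commit message.

```yaml
# Hypothetical ClusterRole for an all-in-one training operator.
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: training-operator      # assumed name
rules:
  # Custom resources reconciled by the four controllers.
  - apiGroups: ["kubeflow.org"]
    resources:
      - tfjobs
      - tfjobs/status
      - mxjobs
      - mxjobs/status
      - pytorchjobs
      - pytorchjobs/status
      - xgboostjobs
      - xgboostjobs/status
    verbs: ["get", "list", "watch", "create", "update", "patch", "delete"]  # assumed verbs
  # Core resources the controllers would typically manage for each job
  # (assumed list).
  - apiGroups: [""]
    resources: ["pods", "services", "events"]
    verbs: ["get", "list", "watch", "create", "update", "patch", "delete"]  # assumed verbs
```

Grouping the job kinds under one rule keeps the role short; splitting rules into separate role files, as a later commit in the list describes, is an equally valid choice.
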
@Jeffwan
Member Author

Jeffwan commented Aug 13, 2021

/priority p0

@Jeffwan
Member Author

Jeffwan commented Aug 15, 2021

This can be closed. Leader election can be a separate story; it's not a blocking issue.

Jeffwan closed this as completed Aug 15, 2021