Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

All in one operator rebase merge to master #1320

Merged
merged 13 commits into from
Aug 5, 2021
Merged

Conversation

Jeffwan
Copy link
Member

@Jeffwan Jeffwan commented Aug 2, 2021

Phase I (#1298 ) and Phase II (#1299 ) work are done and all-in-one-operator branch pass e2e test.
Original tensorflow controller can coexist with refactored operator.

Since we won't have massive refactor work. I think it's a good time to merge back to master.
Phase III work can be done against master laster #1318

What do you think?

/cc @kubeflow/wg-training-leads

zw0610 and others added 12 commits July 5, 2021 11:53
* Resolve GroupVersion redeclared issue

`go build` will fail since same group version is redeclared in groupversion_info.go

Signed-off-by: Jiaxin Shan <[email protected]>

* Generate tfjob 1.19.x compatible clientset

1. Update dependencies to 1.19.x compatible
2. Update update-codegen.sh to use pkg from go modules pkg
3. Generate clientsets which supports context

Signed-off-by: Jiaxin Shan <[email protected]>
* Add XGBoost controller

1. Move major controller logics from kubeflow/xgboost-operator to this repo.
2. Adapt to kubebuilder 3.0.0 change
3. Add xgboost train example
4. Leave some TODOs for code refactor later

Signed-off-by: Jiaxin Shan <[email protected]>

* Move examples from config to example/ folder
* add tensorflow api

* check in tensorflow reconciler

fix issues from comment
- Generate MXJob api and manifests
- Rewrite MXNet controller logic in reconciler pattern

Signed-off-by: Jiaxin Shan <[email protected]>
* Remove duplicate FakeWorkQueue

Signed-off-by: Jiaxin Shan <[email protected]>

* Extract utility to common

Move ConvertPodList, ConvertServiceList to common library

Signed-off-by: Jiaxin Shan <[email protected]>

* Move gang scheduing setting to common

Signed-off-by: Jiaxin Shan <[email protected]>

* Extract OnDependentCreate/Delete handler to utils

Refactor follow methods to job independent and move to common utils
- OnDependentCreateFunc
- OnDependentDeleteFunc

Signed-off-by: Jiaxin Shan <[email protected]>

* Refactor SatisfiedExpectations to a common method

Signed-off-by: Jiaxin Shan <[email protected]>
* Add docs, register in job apis

- Clean up register.go and use reference in groupversion_info.go (generated by kubebuilder) instead
- Add doc.go and register.go for framework missing them
- Remove global tag “+k8s:deepcopy-gen=package,register” and this should be taken care of by controller-gen.

Signed-off-by: Jiaxin Shan <[email protected]>

* Generate defaults and openapi for all frameworks

1. Update codegen scripts to generate defaulters and openapi spec for all frameworks
2. Create defaults.go for framework which miss it.

Signed-off-by: Jiaxin Shan <[email protected]>

* Add Kind in constants.go

Signed-off-by: Jiaxin Shan <[email protected]>

* Update update-codegen to generate openapi-gen

Signed-off-by: Jiaxin Shan <[email protected]>

* Run goimports to format the files

* Add Plural and Singular to apis constants.go

Signed-off-by: Jiaxin Shan <[email protected]>
* Fix crd installation issue

CustomResourceDefinition.apiextensions.k8s.io "tfjobs.kubeflow.org" is invalid:
spec.validation.openAPIV3Schema.properties[metadata]: Forbidden: must not specify anything other than name and generateName,
but metadata is implicitly specified

Signed-off-by: Jiaxin Shan <[email protected]>

* Clean up controllers

1. Move all ControllerInterface implementations to [framework]_controller.go
2. Clean up job.go pod.go service.go status.go
3. Follow exact same method order here https://github.com/kubeflow/common/blob/9ffa565bc60e08936f7f80cb3f491cf53f256e7f/pkg/apis/common/v1/interface.go

Signed-off-by: Jiaxin Shan <[email protected]>

* Embed common.JobController instead of TFController

I notice previous change embed TFController which means most of the methods still use tfjob client. We need to make controller consistent and `TFJobReconciler` should override `common.JobController`.

With this change, it would be very easy to make original tf operator and new reconciler work together. We also avoid one more refactor when we will remove original TFController codes

Signed-off-by: Jiaxin Shan <[email protected]>
* Use low-level controller and handlers in SetupWithManager

Signed-off-by: Jiaxin Shan <[email protected]>

* Add job validation in reconcile loop

Signed-off-by: Jiaxin Shan <[email protected]>

* Set defaults in onOwnerCreateFunc

Signed-off-by: Jiaxin Shan <[email protected]>

* Correctly update job status in apiserver

Signed-off-by: Jiaxin Shan <[email protected]>

* Remove Flag for Kubeconfig to fix flag redefined

CI reports `flag redefined: kubeconfig` issue and this is due to duplicate flag registration. See #1316 for more details.

Signed-off-by: Jiaxin Shan <[email protected]>

* Fix tensorflow job missing default port issue

This is to fix original controller issue. It leverages init to register default to scheme. In our unversal operator project, we did clean up for register.go and break the case.

For more details, please check #1317 (comment)

Signed-off-by: Jiaxin Shan <[email protected]>
@google-oss-robot google-oss-robot requested a review from a team August 2, 2021 20:41
@google-cla
Copy link

google-cla bot commented Aug 2, 2021

All (the pull request submitter and all commit authors) CLAs are signed, but one or more commits were authored or co-authored by someone other than the pull request submitter.

We need to confirm that all authors are ok with their commits being contributed to this project. Please have them confirm that by leaving a comment that contains only @googlebot I consent. in this pull request.

Note to project maintainer: There may be cases where the author cannot leave a comment, or the comment is not properly detected as consent. In those cases, you can manually confirm consent of the commit author(s), and set the cla label to yes (if enabled on your project).

ℹ️ Googlers: Go here for more info.

@Jeffwan
Copy link
Member Author

Jeffwan commented Aug 2, 2021

/hold

Please do not cancel hold. I will rebase merge it later instead of using bot. The major reason is we want to reserve the change history and don't want to squash all commits into a huge one.

@Jeffwan
Copy link
Member Author

Jeffwan commented Aug 2, 2021

We need to confirm that all authors are ok with their commits being contributed to this project. Please have them confirm that by leaving a comment that contains only @googlebot I consent. in this pull request.

@zw0610 Can you help on this?

@google-cla
Copy link

google-cla bot commented Aug 2, 2021

All (the pull request submitter and all commit authors) CLAs are signed, but one or more commits were authored or co-authored by someone other than the pull request submitter.

We need to confirm that all authors are ok with their commits being contributed to this project. Please have them confirm that by leaving a comment that contains only @googlebot I consent. in this pull request.

Note to project maintainer: There may be cases where the author cannot leave a comment, or the comment is not properly detected as consent. In those cases, you can manually confirm consent of the commit author(s), and set the cla label to yes (if enabled on your project).

ℹ️ Googlers: Go here for more info.

Copy link
Member

@terrytangyuan terrytangyuan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM and +1 to merge manually

@google-oss-robot
Copy link

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: terrytangyuan
To complete the pull request process, please assign johnugeorge after the PR has been reviewed.
You can assign the PR to them by writing /assign @johnugeorge in a comment when ready.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@zw0610
Copy link
Member

zw0610 commented Aug 3, 2021

@googlebot I consent.

@gaocegege
Copy link
Member

/cc @kubeflow/wg-automl-leads

@Jeffwan
Copy link
Member Author

Jeffwan commented Aug 3, 2021

Note, we don't bring new changes to master. All the commits in this PR come from @zw0610 and me and they have been approved and merged in all-in-one-operator branch. If you think this change is too big, feel free to click original PR link to check background. This may help you expedite the review process.

image

@gaocegege
Copy link
Member

SGTM

@Jeffwan
Copy link
Member Author

Jeffwan commented Aug 3, 2021

/cc @johnugeorge @andreyvelich any comments?

@andreyvelich
Copy link
Member

Sounds good for the changes @Jeffwan.

Original tensorflow controller can coexist with refactored operator.

Since we are making changes for the tf controller, do we still need to build separate image for tf-operator controller or we will use common kubeflow-training image once we merge this PR ?

run: manifests generate fmt vet ## Run a controller from your host.
go run ./main.go

docker-build: test ## Build docker image with the manager.
Copy link
Member

@andreyvelich andreyvelich Aug 3, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Where will we store Dockerfile for the training-operator image ?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

#1325 ./build/images/training-operator/Dockerfile

Maybe we should merge that PR first

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@Jeffwan Does it look good to create separate dir only for one Dockerfile or we are planing to have some other Dockerfiles under ./build/image ?
For examples we store Dockerfile in the same dir where example is located: https://github.com/kubeflow/tf-operator/tree/8430a040ad38f3ea1f2fd4b3f077d0871b4cdd52/examples/tf_sample.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You mean something like Dockerfile.tf and Dockerfile.training in the same folder? I think that make sense. I can document this change and make some changes later.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, we can do that.
Or we can move Dockerfile to cmd/training-operator.v1/Dockerfile similar to main.go.

Copy link
Member Author

@Jeffwan Jeffwan Aug 5, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Got your point. Currently, I follow existing way to organize the files. I think these be changes for sure. Originally, we put Dockerfile and main.go in separate folders. We can consider to put them together later. I create an improvement issue to track it. #1333

build
└── images
    ├── tf_operator
    │   └── Dockerfile
cmd
├── tf-operator.v1
│   ├── app
│   │   ├── options
│   │   │   └── options.go
│   │   └── server.go
│   └── main.go

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @Jeffwan!

Makefile Show resolved Hide resolved
Makefile Show resolved Hide resolved
@@ -0,0 +1,123 @@
/*
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we can move this file to cmd/main.go within Dockerfile in the future.
What do you think @Jeffwan ?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sounds good. cmd/training-operator.v1/main.go

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi Andrey, I include this change in a separate PR. This PR includes most changes we made in last two stages. There's follow up PR coming to address these suggestions.

pkg/apis/mxnet/v1/groupversion_info.go Show resolved Hide resolved
@@ -57,7 +57,7 @@ func NewForConfig(c *rest.Config) (*Clientset, error) {
configShallowCopy := *c
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We need to remove clientset in the future once we finish implementation.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This need more discussion. Do we want this repo to host all client sets or a separate repo?

We are not sure which project interact with training crds. If there's any, we should prioritize the story

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sounds good, let's track it somewhere.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I create an issue here #1332 to track this discussion

@Jeffwan
Copy link
Member Author

Jeffwan commented Aug 5, 2021

I think I addressed all the concerns and suggestions. Some of the improvements are not made in this PR. This PR won't include any direct changes. This is a rebase change from all-in-one-operator to master. Any changes should be made against all-one-one-operator branch or master directly after it merged

For p1 issues, changes are included in #1325.
For p2 issues, I create issues and they will resolve it later.

This is a clean PR to make original test work. It's safe to merge it separately.

@Jeffwan
Copy link
Member Author

Jeffwan commented Aug 5, 2021

/hold cancel

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants