-
Notifications
You must be signed in to change notification settings - Fork 697
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
All in one operator rebase merge to master #1320
Conversation
* Resolve GroupVersion redeclared issue `go build` will fail since same group version is redeclared in groupversion_info.go Signed-off-by: Jiaxin Shan <[email protected]> * Generate tfjob 1.19.x compatible clientset 1. Update dependencies to 1.19.x compatible 2. Update update-codegen.sh to use pkg from go modules pkg 3. Generate clientsets which supports context Signed-off-by: Jiaxin Shan <[email protected]>
Signed-off-by: Jiaxin Shan <[email protected]>
* Add XGBoost controller 1. Move major controller logics from kubeflow/xgboost-operator to this repo. 2. Adapt to kubebuilder 3.0.0 change 3. Add xgboost train example 4. Leave some TODOs for code refactor later Signed-off-by: Jiaxin Shan <[email protected]> * Move examples from config to example/ folder
* add tensorflow api * check in tensorflow reconciler fix issues from comment
- Generate MXJob api and manifests - Rewrite MXNet controller logic in reconciler pattern Signed-off-by: Jiaxin Shan <[email protected]>
* Remove duplicate FakeWorkQueue Signed-off-by: Jiaxin Shan <[email protected]> * Extract utility to common Move ConvertPodList, ConvertServiceList to common library Signed-off-by: Jiaxin Shan <[email protected]> * Move gang scheduing setting to common Signed-off-by: Jiaxin Shan <[email protected]> * Extract OnDependentCreate/Delete handler to utils Refactor follow methods to job independent and move to common utils - OnDependentCreateFunc - OnDependentDeleteFunc Signed-off-by: Jiaxin Shan <[email protected]> * Refactor SatisfiedExpectations to a common method Signed-off-by: Jiaxin Shan <[email protected]>
* Add docs, register in job apis - Clean up register.go and use reference in groupversion_info.go (generated by kubebuilder) instead - Add doc.go and register.go for framework missing them - Remove global tag “+k8s:deepcopy-gen=package,register” and this should be taken care of by controller-gen. Signed-off-by: Jiaxin Shan <[email protected]> * Generate defaults and openapi for all frameworks 1. Update codegen scripts to generate defaulters and openapi spec for all frameworks 2. Create defaults.go for framework which miss it. Signed-off-by: Jiaxin Shan <[email protected]> * Add Kind in constants.go Signed-off-by: Jiaxin Shan <[email protected]> * Update update-codegen to generate openapi-gen Signed-off-by: Jiaxin Shan <[email protected]> * Run goimports to format the files * Add Plural and Singular to apis constants.go Signed-off-by: Jiaxin Shan <[email protected]>
* Fix crd installation issue CustomResourceDefinition.apiextensions.k8s.io "tfjobs.kubeflow.org" is invalid: spec.validation.openAPIV3Schema.properties[metadata]: Forbidden: must not specify anything other than name and generateName, but metadata is implicitly specified Signed-off-by: Jiaxin Shan <[email protected]> * Clean up controllers 1. Move all ControllerInterface implementations to [framework]_controller.go 2. Clean up job.go pod.go service.go status.go 3. Follow exact same method order here https://github.com/kubeflow/common/blob/9ffa565bc60e08936f7f80cb3f491cf53f256e7f/pkg/apis/common/v1/interface.go Signed-off-by: Jiaxin Shan <[email protected]> * Embed common.JobController instead of TFController I notice previous change embed TFController which means most of the methods still use tfjob client. We need to make controller consistent and `TFJobReconciler` should override `common.JobController`. With this change, it would be very easy to make original tf operator and new reconciler work together. We also avoid one more refactor when we will remove original TFController codes Signed-off-by: Jiaxin Shan <[email protected]>
* Use low-level controller and handlers in SetupWithManager Signed-off-by: Jiaxin Shan <[email protected]> * Add job validation in reconcile loop Signed-off-by: Jiaxin Shan <[email protected]> * Set defaults in onOwnerCreateFunc Signed-off-by: Jiaxin Shan <[email protected]> * Correctly update job status in apiserver Signed-off-by: Jiaxin Shan <[email protected]> * Remove Flag for Kubeconfig to fix flag redefined CI reports `flag redefined: kubeconfig` issue and this is due to duplicate flag registration. See #1316 for more details. Signed-off-by: Jiaxin Shan <[email protected]> * Fix tensorflow job missing default port issue This is to fix original controller issue. It leverages init to register default to scheme. In our unversal operator project, we did clean up for register.go and break the case. For more details, please check #1317 (comment) Signed-off-by: Jiaxin Shan <[email protected]>
All (the pull request submitter and all commit authors) CLAs are signed, but one or more commits were authored or co-authored by someone other than the pull request submitter. We need to confirm that all authors are ok with their commits being contributed to this project. Please have them confirm that by leaving a comment that contains only Note to project maintainer: There may be cases where the author cannot leave a comment, or the comment is not properly detected as consent. In those cases, you can manually confirm consent of the commit author(s), and set the ℹ️ Googlers: Go here for more info. |
/hold Please do not cancel hold. I will rebase merge it later instead of using bot. The major reason is we want to reserve the change history and don't want to squash all commits into a huge one. |
@zw0610 Can you help on this? |
All (the pull request submitter and all commit authors) CLAs are signed, but one or more commits were authored or co-authored by someone other than the pull request submitter. We need to confirm that all authors are ok with their commits being contributed to this project. Please have them confirm that by leaving a comment that contains only Note to project maintainer: There may be cases where the author cannot leave a comment, or the comment is not properly detected as consent. In those cases, you can manually confirm consent of the commit author(s), and set the ℹ️ Googlers: Go here for more info. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM and +1 to merge manually
[APPROVALNOTIFIER] This PR is NOT APPROVED This pull-request has been approved by: terrytangyuan The full list of commands accepted by this bot can be found here.
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing |
@googlebot I consent. |
/cc @kubeflow/wg-automl-leads |
Note, we don't bring new changes to master. All the commits in this PR come from @zw0610 and me and they have been approved and merged in |
SGTM |
/cc @johnugeorge @andreyvelich any comments? |
Sounds good for the changes @Jeffwan.
Since we are making changes for the tf controller, do we still need to build separate image for tf-operator controller or we will use common |
run: manifests generate fmt vet ## Run a controller from your host. | ||
go run ./main.go | ||
|
||
docker-build: test ## Build docker image with the manager. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Where will we store Dockerfile for the training-operator
image ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
#1325 ./build/images/training-operator/Dockerfile
Maybe we should merge that PR first
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@Jeffwan Does it look good to create separate dir only for one Dockerfile
or we are planing to have some other Dockerfiles
under ./build/image
?
For examples we store Dockerfile
in the same dir where example is located: https://github.com/kubeflow/tf-operator/tree/8430a040ad38f3ea1f2fd4b3f077d0871b4cdd52/examples/tf_sample.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You mean something like Dockerfile.tf
and Dockerfile.training
in the same folder? I think that make sense. I can document this change and make some changes later.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, we can do that.
Or we can move Dockerfile
to cmd/training-operator.v1/Dockerfile
similar to main.go
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Got your point. Currently, I follow existing way to organize the files. I think these be changes for sure. Originally, we put Dockerfile and main.go in separate folders. We can consider to put them together later. I create an improvement issue to track it. #1333
build
└── images
├── tf_operator
│ └── Dockerfile
cmd
├── tf-operator.v1
│ ├── app
│ │ ├── options
│ │ │ └── options.go
│ │ └── server.go
│ └── main.go
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks @Jeffwan!
@@ -0,0 +1,123 @@ | |||
/* |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we can move this file to cmd/main.go
within Dockerfile in the future.
What do you think @Jeffwan ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sounds good. cmd/training-operator.v1/main.go
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hi Andrey, I include this change in a separate PR. This PR includes most changes we made in last two stages. There's follow up PR coming to address these suggestions.
@@ -57,7 +57,7 @@ func NewForConfig(c *rest.Config) (*Clientset, error) { | |||
configShallowCopy := *c |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We need to remove clientset
in the future once we finish implementation.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This need more discussion. Do we want this repo to host all client sets or a separate repo?
We are not sure which project interact with training crds. If there's any, we should prioritize the story
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sounds good, let's track it somewhere.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I create an issue here #1332 to track this discussion
I think I addressed all the concerns and suggestions. Some of the improvements are not made in this PR. This PR won't include any direct changes. This is a rebase change from For p1 issues, changes are included in #1325. This is a clean PR to make original test work. It's safe to merge it separately. |
/hold cancel |
Phase I (#1298 ) and Phase II (#1299 ) work are done and
all-in-one-operator
branch pass e2e test.Original tensorflow controller can coexist with refactored operator.
Since we won't have massive refactor work. I think it's a good time to merge back to master.
Phase III work can be done against master laster #1318
What do you think?
/cc @kubeflow/wg-training-leads