Unified training operator working progress #138
Things to figure out.
|
An update on the above items. @zw0610 @kubeflow/wg-training-leads
Reuse tf-operator and rename it; all issues, commits, followers, and stars will be transferred to the new repo.
Kubernetes 1.19.x, kubebuilder 3.0.0, controller-runtime v0.7.2
reuse our PROW test jobs in
Start from v1 API since we plan to reuse most of the existing specs in phase 1.
Use a separate develop branch in tf-operator (July 16) -> when all features are ready, merge back to master (2 weeks of review by training leads) -> clean up the code base (1 week) -> rename the repo (1 month, in time to catch the 1.4 release). We plan to have an alpha RC release by the training & AutoML summit (July 16). |
Thank you for driving this @Jeffwan!
Is there any limitation that requires us to use Kubernetes 1.19? Can we just jump to 1.20 or even to the latest 1.21 version?
Does it mean that we also drop SDK support? Or are we talking only about clientsets, listers, and informers? |
Yeah, this is flexible. The current repo uses a lower version, so we plan to start with 1.19 and then jump to 1.21 once we merge back to master, just in case some users run a lower version and we want to have a tag or release for them.
Yeah, you are right. The Python SDK will be supported; I meant clientsets. The controller itself uses a higher-level client and doesn't need clientsets. BTW, does Katib use them? |
Sounds good @Jeffwan.
No, we only use the TFJob APIs: https://github.com/kubeflow/katib/blob/master/pkg/webhook/v1beta1/experiment/validator/validator.go#L28 to validate TFJob, etc. This can also be dropped on our side since it isn't necessary. cc @kubeflow/wg-automl-leads |
@Jeffwan Great. Can we merge the code in phases so that review is easier? |
@johnugeorge Sure. I will cc all training leads on PRs coming into the feature branch. |
@zw0610 and I presented the all-in-one training operator proposal at last month's community meeting.
WG-Training leads have already agreed to move forward. This issue was created to track implementation progress. The target for the alpha release of this new unified operator is Kubeflow 1.4.
Configuration and deployment
Custom Resources
Observability
CI/CD
Docs
Owners/Maintenance
Adoption