This repository was archived by the owner on Sep 12, 2023. It is now read-only.
This repository was archived by the owner on Sep 12, 2023. It is now read-only.
Supplement more information about tfjob into podgroup when enable-gang-scheduler
is set true
in tf-operator #112
Open
Description
When enable-gang-scheduler=true
, tf-operator will create CRD podgroup
to permit gang scheduler volcano to allocate the pods. but when createing pod in func SyncPodGroup
:
createPodGroup := &v1beta1.PodGroup{
ObjectMeta: metav1.ObjectMeta{
Name: job.GetName(),
OwnerReferences: []metav1.OwnerReference{
*jc.GenOwnerReference(job),
},
},
Spec: v1beta1.PodGroupSpec{
MinMember: minAvailable.IntVal,
},
}
createdPodGroup, err := volcanoClientSet.SchedulingV1beta1().PodGroups(job.GetNamespace()).Create(createPodGroup)
Only pod infos and minMember
is set in podgroup, which resulting to function missing, as well as unpredicatable bugs during allocation.
For example, since no minResources
field is filled in podgroup, gang scheduler volcano cannot diff tfjobs from bestEffort jobs as both of the two jobs owns nil minResources
, causing all tfjobs can be inqueue
and action enqueue
, reserve
lose effort.
So in my opinion, we need to supplement more infos about tfjob into podgroup, such as minMember
, queue
as well as other fields, so as to make sure gang scheduler workers correctly.
Metadata
Metadata
Assignees
Labels
No labels