Skip to content
This repository was archived by the owner on Sep 12, 2023. It is now read-only.
This repository was archived by the owner on Sep 12, 2023. It is now read-only.

Supplement more information about tfjob into podgroup when enable-gang-scheduler is set true in tf-operator #112

Open
@jiangkaihua

Description

@jiangkaihua

When enable-gang-scheduler=true, tf-operator will create CRD podgroup to permit gang scheduler volcano to allocate the pods. but when createing pod in func SyncPodGroup:

func (jc *JobController) SyncPodGroup(job metav1.Object, minAvailableReplicas int32) (*v1beta1.PodGroup, error) {

	createPodGroup := &v1beta1.PodGroup{
		ObjectMeta: metav1.ObjectMeta{
			Name: job.GetName(),
			OwnerReferences: []metav1.OwnerReference{
				*jc.GenOwnerReference(job),
			},
		},
		Spec: v1beta1.PodGroupSpec{
			MinMember: minAvailable.IntVal,
		},
	}
	createdPodGroup, err := volcanoClientSet.SchedulingV1beta1().PodGroups(job.GetNamespace()).Create(createPodGroup)

Only pod infos and minMember is set in podgroup, which resulting to function missing, as well as unpredicatable bugs during allocation.

For example, since no minResources field is filled in podgroup, gang scheduler volcano cannot diff tfjobs from bestEffort jobs as both of the two jobs owns nil minResources, causing all tfjobs can be inqueue and action enqueue , reserve lose effort.

So in my opinion, we need to supplement more infos about tfjob into podgroup, such as minMember, queue as well as other fields, so as to make sure gang scheduler workers correctly.

Metadata

Metadata

Assignees

No one assigned

    Labels

    No labels
    No labels

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions