Skip to content

Add a metric to track job creation to pod creation time. #13


Open
wants to merge 1 commit into main

Conversation


@wonderyl wonderyl commented Jan 15, 2025

What type of PR is this?

/kind feature

What this PR does / why we need it:

Adds a metric to measure latency from job creation to pod creation in JobLifecycleLatency.
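As a rough sketch of the idea (illustrative only; the helper name below is hypothetical and the real logic lives inside the JobLifecycleLatency measurement), the new metric boils down to the gap between a Job's creation timestamp and the creation timestamp of its earliest pod:

package jobmetrics

import (
	"time"

	batchv1 "k8s.io/api/batch/v1"
	corev1 "k8s.io/api/core/v1"
)

// createToPodStartLatency is a hypothetical helper: it returns how long it
// took from job creation until the earliest of the job's pods was created.
// The second return value is false when no pods have been observed yet.
func createToPodStartLatency(job *batchv1.Job, pods []*corev1.Pod) (time.Duration, bool) {
	var earliest time.Time
	for _, pod := range pods {
		created := pod.CreationTimestamp.Time
		if earliest.IsZero() || created.Before(earliest) {
			earliest = created
		}
	}
	if earliest.IsZero() {
		return 0, false
	}
	return earliest.Sub(job.CreationTimestamp.Time), true
}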

Which issue(s) this PR fixes:

n/a

Special notes for your reviewer:

Passed unit tests:

$ go test ./...
ok      k8s.io/perf-tests/clusterloader2/api    (cached)
?       k8s.io/perf-tests/clusterloader2/cmd    [no test files]
?       k8s.io/perf-tests/clusterloader2/pkg/chaos      [no test files]
?       k8s.io/perf-tests/clusterloader2/pkg/errors     [no test files]
?       k8s.io/perf-tests/clusterloader2/pkg/execservice        [no test files]
?       k8s.io/perf-tests/clusterloader2/pkg/flags      [no test files]
?       k8s.io/perf-tests/clusterloader2/pkg/framework  [no test files]
ok      k8s.io/perf-tests/clusterloader2/pkg/config     (cached)
?       k8s.io/perf-tests/clusterloader2/pkg/framework/config   [no test files]
?       k8s.io/perf-tests/clusterloader2/pkg/imagepreload       [no test files]
ok      k8s.io/perf-tests/clusterloader2/pkg/framework/client   (cached)
?       k8s.io/perf-tests/clusterloader2/pkg/measurement        [no test files]
?       k8s.io/perf-tests/clusterloader2/pkg/measurement/common/bundle  [no test files]
?       k8s.io/perf-tests/clusterloader2/pkg/measurement/common/dns     [no test files]
?       k8s.io/perf-tests/clusterloader2/pkg/measurement/common/executors       [no test files]
?       k8s.io/perf-tests/clusterloader2/pkg/measurement/common/metrics [no test files]
?       k8s.io/perf-tests/clusterloader2/pkg/measurement/common/network [no test files]
?       k8s.io/perf-tests/clusterloader2/pkg/measurement/common/network-policy  [no test files]
?       k8s.io/perf-tests/clusterloader2/pkg/measurement/common/probes  [no test files]
?       k8s.io/perf-tests/clusterloader2/pkg/measurement/util/checker   [no test files]
?       k8s.io/perf-tests/clusterloader2/pkg/measurement/util/gatherers [no test files]
?       k8s.io/perf-tests/clusterloader2/pkg/measurement/util/informer  [no test files]
?       k8s.io/perf-tests/clusterloader2/pkg/measurement/util/kubelet   [no test files]
?       k8s.io/perf-tests/clusterloader2/pkg/measurement/util/kubemark  [no test files]
?       k8s.io/perf-tests/clusterloader2/pkg/measurement/util/workerqueue       [no test files]
?       k8s.io/perf-tests/clusterloader2/pkg/metadata   [no test files]
?       k8s.io/perf-tests/clusterloader2/pkg/prometheus/clients [no test files]
?       k8s.io/perf-tests/clusterloader2/pkg/provider   [no test files]
?       k8s.io/perf-tests/clusterloader2/pkg/state      [no test files]
?       k8s.io/perf-tests/clusterloader2/pkg/test       [no test files]
?       k8s.io/perf-tests/clusterloader2/pkg/tuningset  [no test files]
ok      k8s.io/perf-tests/clusterloader2/pkg/measurement/common 5.871s
ok      k8s.io/perf-tests/clusterloader2/pkg/measurement/common/slos    5.515s
ok      k8s.io/perf-tests/clusterloader2/pkg/measurement/util   (cached)
ok      k8s.io/perf-tests/clusterloader2/pkg/measurement/util/runtimeobjects    (cached)
ok      k8s.io/perf-tests/clusterloader2/pkg/modifier   (cached)
ok      k8s.io/perf-tests/clusterloader2/pkg/prometheus (cached)
ok      k8s.io/perf-tests/clusterloader2/pkg/util       (cached)

@wonderyl wonderyl self-assigned this Jan 15, 2025
@wonderyl (Author)

@microsoft-github-policy-service agree

@wonderyl wonderyl changed the base branch from master to main January 19, 2025 05:48
@@ -60,6 +61,8 @@ func createJobLifecycleLatencyMeasurement() measurement.Measurement {
selector: util.NewObjectSelector(),
jobStateEntries: measurementutil.NewObjectTransitionTimes(jobLifecycleLatencyMeasurementName),
eventQueue: workqueue.New(),
podCreationTime: measurementutil.NewPodCreationEventTimes(),
eventTicker: time.NewTicker(time.Minute),
Collaborator

This means that the measurement will run every 1 minute to collect data, right? I wonder if we should use one second for a more fine-grained result, or allow users to customize the frequency with a variable?

Author

Correct me if I'm wrong, but the events are stored in etcd for a while, and every time we list, it returns all the events in its cache. As long as the interval is smaller than the lifespan of the cache, it should not miss anything.

Author

Talked to Anson; I'm working on a new version that uses a pod informer instead of ListEvents.
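For reference, a minimal sketch of that informer-based direction using plain client-go (the function and parameter names are assumptions, not the PR's actual code): an AddFunc handler records each pod's creation time as soon as the watch delivers it, instead of polling ListEvents.

package podwatch

import (
	"time"

	corev1 "k8s.io/api/core/v1"
	"k8s.io/client-go/informers"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/cache"
)

// watchPodCreations is a hypothetical sketch: it starts a shared pod informer
// and calls record with each pod's creation timestamp when the pod is first seen.
func watchPodCreations(c kubernetes.Interface, stopCh <-chan struct{}, record func(podName string, created time.Time)) {
	factory := informers.NewSharedInformerFactory(c, 0)
	podInformer := factory.Core().V1().Pods().Informer()
	podInformer.AddEventHandler(cache.ResourceEventHandlerFuncs{
		AddFunc: func(obj interface{}) {
			pod, ok := obj.(*corev1.Pod)
			if !ok {
				return
			}
			record(pod.Name, pod.CreationTimestamp.Time)
		},
	})
	factory.Start(stopCh)
	factory.WaitForCacheSync(stopCh)
}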

}
}

time.Sleep(2 * time.Second)
Collaborator

Why do we have to wait for 2 seconds in the test?

Author

The ticker is set to 1 second. The test sleeps for 2 seconds so that at least one tick has fired.
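In other words, the test relies on sleeping for at least twice the tick period so one tick is guaranteed to have been processed before the assertions run. Roughly (an illustrative sketch, not the test's actual code):

package ticksketch

import (
	"testing"
	"time"
)

// TestTickFires sketches the timing pattern: with a 1-second ticker, sleeping
// for 2 seconds guarantees the goroutine has observed at least one tick.
func TestTickFires(t *testing.T) {
	ticked := make(chan struct{}, 1)
	ticker := time.NewTicker(time.Second)
	defer ticker.Stop()
	go func() {
		<-ticker.C
		ticked <- struct{}{}
	}()
	time.Sleep(2 * time.Second)
	select {
	case <-ticked:
		// at least one tick fired, as the real test assumes
	default:
		t.Fatal("expected at least one tick within 2 seconds")
	}
}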

}
gotP50 := getMetric(createToPodStart, "Perc50")
if gotP50 != 3*time.Minute {
t.Errorf("Expect create_to_pod_start Perc50 = 3 minutes, got %v", gotP50)
Collaborator

The majority of measurements report time in seconds, so you might want to consider using seconds instead of minutes to align with upstream.

Author

"Second" is tested in Line 170, right? Are you asking for every test case to use seconds as unit?

@@ -265,6 +262,13 @@ func ListEvents(c clientset.Interface, namespace string, name string, options ..
return obj, nil
}

// ListEvents retrieves events for the object with the given name.
func ListEvents(c clientset.Interface, namespace string, name string, options ...*APICallOptions) (obj *apiv1.EventList, err error) {
Collaborator

Can we change the order of these two functions? It will make resolving conflicts easier.

@@ -130,6 +135,7 @@ func (p *jobLifecycleLatencyMeasurement) start(c clientset.Interface) error {
p.addEvent,
)
go p.processEvents()
go measurementutil.RunEveryTick(p.eventTicker, p.getFuncToListJobEvents(c), p.stopCh)
Collaborator

Or do we have to measure periodically? Could it be event-driven instead?
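For context, a RunEveryTick-style helper presumably amounts to something like the following (a sketch under that assumption; the real signature may differ): it invokes the callback on every tick until the stop channel closes, which is the periodic polling this comment questions versus an event-driven informer handler.

package measurementutil

import "time"

// RunEveryTick is a sketch of the assumed helper: it calls f on every tick of
// the ticker until stopCh is closed, then stops the ticker and returns.
func RunEveryTick(ticker *time.Ticker, f func(), stopCh <-chan struct{}) {
	for {
		select {
		case <-ticker.C:
			f()
		case <-stopCh:
			ticker.Stop()
			return
		}
	}
}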
