-
Notifications
You must be signed in to change notification settings - Fork 7
Add a metric to track job creation to pod creation time. #13
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
@microsoft-github-policy-service agree |
@@ -60,6 +61,8 @@ func createJobLifecycleLatencyMeasurement() measurement.Measurement { | |||
selector: util.NewObjectSelector(), | |||
jobStateEntries: measurementutil.NewObjectTransitionTimes(jobLifecycleLatencyMeasurementName), | |||
eventQueue: workqueue.New(), | |||
podCreationTime: measurementutil.NewPodCreationEventTimes(), | |||
eventTicker: time.NewTicker(time.Minute), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This means that the measurement will run very 1 minute to collect data right? I wonder if we should do second for more fine-grained result? Or allow users to customize the frequency with a variable?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Correct me if I'm wrong, but the events are stored in ETCD for a while, every time we list, it will return all the events in its cache. As long as the interval is smaller than the life span the cache, it should not miss anything.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Talked to Anson, I'm working on a new version that uses informer of pods instead of ListEvents.
} | ||
} | ||
|
||
time.Sleep(2 * time.Second) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why do we have to wait for 2 seconds in the test?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The ticker is set to 1 second. It sleeps for 2 seconds so that at least 1 tick is done.
} | ||
gotP50 := getMetric(createToPodStart, "Perc50") | ||
if gotP50 != 3*time.Minute { | ||
t.Errorf("Expect create_to_pod_start Perc50 = 3 minutes, got %v", gotP50) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The majority of measurement reports time in second so you might want to consider second instead of minute to align with upstream
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
"Second" is tested in Line 170, right? Are you asking for every test case to use seconds as unit?
@@ -265,6 +262,13 @@ func ListEvents(c clientset.Interface, namespace string, name string, options .. | |||
return obj, nil | |||
} | |||
|
|||
// ListEvents retrieves events for the object with the given name. | |||
func ListEvents(c clientset.Interface, namespace string, name string, options ...*APICallOptions) (obj *apiv1.EventList, err error) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can we change the order of these two functions, it will make resolve conflicts easier
@@ -130,6 +135,7 @@ func (p *jobLifecycleLatencyMeasurement) start(c clientset.Interface) error { | |||
p.addEvent, | |||
) | |||
go p.processEvents() | |||
go measurementutil.RunEveryTick(p.eventTicker, p.getFuncToListJobEvents(c), p.stopCh) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
or do we have to measure periodically? or it can be event driven
What type of PR is this?
/kind feature
What this PR does / why we need it:
Adds a metric to measure latency from job creation to pod creation in JobLifecycleLatency.
Which issue(s) this PR fixes:
n/a
Special notes for your reviewer:
Passed unit tests