Add more comprehensive performance metrics #150

allenwang28 · 2020-05-13T17:33:31Z

E.g.:

- p50, p95, p99 of examples/sec
- Start-up time and wall time

zcain117 · 2020-05-13T17:46:04Z

This repo mainly passes metrics that the user computes - I don't think there's any way to get examples/sec after the test is over if the user's test code hasn't written that to Tensorboard. That would be a change to make in the model code.

total_wall_time is already being computed for all the tests - you can see an example here

What would be a good definition for start up time?

zcain117 · 2020-05-13T17:48:25Z

Oh, maybe you meant adding support for percentiles of any metric written to Tensorboard, rather than trying to compute examples/sec. That should be doable.
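To make the percentile idea concrete, here is a minimal sketch of how p50/p95/p99 could be computed over the values already logged for a single tag. It is not this repo's implementation: `percentile`, `summarize`, and the sample values are hypothetical, and it assumes the values for a tag have already been read out of the Tensorboard event files into a plain list.

```python
import math


def percentile(values, q):
    """Return the q-th percentile (0-100) of values, with linear interpolation."""
    if not values:
        raise ValueError("percentile() needs at least one value")
    ordered = sorted(values)
    # Fractional rank into the sorted list, e.g. q=50 over 5 items gives rank 2.0.
    rank = (len(ordered) - 1) * q / 100.0
    lower = math.floor(rank)
    upper = math.ceil(rank)
    fraction = rank - lower
    return ordered[lower] + (ordered[upper] - ordered[lower]) * fraction


def summarize(values, quantiles=(50, 95, 99)):
    """Build a dict such as {"p50": ..., "p95": ..., "p99": ...} for one tag."""
    return {f"p{q}": percentile(values, q) for q in quantiles}


if __name__ == "__main__":
    # Hypothetical examples/sec values a model might have written to Tensorboard.
    examples_per_sec = [100.0, 102.0, 98.0, 101.0, 250.0]
    print(summarize(examples_per_sec))
```

Because this only needs the list of logged values, the same helper would work for any tag, not just examples/sec.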

allenwang28 · 2020-05-13T17:49:46Z

Yep! I think mostly percentile support is what I had in mind for this feature request.

I think start-up time is not as important, but it would be the time from when the command executes to when training starts.

I think time to accuracy would be another important statistic.

zcain117 · 2020-05-13T18:04:20Z

time_to_accuracy is also available now. A sample config that includes it: https://github.com/GoogleCloudPlatform/ml-testing-accelerators/tree/master/metrics_handler#metric_collection_config
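For readers unfamiliar with the metric, a rough sketch of one way time to accuracy can be defined: the elapsed wall time until an accuracy tag first reaches a target value. This is only an illustration under assumptions, not the metrics handler's actual logic. The function name, the `(wall_time, value)` input shape, and the choice of the first logged point as the default start are all hypothetical.

```python
def time_to_accuracy(points, threshold, start_time=None):
    """Seconds until the metric first reaches threshold, or None if it never does.

    points: iterable of (wall_time_seconds, value) pairs for one accuracy tag.
    start_time: reference time in seconds; defaults to the earliest logged point.
    """
    ordered = sorted(points)  # sort by wall time
    if not ordered:
        return None
    reference = ordered[0][0] if start_time is None else start_time
    for wall_time, value in ordered:
        if value >= threshold:
            return wall_time - reference
    return None


if __name__ == "__main__":
    # Hypothetical accuracy values logged once a minute.
    logged = [(100.0, 0.5), (160.0, 0.7), (220.0, 0.76), (280.0, 0.8)]
    print(time_to_accuracy(logged, threshold=0.75))
```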

Start-up time is possible, but the user would need to write some event to Tensorboard to indicate that training has started. As a first step, we could just grab the earliest Tensorboard entry of any kind and use the delta between the job start time and that earliest entry.
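The first-step approach could look roughly like the sketch below. It assumes the job start time and the wall times of all Tensorboard entries are available as seconds since the Unix epoch. `startup_seconds` and the sample numbers are hypothetical, not existing code in this repo.

```python
def startup_seconds(job_start_time, event_wall_times):
    """Approximate start-up time as (earliest Tensorboard entry - job start).

    job_start_time: job start as seconds since the Unix epoch.
    event_wall_times: wall times of every Tensorboard entry, of any kind.
    Returns None when nothing was written to Tensorboard.
    """
    wall_times = list(event_wall_times)
    if not wall_times:
        return None
    return min(wall_times) - job_start_time


if __name__ == "__main__":
    # Hypothetical values: the job started at t=1000s and the first entry landed at t=1012s.
    print(startup_seconds(1000.0, [1030.5, 1012.0, 1100.0]))
```

This overestimates start-up time if the model only writes its first summary some steps into training, which is why an explicit "training started" event would be more accurate.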
