Add more comprehensive performance metrics #150

Open
allenwang28 opened this issue May 13, 2020 · 4 comments

Comments

@allenwang28
Collaborator

E.g.

  • p50, p95, p99 of examples/sec
  • Start up and wall time
@zcain117
Contributor

This repo mainly passes along metrics that the user computes. I don't think there's any way to get examples/sec after the test is over if the user's test code hasn't written it to Tensorboard, so that would be a change to make in the model code.

total_wall_time is already being computed for all the tests - you can see an example here

What would be a good definition for start up time?

@zcain117
Contributor

Oh, maybe you meant adding percentile support for any metric written to Tensorboard, rather than trying to compute examples/sec. That should be doable.
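For illustration, percentile support over the values of a single metric could look roughly like the sketch below. It assumes the metric's values have already been read out of Tensorboard into a plain list of floats. The `percentile` and `summarize` helpers are hypothetical names, not part of this repo, and linear interpolation is just one of several reasonable percentile definitions.

```python
import math


def percentile(values, q):
    """Return the q-th percentile (0-100) of values, using linear interpolation."""
    if not values:
        raise ValueError("values must be non-empty")
    if not 0 <= q <= 100:
        raise ValueError("q must be between 0 and 100")
    ordered = sorted(values)
    # Fractional index into the sorted list.
    rank = (len(ordered) - 1) * q / 100.0
    lower = math.floor(rank)
    upper = math.ceil(rank)
    if lower == upper:
        return ordered[lower]
    weight = rank - lower
    return ordered[lower] * (1 - weight) + ordered[upper] * weight


def summarize(values, percentiles=(50, 95, 99)):
    """Hypothetical helper: map 'p50', 'p95', ... to the computed percentile."""
    return {f"p{q}": percentile(values, q) for q in percentiles}


# Made-up examples/sec readings, for illustration only.
examples_per_sec = [1.0, 2.0, 3.0, 4.0, 5.0]
summary = summarize(examples_per_sec)
```

The same function would work for any scalar metric, since it only needs the list of recorded values.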

@allenwang28
Collaborator Author

Yep! I think mostly percentile support is what I had in mind for this feature request.

I think start up time is not as important, but I'd define it as the time from when the command executes to when training starts.

Another important statistic, I think, would be time to accuracy.
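To make the idea concrete, here is a minimal sketch of one way time to accuracy could be defined: the elapsed time from job start until the accuracy metric first reaches a target. It assumes the accuracy readings are available as `(wall_time, accuracy)` pairs with Unix timestamps in seconds. The function name and arguments are hypothetical, and this is not necessarily how the repo computes it.

```python
def time_to_accuracy(job_start_time, points, threshold):
    """Seconds from job start until accuracy first reaches threshold.

    points: iterable of (wall_time, accuracy) pairs (assumed format).
    Returns None if the threshold is never reached.
    """
    # Sort by wall time so the first match is the earliest one.
    for wall_time, accuracy in sorted(points):
        if accuracy >= threshold:
            return wall_time - job_start_time
    return None


# Made-up readings, for illustration only.
readings = [(1060.0, 0.50), (1120.0, 0.76), (1180.0, 0.90)]
reached = time_to_accuracy(1000.0, readings, threshold=0.75)
never = time_to_accuracy(1000.0, readings, threshold=0.95)
```

Returning `None` when the target is never reached keeps a failed run distinguishable from a fast one.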

@zcain117
Contributor

time_to_accuracy is also available now. A sample config that includes it: https://github.com/GoogleCloudPlatform/ml-testing-accelerators/tree/master/metrics_handler#metric_collection_config

Start up time is possible, but the user would need to write some event to Tensorboard to indicate that training has started. As a first step, we could just grab the earliest Tensorboard entry of any kind and use the delta between the job start time and that earliest entry.
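The "first step" version could be sketched roughly as follows. It assumes the job start time and the wall times of all Tensorboard entries are already available as Unix timestamps in seconds. The `startup_seconds` name is hypothetical and not part of this repo.

```python
def startup_seconds(job_start_time, event_wall_times):
    """Approximate start up time as (earliest Tensorboard entry) - (job start).

    event_wall_times: wall times of every entry, of any kind (assumed input).
    Returns None when no entries were written at all.
    """
    if not event_wall_times:
        return None
    return min(event_wall_times) - job_start_time


# Made-up timestamps, for illustration only.
delta = startup_seconds(100.0, [130.5, 112.0, 140.0])
```

This only approximates start up time: if the first entry is written well after training begins, the delta will overstate it.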
