Benchmarks with variance #9
base: main
Conversation
Can I get 1 or 2 decimals of precision on the output? Cool idea.
Some questions:
- Are these variance results something that we publish or that end users care about? Or is this more like a dev tool you'll be using most often?
- How often do you expect the variances to change between runs?
- At what variance number do I take action? And if the variance is high, what do I do?
- Is this supposed to run on GitHub Actions Hosted Runners? Or is this destined for Kubernetes performance-benchmarking workers?
- When will this get run? Is this supposed to be triggered by Dolt release events? Will this be triggered by pull request comments?
I don't think this repository is the right spot for this script. This repository is just for sysbench lua scripts.
If this script is supposed to fit into our current performance-benchmarking pipeline, the best (easiest) way to do that is to enable our go benchmarking tool to run in a -variance mode that produces the same results as this script does. There will still be some infrastructure overhead with that approach too, though.
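For illustration only, a `-variance` flag on a Go CLI might look roughly like the sketch below; the flag name, `result` struct, and `runBenchmarks` stub are hypothetical stand-ins, not the benchmarking tool's actual API.

```go
package main

import (
	"flag"
	"fmt"
)

// result is a hypothetical per-test summary; field names are illustrative.
type result struct {
	Name       string
	MeanMs     float64
	VarianceMs float64
}

// runBenchmarks stands in for the tool's existing benchmark loop.
func runBenchmarks() []result { return nil }

func main() {
	variance := flag.Bool("variance", false, "report per-test latency variance alongside the mean")
	flag.Parse()

	for _, r := range runBenchmarks() {
		if *variance {
			fmt.Printf("%s,%.2f,%.2f\n", r.Name, r.MeanMs, r.VarianceMs)
		} else {
			fmt.Printf("%s,%.2f\n", r.Name, r.MeanMs)
		}
	}
}
```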
If the results produced by this script need to be sent via email or posted to a pull request, CSV is not the desired final format. The results should also be available in HTML and Markdown.
If this is just a dev tool you only need to run periodically, you can just check this script into the Dolt repo and run it when you need to on a dev desktop. It doesn't need to be integrated into the performance-benchmarking pipeline. And you can just email people who need the results after you get them.
Re variance questions: I think the question "are there other statistics that help us understand Dolt performance?" is useful in general, and only practical feedback will tell whether this, or quantiles, or confidence intervals, or anything else is actually useful. The variance provides two things: 1) a new piece of information, the event latency distribution, and 2) a way to short-circuit tests that run many events per second (i.e., they don't need to run for 2 minutes to give accurate means for many low-variance tests).

Re usage: I will start using this personally for developing perf changes, and only in that sandbox. The main motivation is that this will make me better at my job by removing the perf pipeline as a PR bottleneck. I'm not adding it to CI or releases/nightly. As long as only I am using this and I'm responsible for benchmarks, I'd like to keep the scripts, helpers, and docs next to each other.

A related hypothesis is that this script, run as a GitHub Actions runner, can do what the existing K8s sysbench runner does in the same amount of time as the sysbench utility tester job (~15 minutes). The two factors I'd judge that hypothesis by are: 1) Is this actually useful in helping me ship changes faster? 2) Do the results diverge from the results of the existing benchmark runner (in particular from hardware/network issues)? I can answer (1) by doing my regular work, which should answer (2) after a few releases. I'd consider the refactor a low-priority, opportunistic change that everyone who looks at perf benchmarks would want to weigh in on.

Re user-facing: we should only show additional stats to users if we think they are high-signal.
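As a concrete illustration of point 2 (not the code in this PR), a running mean and variance can be kept with Welford's algorithm and used to stop a test early once the mean has stabilized; the `tol` and `minEvents` parameters below are hypothetical knobs.

```go
package bench

import "math"

// welford accumulates a running mean and variance without storing samples.
type welford struct {
	n    int
	mean float64
	m2   float64 // sum of squared deviations from the running mean
}

// add folds one latency observation (e.g. event latency in ms) into the stats.
func (w *welford) add(x float64) {
	w.n++
	delta := x - w.mean
	w.mean += delta / float64(w.n)
	w.m2 += delta * (x - w.mean)
}

// variance returns the sample variance of everything added so far.
func (w *welford) variance() float64 {
	if w.n < 2 {
		return 0
	}
	return w.m2 / float64(w.n-1)
}

// stable reports whether the standard error of the mean has dropped below tol,
// i.e. further events are unlikely to move the reported mean much.
func (w *welford) stable(tol float64, minEvents int) bool {
	if w.n < minEvents {
		return false
	}
	return math.Sqrt(w.variance()/float64(w.n)) < tol
}
```

With a check like this, a low-variance test that runs thousands of events per second crosses the threshold in seconds, while a high-variance test keeps running until its time budget expires, which is the short-circuit behavior described above.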
Roger. Yea, just put this script in a different repo with your benchmarking tools, or check it into dolt.
This is a short benchmark runner that loops over 1) a list of Dolt versions (commitish) and 2) a list of scripts to run, and outputs per-test latency mean and variance. I think the variance metrics can be useful for giving context on which tests have variability and for giving faster PR feedback. I intentionally undershot the event count to expose which tests have variance; tests that quickly run enough events to get a realistic variance also terminate early. I also think a test runner not embedded in a specific version of Dolt is valuable. Next steps would be plugging this into a CI trigger, and maybe adding MySQL after.
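A rough sketch of the loop structure described above, assuming hypothetical `checkoutAndBuild` and `runSysbench` helpers; the script names and stubs are placeholders, not the script in this PR.

```go
package main

import "fmt"

// Stubs standing in for the real build and sysbench invocation steps.
func checkoutAndBuild(version string)                             {}
func runSysbench(version, script string) (mean, variance float64) { return 0, 0 }

func main() {
	versions := []string{"main", "0.40.21", "0.40.20"} // commitish list
	scripts := []string{"oltp_point_select", "oltp_read_only"}

	fmt.Println("version,script,mean_ms,variance_ms")
	for _, v := range versions {
		checkoutAndBuild(v) // build the dolt binary for this commitish
		for _, s := range scripts {
			mean, variance := runSysbench(v, s) // deliberately short event count
			fmt.Printf("%s,%s,%.2f,%.2f\n", v, s, mean, variance)
		}
	}
}
```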
Sample read scripts output for main, 0.40.21, and 0.40.20, each taking ~2 minutes to run: