Currently, when we update the benchmarks, we have to review 48+ autogenerated files for unexpected changes.
Instead, we could have a script print out:
- any new or deleted lines
- any lines whose values changed by more than ±10%
This would help reviewers focus on significant changes.
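
A minimal sketch of what such a script could look like. The benchmark file format is not specified here, so this assumes each line has the form `<name> <number>`. The function names `parse_benchmarks` and `summarize_changes` and the `NEW`/`DELETED`/`CHANGED` labels are hypothetical, chosen only for illustration.

```python
from __future__ import annotations


def parse_benchmarks(text: str) -> dict[str, float]:
    """Parse lines of the ASSUMED form '<name> <number>' into a dict."""
    results: dict[str, float] = {}
    for line in text.splitlines():
        # Split off the last whitespace-separated token as the value.
        parts = line.rsplit(None, 1)
        if len(parts) != 2:
            continue
        name, raw = parts
        try:
            results[name.strip()] = float(raw)
        except ValueError:
            continue  # non-numeric line; ignored in this sketch
    return results


def summarize_changes(old_text: str, new_text: str, threshold: float = 0.10) -> list[str]:
    """Report new lines, deleted lines, and values that moved by more than threshold."""
    old = parse_benchmarks(old_text)
    new = parse_benchmarks(new_text)
    report: list[str] = []

    for name in sorted(new.keys() - old.keys()):
        report.append(f"NEW      {name}: {new[name]}")
    for name in sorted(old.keys() - new.keys()):
        report.append(f"DELETED  {name}: {old[name]}")

    for name in sorted(old.keys() & new.keys()):
        before, after = old[name], new[name]
        if before == 0:
            # Relative change is undefined; report any move away from zero.
            if after != 0:
                report.append(f"CHANGED  {name}: {before} -> {after}")
            continue
        change = (after - before) / abs(before)
        if abs(change) > threshold:
            report.append(f"CHANGED  {name}: {before} -> {after} ({change:+.1%})")
    return report


if __name__ == "__main__":
    old = "parse 100\nrender 200\nold_case 5\n"
    new = "parse 105\nrender 250\nnew_case 7\n"
    print("\n".join(summarize_changes(old, new)))
```

In this example `parse` moves by 5% and is left out of the report, while `render` moves by 25% and is listed. Running the comparison once per autogenerated file would give reviewers a short summary to read before opening the full diffs.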