Optimize predictors 0-9 in lossless_transform
#152
Rewrite predictor transform 1 so that it is auto-vectorized. Add an average function that computes the average with bitwise operations. LLVM is capable of doing this optimization, but it is not able to perform it automatically for every transform we have. Add asserts to remove some saturating arithmetic instructions. Predictors 10-13 were much slower when rewritten following this pattern, so they are left as they were.
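As a minimal sketch of the bitwise average described above: since a + b = 2 * (a & b) + (a ^ b), the floor average can be computed without widening to a larger integer type. The function names below are hypothetical and are not taken from this PR's diff.

```rust
/// Floor average of two bytes without widening to u16.
/// Relies on the identity a + b == 2 * (a & b) + (a ^ b), so the
/// result never exceeds 255 and the addition cannot overflow.
/// (Hypothetical name, for illustration only.)
fn average2_bitwise(a: u8, b: u8) -> u8 {
    (a & b) + ((a ^ b) >> 1)
}

/// Reference version: widen, add, halve, narrow.
fn average2_widening(a: u8, b: u8) -> u8 {
    ((u16::from(a) + u16::from(b)) / 2) as u8
}

/// Per-channel average of two RGBA pixels.
fn average2_pixel(a: [u8; 4], b: [u8; 4]) -> [u8; 4] {
    let mut out = [0u8; 4];
    for c in 0..4 {
        out[c] = average2_bitwise(a[c], b[c]);
    }
    out
}
```

Both versions return the same value for every pair of bytes; whether the compiler emits better code for one or the other depends on the surrounding loop and the compiler version, so the generated assembly is worth checking.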
I suspect this PR is working around this regression: rust-lang/rust#142519
Interesting 🤔 I'm working on a longer analysis of the major improvements to post here, but predictor 1 doesn't appear to have ever auto-vectorized (with the chunk form)?
Background context and reasoning for the previous predictor optimizations
Reproducing the chart from #94,
1, 2, 7, 11, 12, and 13 are all over 5% for that corpus.
Analysis of changes in this PR
I've added an assert for checking the
Predictor 0
https://rust.godbolt.org/z/qarPb7aEo
I think this one is noisy but a slight improvement.
Predictor 1
before vs. after
Predictor 5
https://rust.godbolt.org/z/MPP7qx98z
Auto-vectorization from calculating the average using
Predictor 6
https://rust.godbolt.org/z/895n6hE8o
This is the only one I'm somewhat hesitant about.
Even using the
Predictor 7
https://rust.godbolt.org/z/89je9nsfe
Auto-vectorization from the bitwise-ops average.
Predictors 10-13 already auto-vectorize and use the bitwise trick for
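To illustrate how an up-front assert and the bitwise average might combine in a predictor that averages the left and top pixels, here is a rough sketch for a single row. The function names, the one-row signature, and the specific conditions asserted are assumptions made for this example; they are not the code or the asserts from this PR.

```rust
/// Floor average of two bytes using bitwise ops and one add.
#[inline]
fn avg2(a: u8, b: u8) -> u8 {
    (a & b) + ((a ^ b) >> 1)
}

/// Hypothetical inverse of an "average of left and top" predictor
/// for one row of RGBA bytes.
/// On entry `row[4..]` holds residuals; on exit it holds decoded pixels.
/// `row[..4]` is the already-decoded pixel to the left of the first residual.
/// `top` is the decoded row above, aligned with `row[4..]`.
fn apply_predictor_7_row(row: &mut [u8], top: &[u8]) {
    // State the invariants once, before the loop, so that the optimizer
    // can rely on them instead of re-checking inside the loop body.
    assert!(row.len() >= 4);
    assert!((row.len() - 4) % 4 == 0);
    assert_eq!(top.len(), row.len() - 4);

    let (first, rest) = row.split_at_mut(4);
    let mut left = [first[0], first[1], first[2], first[3]];
    for (px, t) in rest.chunks_exact_mut(4).zip(top.chunks_exact(4)) {
        for c in 0..4 {
            // Decoded value = residual + prediction, modulo 256.
            px[c] = px[c].wrapping_add(avg2(left[c], t[c]));
            left[c] = px[c];
        }
    }
}
```

Each pixel depends on the decoded pixel to its left, so any vectorization here is across the four channels rather than across pixels; the Compiler Explorer links above are the place to confirm what the compiler actually emits.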
Not helpful until MSRV is 1.88, but it looks like that regression can be worked around with Feel free to share that in the png or Rust issues if you think it's relevant. |
I see a ~1% end-to-end speedup from this change. At this point, roughly 20% of total decode time (of lossless images) is spent in the transforms. The rough breakdown is:
Makes sense that those predictors show up. Not sure how much more we can tease out without dipping into intrinsics. |
average2 calls which are left unchanged, but it's not able to automatically perform it for every transform we have. Predictors 10-13 were much slower when rewritten following this pattern.
Related
#95, #96
Rust 1.82 (2024-10-17) upgraded to LLVM 19, which auto-vectorizes the old while constructions. This allowed for rewriting predictor 1 to be 3x faster.
1, 5, 6, and 7 are the decisive improvements in the benchmarks.
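For readers unfamiliar with predictor 1, which predicts each pixel from the pixel to its left, the following is a minimal sketch of its inverse written in a chunked style. The function name and the single-row signature are hypothetical, and this is not the code from the PR; it only shows the shape of the computation being discussed.

```rust
/// Hypothetical inverse of predictor 1 for one row of RGBA bytes.
/// `row[..4]` is an already-decoded pixel; every later 4-byte chunk
/// holds residuals on entry and decoded values on exit.
fn apply_predictor_1_row(row: &mut [u8]) {
    assert!(row.len() >= 4 && row.len() % 4 == 0);

    let (first, rest) = row.split_at_mut(4);
    let mut prev = [first[0], first[1], first[2], first[3]];
    for px in rest.chunks_exact_mut(4) {
        // Per-channel running sum, modulo 256.
        prev = [
            px[0].wrapping_add(prev[0]),
            px[1].wrapping_add(prev[1]),
            px[2].wrapping_add(prev[2]),
            px[3].wrapping_add(prev[3]),
        ];
        px.copy_from_slice(&prev);
    }
}
```

Because each output feeds the next one, this is a per-channel prefix sum; whether a particular rustc and LLVM version vectorizes it should be confirmed by inspecting the assembly rather than assumed.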