
Optimize predictors 0-9 in lossless_transform #152


Merged
merged 1 commit into image-rs:main on Jul 12, 2025

Conversation


@okaneco okaneco commented Jul 11, 2025

  • Rewrite predictor transform 1 to be auto-vectorized.
  • Add an average function which performs average with bitwise operations.
    • LLVM is capable of doing this optimization and already does it for the average2 calls, which are left unchanged, but it's not able to perform it automatically for every transform we have.
  • Add asserts to remove some saturating arithmetic instructions.

Predictors 10-13 were much slower when rewritten following this pattern.

Related
#95, #96


Rust 1.82 (2024-10-17) upgraded to LLVM 19, which auto-vectorizes the old while-loop constructions. This allowed predictor 1 to be rewritten to run 3x faster.
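The chunked form of predictor 1 (add the left pixel) can be sketched roughly as below. This is a hypothetical illustration, not the crate's actual code; the function name and the zeroed left border are assumptions. The point is that operating on whole 4-byte pixels, rather than a bytewise while loop, gives LLVM 19 per-channel wrapping adds it can vectorize.

```rust
// Hypothetical sketch of the chunked "add left pixel" predictor 1.
// Each pixel's channels add the corresponding channels of the pixel
// to its left, with wrapping arithmetic.
fn apply_predictor_1(row: &mut [u8]) {
    let mut prev = [0u8; 4]; // assumes a zeroed left border for this sketch
    for pixel in row.chunks_exact_mut(4) {
        for (p, l) in pixel.iter_mut().zip(prev) {
            *p = p.wrapping_add(l);
        }
        prev.copy_from_slice(pixel);
    }
}

fn main() {
    let mut row = vec![1u8, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1];
    apply_predictor_1(&mut row);
    println!("{:?}", row); // each pixel accumulates the one to its left
}
```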

1, 5, 6, and 7 are the decisive improvements in the benchmarks.

before


test lossless_transform::benches::predictor00     ... bench:          98.81 ns/iter (+/- 2.49) = 10408 MB/s
test lossless_transform::benches::predictor01     ... bench:         308.75 ns/iter (+/- 19.58) = 3311 MB/s
test lossless_transform::benches::predictor02     ... bench:          42.42 ns/iter (+/- 1.10) = 24285 MB/s
test lossless_transform::benches::predictor03     ... bench:          41.83 ns/iter (+/- 1.23) = 24878 MB/s
test lossless_transform::benches::predictor04     ... bench:          39.81 ns/iter (+/- 3.22) = 26153 MB/s
test lossless_transform::benches::predictor05     ... bench:         711.89 ns/iter (+/- 17.23) = 1434 MB/s
test lossless_transform::benches::predictor06     ... bench:       1,519.71 ns/iter (+/- 50.47) = 671 MB/s
test lossless_transform::benches::predictor07     ... bench:         518.31 ns/iter (+/- 4.46) = 1969 MB/s
test lossless_transform::benches::predictor08     ... bench:          70.32 ns/iter (+/- 8.86) = 14571 MB/s
test lossless_transform::benches::predictor09     ... bench:          65.92 ns/iter (+/- 1.98) = 15692 MB/s

after


test lossless_transform::benches::predictor00     ... bench:          91.49 ns/iter (+/- 1.01) = 11208 MB/s
test lossless_transform::benches::predictor01     ... bench:          90.69 ns/iter (+/- 2.70) = 11333 MB/s
test lossless_transform::benches::predictor02     ... bench:          36.57 ns/iter (+/- 1.10) = 28333 MB/s
test lossless_transform::benches::predictor03     ... bench:          34.14 ns/iter (+/- 5.56) = 30000 MB/s
test lossless_transform::benches::predictor04     ... bench:          35.98 ns/iter (+/- 2.84) = 29142 MB/s
test lossless_transform::benches::predictor05     ... bench:         466.55 ns/iter (+/- 29.46) = 2188 MB/s
test lossless_transform::benches::predictor06     ... bench:         717.89 ns/iter (+/- 48.46) = 1422 MB/s
test lossless_transform::benches::predictor07     ... bench:         268.98 ns/iter (+/- 9.90) = 3805 MB/s
test lossless_transform::benches::predictor08     ... bench:          63.66 ns/iter (+/- 1.51) = 16190 MB/s
test lossless_transform::benches::predictor09     ... bench:          61.02 ns/iter (+/- 7.13) = 16721 MB/s

@fintelia

I suspect this PR is working around this regression: rust-lang/rust#142519


okaneco commented Jul 11, 2025

Interesting 🤔

I'm working on a longer analysis of the major improvements to post here, but for predictor 1, it doesn't appear that it ever auto-vectorized (in the chunk form)?
https://rust.godbolt.org/z/T88Kq66cG


okaneco commented Jul 12, 2025

Background context and reasoning for the previous predictor optimizations

Reproducing the chart from #94,

Commonly used predictors by libwebp (determined by the number of pixels using each predictor across a corpus encoded by cwebp):

predictor[0]: 1.5%
predictor[1]: 31.1%
predictor[2]: 10.7%
predictor[3]: 2.8%
predictor[4]: 1.8%
predictor[5]: 3.0%
predictor[6]: 0.8%
predictor[7]: 8.1%
predictor[8]: 0.5%
predictor[9]: 0.5%
predictor[10]: 1.9%
predictor[11]: 19.1%
predictor[12]: 12.0%
predictor[13]: 6.2%

1, 2, 7, 11, 12, and 13 are all over 5% for that corpus.
This PR improves upon 1 and 7.

Analysis of changes in this PR

I've added an assert for checking the range.end against the image data length to remove some saturating arithmetic instructions. This seems to slightly improve the benchmarks and lower noise on the otherwise-unchanged predictors.
Not sure if it's worth it or if we can remove the other saturating parts easily (the cmovb/a/ae sequences in the preludes).
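The assert pattern described above can be sketched as follows. The names `image_data` and `range` mirror the discussion, but the function and its loop body are illustrative stand-ins, not code from this repository.

```rust
use std::ops::Range;

// Asserting range.end against the slice length up front proves to the
// compiler that every index in the loop is in bounds, letting it drop
// the per-iteration bounds checks and the saturating cmov sequences
// in the loop prelude.
fn transform_range(image_data: &mut [u8], range: Range<usize>) {
    assert!(range.end <= image_data.len());
    for i in range {
        // indexing here no longer needs its own bounds check
        image_data[i] = image_data[i].wrapping_mul(2);
    }
}

fn main() {
    let mut data = vec![1u8, 2, 3, 4];
    transform_range(&mut data, 1..3);
    println!("{:?}", data);
}
```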

Predictor 0

https://rust.godbolt.org/z/qarPb7aEo

I think this one is noisy but a slight improvement.
It removes the branchless ops in the prelude and a bounds check in the loop body.

Predictor 1

before vs. after
https://rust.godbolt.org/z/T88Kq66cG
Without vs. with assert for checking range.end
https://rust.godbolt.org/z/Ehzca9brj

The "before" version was already an improvement over the previous while-loop because it operated on 4 bytes per iteration, whereas the old while-loop (prior to 1.82) was bytewise.
The assert link shows the difference in some saturating ops for the loop prelude.

Predictor 5

https://rust.godbolt.org/z/MPP7qx98z

Auto-vectorization from calculating the average using & and ^.
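The bitwise average is a standard identity; whether this matches the PR's exact `average2` signature is an assumption, but the arithmetic is what the discussion describes:

```rust
// floor((a + b) / 2) without widening to u16: the bits a and b share
// (a & b) contribute fully, the bits where they differ (a ^ b)
// contribute half. LLVM can lower this form to packed SIMD averages.
fn average2(a: u8, b: u8) -> u8 {
    (a & b) + ((a ^ b) >> 1)
}

fn main() {
    println!("avg(3, 5) = {}", average2(3, 5));
}
```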

Predictor 6

https://rust.godbolt.org/z/895n6hE8o

This is the only one I'm somewhat hesitant about.
It has a very long prelude, similar to the catastrophic failures seen when trying to rewrite 10-13, but the benchmark came out consistently faster. If we could chop down the prelude, it would probably be even faster, especially for shorter runs. It accounts for only 0.8% of predictors in the chart above.

test lossless_transform::benches::predictor06     ... bench:       1,519.71 ns/iter (+/- 50.47) = 671 MB/s
test lossless_transform::benches::predictor06     ... bench:         717.89 ns/iter (+/- 48.46) = 1422 MB/s

Even using the average2_autovec couldn't coax the current implementation into auto-vectorizing.

Predictor 7

https://rust.godbolt.org/z/89je9nsfe

Auto-vectorization from the bitwise-ops average.


Predictors 10-13 already auto-vectorize and use the bitwise trick for average2 if they call the function. I don't know why LLVM doesn't recognize the idiom for the others. The same thing happened with the png average function that we optimized.


okaneco commented Jul 12, 2025

Not helpful until MSRV is 1.88, but it looks like that regression can be worked around with as_chunks, which even unrolls the loop 4 times.
https://godbolt.org/z/sPjG4zWqK

Feel free to share that in the png or Rust issues if you think it's relevant.

@fintelia fintelia merged commit 55f7fff into image-rs:main Jul 12, 2025
7 checks passed
@fintelia

I see a ~1% end-to-end speedup from this change. At this point, roughly 20% of total decode time (of lossless images) is spent in the transforms. The rough breakdown is:

predictor11: 8.8% 
predictor12: 4.1%
predictor13: 1.4%
predictor1:  1.1%
predictor7:  0.8%
all others:  2.6%

@okaneco okaneco deleted the lossless_predictors branch July 12, 2025 21:48

okaneco commented Jul 12, 2025

Makes sense that those predictors show up.
Those are the more involved ones that have to unpack to i16 and two of them do the clamp_add_subtract_full/half.

Not sure how much more we can tease out without dipping into intrinsics.
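For reference, the clamp-add-subtract step those predictors rely on can be sketched per channel as below (a minimal version following the WebP lossless format's definition; the real code operates on unpacked i16 channel values):

```rust
// clamp_add_subtract_full per the WebP lossless format: compute
// a + b - c in i16, then clamp back into the u8 range. The forced
// widening to i16 is part of why predictors 10-13 resist the
// auto-vectorization rewrites that worked for 1, 5, 6, and 7.
fn clamp_add_subtract_full(a: u8, b: u8, c: u8) -> u8 {
    (a as i16 + b as i16 - c as i16).clamp(0, 255) as u8
}

fn main() {
    println!("{}", clamp_add_subtract_full(200, 100, 10)); // clamps high
}
```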
