
Optimize predictors 0-9 in lossless_transform #152


Merged
merged 1 commit into image-rs:main on Jul 12, 2025

Conversation


@okaneco okaneco commented Jul 11, 2025

  • Rewrite predictor transform 1 to be auto-vectorized.
  • Add an average function which performs average with bitwise operations.
    • LLVM is capable of doing this optimization and already does it for the average2 calls, which are left unchanged, but it's not able to perform it automatically for every transform we have.
  • Add asserts to remove some saturating arithmetic instructions.

Predictors 10-13 were much slower when rewritten following this pattern.

Related
#95, #96


Rust 1.82 (2024-10-17) upgraded to LLVM 19, which auto-vectorizes the old while-loop constructions. This allowed predictor 1 to be rewritten to run 3x faster.
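The chunked form of predictor 1 (add the left pixel) can be sketched roughly as below. This is a hypothetical illustration, not the crate's actual code; the function name and the zeroed left border are assumptions. The point is that operating on whole 4-byte pixels, rather than a bytewise while loop, gives LLVM 19 per-channel wrapping adds it can vectorize.

```rust
// Hypothetical sketch of the chunked "add left pixel" predictor 1.
// Each pixel's channels add the corresponding channels of the pixel
// to its left, with wrapping arithmetic.
fn apply_predictor_1(row: &mut [u8]) {
    let mut prev = [0u8; 4]; // assumes a zeroed left border for this sketch
    for pixel in row.chunks_exact_mut(4) {
        for (p, l) in pixel.iter_mut().zip(prev) {
            *p = p.wrapping_add(l);
        }
        prev.copy_from_slice(pixel);
    }
}

fn main() {
    let mut row = vec![1u8, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1];
    apply_predictor_1(&mut row);
    println!("{:?}", row); // each pixel accumulates the one to its left
}
```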

1, 5, 6, and 7 are the decisive improvements in the benchmarks.

before


test lossless_transform::benches::predictor00     ... bench:          98.81 ns/iter (+/- 2.49) = 10408 MB/s
test lossless_transform::benches::predictor01     ... bench:         308.75 ns/iter (+/- 19.58) = 3311 MB/s
test lossless_transform::benches::predictor02     ... bench:          42.42 ns/iter (+/- 1.10) = 24285 MB/s
test lossless_transform::benches::predictor03     ... bench:          41.83 ns/iter (+/- 1.23) = 24878 MB/s
test lossless_transform::benches::predictor04     ... bench:          39.81 ns/iter (+/- 3.22) = 26153 MB/s
test lossless_transform::benches::predictor05     ... bench:         711.89 ns/iter (+/- 17.23) = 1434 MB/s
test lossless_transform::benches::predictor06     ... bench:       1,519.71 ns/iter (+/- 50.47) = 671 MB/s
test lossless_transform::benches::predictor07     ... bench:         518.31 ns/iter (+/- 4.46) = 1969 MB/s
test lossless_transform::benches::predictor08     ... bench:          70.32 ns/iter (+/- 8.86) = 14571 MB/s
test lossless_transform::benches::predictor09     ... bench:          65.92 ns/iter (+/- 1.98) = 15692 MB/s

after


test lossless_transform::benches::predictor00     ... bench:          91.49 ns/iter (+/- 1.01) = 11208 MB/s
test lossless_transform::benches::predictor01     ... bench:          90.69 ns/iter (+/- 2.70) = 11333 MB/s
test lossless_transform::benches::predictor02     ... bench:          36.57 ns/iter (+/- 1.10) = 28333 MB/s
test lossless_transform::benches::predictor03     ... bench:          34.14 ns/iter (+/- 5.56) = 30000 MB/s
test lossless_transform::benches::predictor04     ... bench:          35.98 ns/iter (+/- 2.84) = 29142 MB/s
test lossless_transform::benches::predictor05     ... bench:         466.55 ns/iter (+/- 29.46) = 2188 MB/s
test lossless_transform::benches::predictor06     ... bench:         717.89 ns/iter (+/- 48.46) = 1422 MB/s
test lossless_transform::benches::predictor07     ... bench:         268.98 ns/iter (+/- 9.90) = 3805 MB/s
test lossless_transform::benches::predictor08     ... bench:          63.66 ns/iter (+/- 1.51) = 16190 MB/s
test lossless_transform::benches::predictor09     ... bench:          61.02 ns/iter (+/- 7.13) = 16721 MB/s

@fintelia

I suspect this PR is working around this regression: rust-lang/rust#142519


okaneco commented Jul 11, 2025

Interesting 🤔

I'm working on a longer analysis of the major improvements to post here, but for predictor 1, it doesn't appear that it ever auto-vectorized (in the chunk form)?
https://rust.godbolt.org/z/T88Kq66cG


okaneco commented Jul 12, 2025

Background context and reasoning for the previous predictor optimizations

Reproducing the chart from #94,

Commonly used predictors by libwebp (determined by the number of pixels using each predictor across a corpus encoded by cwebp):

predictor[0]: 1.5%
predictor[1]: 31.1%
predictor[2]: 10.7%
predictor[3]: 2.8%
predictor[4]: 1.8%
predictor[5]: 3.0%
predictor[6]: 0.8%
predictor[7]: 8.1%
predictor[8]: 0.5%
predictor[9]: 0.5%
predictor[10]: 1.9%
predictor[11]: 19.1%
predictor[12]: 12.0%
predictor[13]: 6.2%

1, 2, 7, 11, 12, and 13 are all over 5% for that corpus.
This PR improves upon 1 and 7.

Analysis of changes in this PR

I've added an assert for checking the range.end against the image data length to remove some saturating arithmetic instructions. This seems to slightly improve the benchmarks and lower noise on the otherwise-unchanged predictors.
Not sure if it's worth it or if we can remove the other saturating parts easily (the cmovb/a/ae sequences in the preludes).
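The assert pattern described above can be sketched as follows. The names `image_data` and `range` mirror the discussion, but the function and its loop body are illustrative stand-ins, not code from this repository.

```rust
use std::ops::Range;

// Asserting range.end against the slice length up front proves to the
// compiler that every index in the loop is in bounds, letting it drop
// the per-iteration bounds checks and the saturating cmov sequences
// in the loop prelude.
fn transform_range(image_data: &mut [u8], range: Range<usize>) {
    assert!(range.end <= image_data.len());
    for i in range {
        // indexing here no longer needs its own bounds check
        image_data[i] = image_data[i].wrapping_mul(2);
    }
}

fn main() {
    let mut data = vec![1u8, 2, 3, 4];
    transform_range(&mut data, 1..3);
    println!("{:?}", data);
}
```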

Predictor 0

https://rust.godbolt.org/z/qarPb7aEo

I think this one is noisy but a slight improvement.
It removes the branchless ops in the prelude and a bounds check in the loop body.

Predictor 1

before vs. after
https://rust.godbolt.org/z/T88Kq66cG
Without vs. with assert for checking range.end
https://rust.godbolt.org/z/Ehzca9brj

The "before" version was already an improvement over the previous while-loop because it operated on 4 bytes per iteration, whereas the old while-loop (prior to 1.82) was bytewise.
The assert link shows the difference in some saturating ops for the loop prelude.

Predictor 5

https://rust.godbolt.org/z/MPP7qx98z

Auto-vectorization from calculating the average using & and ^.
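The bitwise average is a standard identity; whether this matches the PR's exact `average2` signature is an assumption, but the arithmetic is what the discussion describes:

```rust
// floor((a + b) / 2) without widening to u16: the bits a and b share
// (a & b) contribute fully, the bits where they differ (a ^ b)
// contribute half. LLVM can lower this form to packed SIMD averages.
fn average2(a: u8, b: u8) -> u8 {
    (a & b) + ((a ^ b) >> 1)
}

fn main() {
    println!("avg(3, 5) = {}", average2(3, 5));
}
```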

Predictor 6

https://rust.godbolt.org/z/895n6hE8o

This is the only one I'm somewhat hesitant about.
It has a very long prelude, similar to the catastrophic failures seen when trying to rewrite 10-13, but the benchmark came out consistently faster. If we could chop down the prelude, it would probably be even faster, especially for shorter runs. It accounts for only 0.8% of predictors in the chart above.

test lossless_transform::benches::predictor06     ... bench:       1,519.71 ns/iter (+/- 50.47) = 671 MB/s
test lossless_transform::benches::predictor06     ... bench:         717.89 ns/iter (+/- 48.46) = 1422 MB/s

Even using the average2_autovec couldn't coax the current implementation into auto-vectorizing.

Predictor 7

https://rust.godbolt.org/z/89je9nsfe

Auto-vectorization from the bitwise-ops average.


Predictors 10-13 already auto-vectorize and use the bitwise trick for average2 if they call the function. I don't know why LLVM doesn't recognize the idiom for the others. The same thing happened with the png average function that we optimized.


okaneco commented Jul 12, 2025

Not helpful until MSRV is 1.88, but it looks like that regression can be worked around with as_chunks, which even unrolls the loop 4 times.
https://godbolt.org/z/sPjG4zWqK

Feel free to share that in the png or Rust issues if you think it's relevant.

@fintelia fintelia merged commit 55f7fff into image-rs:main Jul 12, 2025
7 checks passed
@fintelia

I see a ~1% end-to-end speedup from this change. At this point, roughly 20% of total decode time (of lossless images) is spent in the transforms. The rough breakdown is:

predictor11: 8.8% 
predictor12: 4.1%
predictor13: 1.4%
predictor1:  1.1%
predictor7:  0.8%
all others:  2.6%

@okaneco okaneco deleted the lossless_predictors branch July 12, 2025 21:48

okaneco commented Jul 12, 2025

Makes sense that those predictors show up.
Those are the more involved ones that have to unpack to i16 and two of them do the clamp_add_subtract_full/half.

Not sure how much more we can tease out without dipping into intrinsics.
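For reference, the clamp-add-subtract step those predictors rely on can be sketched per channel as below (a minimal version following the WebP lossless format's definition; the real code operates on unpacked i16 channel values):

```rust
// clamp_add_subtract_full per the WebP lossless format: compute
// a + b - c in i16, then clamp back into the u8 range. The forced
// widening to i16 is part of why predictors 10-13 resist the
// auto-vectorization rewrites that worked for 1, 5, 6, and 7.
fn clamp_add_subtract_full(a: u8, b: u8, c: u8) -> u8 {
    (a as i16 + b as i16 - c as i16).clamp(0, 255) as u8
}

fn main() {
    println!("{}", clamp_add_subtract_full(200, 100, 10)); // clamps high
}
```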
