-
Notifications
You must be signed in to change notification settings - Fork 240
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Cheaper compute_lowest_and_highest_weight #516
Conversation
The actual weights array is constant, so we can compute the min/max weights in a single pass ahead of time. The remapping we apply to the weights depending on rcp_stepsize and offset is monotonic; therefore we can figure out minidx and maxidx in advance from the min/max weights we determined previously. This in turn means that we don't need to do the "reset" handling in the loop where we zero the accumulator whenever the current min/max changes. The amount of code in the inner loop nest is considerably reduced as a result, and the extra preprocessing outside that loop is generally less than the work it replaces. The exact impact depends on the block size and max_angular_steps. Testing with RGB textures and -thorough quality, I observe approximately 0.7% coding time reduction with 4x4 blocks; 6x6 and larger see a 1.2% reduction. At -fast quality, there is not much gain, but neither is it any slower. In my tests, this change does not alter encoding results (nor should it); the output is bitwise identical to before. Signed-off-by: Fabian Giesen <[email protected]>
That's a nice one. Intel i5 AVX2 performance:
|
Two follow on thoughts that this has triggered, which I'll investigate ...
|
For 1., haven't looked at where the weights come from yet, possibly, however: 2. yes, quite cheaply. The idea would be to load1 the final weight and then store it back where we read it from. This needs nlanes-1 extra padding after the end of the weights array. Padding with extra copies of one of the weights changes neither the min nor max. Or writing this down now, an alternative would be to read the entire final aligned group, select() against a load1-ed version of the final weight to put the cloned weights past the end of the real weights, and do an aligned store back. A few more instructions but everything aligned and no data layout changes. I can try that later. |
Hmm, the padding at source isn't totally trivial, at least not without adding more work than it saves. The storage is already padded and filled with zeros in We can easily change that to fill the weight tail with the last weight value, but we also need to replicate the last Adding masking to either zero or replicate the last value are more expensive than just having the scalar tail as in the PR. |
Why? We only need the non-zero-padded weights right around the min/max loop. We can put back the original zero padding immediately after. |
True. I was trying to do it once at source and avoid the two-step later. Given that the min/max scan loop is only run once, it actually seems fine to just use a predicated scan. Testing here and this doesn't seem measurably slower on an i5.
|
The actual weights array is constant, so we can compute the min/max weights in a single pass ahead of time. The remapping we apply to the weights depending on rcp_stepsize and offset is monotonic; therefore we can figure out minidx and maxidx in advance from the min/max weights we determined previously.
This in turn means that we don't need to do the "reset" handling in the loop where we zero the accumulator whenever the current min/max changes. The amount of code in the inner loop nest is considerably reduced as a result, and the extra preprocessing outside that loop is generally less than the work it replaces.
The exact impact depends on the block size and max_angular_steps. Testing with RGB textures and -thorough quality, I observe approximately 0.7% coding time reduction with 4x4 blocks; 6x6 and larger see a 1.2% reduction.
At -fast quality, there is not much gain, but neither is it any slower.
In my tests, this change does not alter encoding results (nor should it); the output is bitwise identical to before.