Cheaper compute_lowest_and_highest_weight #516

rygorous · 2024-11-09T00:41:02Z

The actual weights array is constant, so we can compute the min/max weights in a single pass ahead of time. The remapping we apply to the weights depending on rcp_stepsize and offset is monotonic; therefore we can figure out minidx and maxidx in advance from the min/max weights we determined previously.

This in turn means that we don't need to do the "reset" handling in the loop where we zero the accumulator whenever the current min/max changes. The amount of code in the inner loop nest is considerably reduced as a result, and the extra preprocessing outside that loop is generally less than the work it replaces.

The exact impact depends on the block size and max_angular_steps. Testing with RGB textures and -thorough quality, I observe approximately 0.7% coding time reduction with 4x4 blocks; 6x6 and larger see a 1.2% reduction.

At -fast quality, there is not much gain, but neither is it any slower.

In my tests, this change does not alter encoding results (nor should it); the output is bitwise identical to before.

The actual weights array is constant, so we can compute the min/max weights in a single pass ahead of time. The remapping we apply to the weights depending on rcp_stepsize and offset is monotonic; therefore we can figure out minidx and maxidx in advance from the min/max weights we determined previously. This in turn means that we don't need to do the "reset" handling in the loop where we zero the accumulator whenever the current min/max changes. The amount of code in the inner loop nest is considerably reduced as a result, and the extra preprocessing outside that loop is generally less than the work it replaces. The exact impact depends on the block size and max_angular_steps. Testing with RGB textures and -thorough quality, I observe approximately 0.7% coding time reduction with 4x4 blocks; 6x6 and larger see a 1.2% reduction. At -fast quality, there is not much gain, but neither is it any slower. In my tests, this change does not alter encoding results (nor should it); the output is bitwise identical to before. Signed-off-by: Fabian Giesen <[email protected]>

solidpixel · 2024-11-09T14:03:39Z

That's a nice one.

Intel i5 AVX2 performance:

4x4: 1.91 MT/s increases to 1.93 MT/s
6x6: 1.84 MT/s increases to 1.89 MT/s

solidpixel · 2024-11-09T14:13:15Z

Two follow on thoughts that this has triggered, which I'll investigate ...

Can we cache the min/max when we first write the weights array, and avoid the preamble search here?
Failing that, can we pad out the weights array to avoid the preamble loop tail? (I've been trying to avoid loop tails, on the hope I can get around to adding SVE predicated loops at some point).

rygorous · 2024-11-09T18:54:06Z

For 1., haven't looked at where the weights come from yet, possibly, however: 2. yes, quite cheaply. The idea would be to load1 the final weight and then store it back where we read it from. This needs nlanes-1 extra padding after the end of the weights array. Padding with extra copies of one of the weights changes neither the min nor max.

Or writing this down now, an alternative would be to read the entire final aligned group, select() against a load1-ed version of the final weight to put the cloned weights past the end of the real weights, and do an aligned store back. A few more instructions but everything aligned and no data layout changes. I can try that later.

solidpixel · 2024-11-09T21:17:08Z

Hmm, the padding at source isn't totally trivial, at least not without adding more work than it saves.

The storage is already padded and filled with zeros in compute_ideal_colors_and_weights_*().

We can easily change that to fill the weight tail with the last weight value, but we also need to replicate the last weight_error_scale value so that the weight in the tail moves the same as the last real weight under refinement. However, we also rely on that weight_error_scale for the tail being zero so it is not generating values in error accumulation.

Adding masking to either zero or replicate the last value are more expensive than just having the scalar tail as in the PR.

rygorous · 2024-11-09T21:22:45Z

Why? We only need the non-zero-padded weights right around the min/max loop. We can put back the original zero padding immediately after.

solidpixel · 2024-11-09T21:30:21Z

True. I was trying to do it once at source and avoid the two-step later.

Given that the min/max scan loop is only run once, it actually seems fine to just use a predicated scan. Testing here and this doesn't seem measurably slower on an i5.

	vint lane_id = vint::lane_id();
	for (unsigned int i = 0; i < weight_count; i += ASTCENC_SIMD_WIDTH)
	{
		vmask active = lane_id < vint(weight_count);
		lane_id += vint(ASTCENC_SIMD_WIDTH);

		vfloat weights = loada(dec_weight_ideal_value + i);
		min_weight = min(min_weight, select(min_weight, weights, active));
		max_weight = max(max_weight, select(max_weight, weights, active));
	}

rygorous force-pushed the main branch from 8484139 to 60fb573 Compare November 9, 2024 00:56

solidpixel self-requested a review November 9, 2024 14:05

Add to changelog

6b2de52

solidpixel approved these changes Nov 9, 2024

View reviewed changes

solidpixel merged commit 33082c2 into ARM-software:main Nov 9, 2024
7 checks passed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Cheaper compute_lowest_and_highest_weight #516

Cheaper compute_lowest_and_highest_weight #516

rygorous commented Nov 9, 2024

solidpixel commented Nov 9, 2024

solidpixel commented Nov 9, 2024 •

edited

Loading

rygorous commented Nov 9, 2024 •

edited

Loading

solidpixel commented Nov 9, 2024 •

edited

Loading

rygorous commented Nov 9, 2024

solidpixel commented Nov 9, 2024

Cheaper compute_lowest_and_highest_weight #516

Cheaper compute_lowest_and_highest_weight #516

Conversation

rygorous commented Nov 9, 2024

solidpixel commented Nov 9, 2024

solidpixel commented Nov 9, 2024 • edited Loading

rygorous commented Nov 9, 2024 • edited Loading

solidpixel commented Nov 9, 2024 • edited Loading

rygorous commented Nov 9, 2024

solidpixel commented Nov 9, 2024

solidpixel commented Nov 9, 2024 •

edited

Loading

rygorous commented Nov 9, 2024 •

edited

Loading

solidpixel commented Nov 9, 2024 •

edited

Loading