Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Cheaper compute_lowest_and_highest_weight #516

Merged
merged 2 commits into from
Nov 9, 2024

Conversation

rygorous
Copy link
Contributor

@rygorous rygorous commented Nov 9, 2024

The actual weights array is constant, so we can compute the min/max weights in a single pass ahead of time. The remapping we apply to the weights depending on rcp_stepsize and offset is monotonic; therefore we can figure out minidx and maxidx in advance from the min/max weights we determined previously.

This in turn means that we don't need to do the "reset" handling in the loop where we zero the accumulator whenever the current min/max changes. The amount of code in the inner loop nest is considerably reduced as a result, and the extra preprocessing outside that loop is generally less than the work it replaces.

The exact impact depends on the block size and max_angular_steps. Testing with RGB textures and -thorough quality, I observe approximately 0.7% coding time reduction with 4x4 blocks; 6x6 and larger see a 1.2% reduction.

At -fast quality, there is not much gain, but neither is it any slower.

In my tests, this change does not alter encoding results (nor should it); the output is bitwise identical to before.

The actual weights array is constant, so we can compute the
min/max weights in a single pass ahead of time. The remapping
we apply to the weights depending on rcp_stepsize and offset
is monotonic; therefore we can figure out minidx and maxidx in
advance from the min/max weights we determined previously.

This in turn means that we don't need to do the "reset" handling
in the loop where we zero the accumulator whenever the current
min/max changes. The amount of code in the inner loop nest is
considerably reduced as a result, and the extra preprocessing
outside that loop is generally less than the work it replaces.

The exact impact depends on the block size and max_angular_steps.
Testing with RGB textures and -thorough quality, I observe
approximately 0.7% coding time reduction with 4x4 blocks; 6x6
and larger see a 1.2% reduction.

At -fast quality, there is not much gain, but neither is it
any slower.

In my tests, this change does not alter encoding results (nor
should it); the output is bitwise identical to before.

Signed-off-by: Fabian Giesen <[email protected]>
@solidpixel
Copy link
Contributor

That's a nice one.

Intel i5 AVX2 performance:

  • 4x4: 1.91 MT/s increases to 1.93 MT/s
  • 6x6: 1.84 MT/s increases to 1.89 MT/s

@solidpixel solidpixel self-requested a review November 9, 2024 14:05
@solidpixel
Copy link
Contributor

solidpixel commented Nov 9, 2024

Two follow on thoughts that this has triggered, which I'll investigate ...

  1. Can we cache the min/max when we first write the weights array, and avoid the preamble search here?
  2. Failing that, can we pad out the weights array to avoid the preamble loop tail? (I've been trying to avoid loop tails, on the hope I can get around to adding SVE predicated loops at some point).

@solidpixel solidpixel merged commit 33082c2 into ARM-software:main Nov 9, 2024
7 checks passed
@rygorous
Copy link
Contributor Author

rygorous commented Nov 9, 2024

For 1., haven't looked at where the weights come from yet, possibly, however: 2. yes, quite cheaply. The idea would be to load1 the final weight and then store it back where we read it from. This needs nlanes-1 extra padding after the end of the weights array. Padding with extra copies of one of the weights changes neither the min nor max.

Or writing this down now, an alternative would be to read the entire final aligned group, select() against a load1-ed version of the final weight to put the cloned weights past the end of the real weights, and do an aligned store back. A few more instructions but everything aligned and no data layout changes. I can try that later.

@solidpixel
Copy link
Contributor

solidpixel commented Nov 9, 2024

Hmm, the padding at source isn't totally trivial, at least not without adding more work than it saves.

The storage is already padded and filled with zeros in compute_ideal_colors_and_weights_*().

We can easily change that to fill the weight tail with the last weight value, but we also need to replicate the last weight_error_scale value so that the weight in the tail moves the same as the last real weight under refinement. However, we also rely on that weight_error_scale for the tail being zero so it is not generating values in error accumulation.

Adding masking to either zero or replicate the last value are more expensive than just having the scalar tail as in the PR.

@rygorous
Copy link
Contributor Author

rygorous commented Nov 9, 2024

Why? We only need the non-zero-padded weights right around the min/max loop. We can put back the original zero padding immediately after.

@solidpixel
Copy link
Contributor

True. I was trying to do it once at source and avoid the two-step later.

Given that the min/max scan loop is only run once, it actually seems fine to just use a predicated scan. Testing here and this doesn't seem measurably slower on an i5.

	vint lane_id = vint::lane_id();
	for (unsigned int i = 0; i < weight_count; i += ASTCENC_SIMD_WIDTH)
	{
		vmask active = lane_id < vint(weight_count);
		lane_id += vint(ASTCENC_SIMD_WIDTH);

		vfloat weights = loada(dec_weight_ideal_value + i);
		min_weight = min(min_weight, select(min_weight, weights, active));
		max_weight = max(max_weight, select(max_weight, weights, active));
	}

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants