Why does TensorPrimitives performance regress for large spans if the destination is one of the inputs #119695
I ran benchmarks to compare the following hand-vectorized bitwise-or implementation against `TensorPrimitives.BitwiseOr`, writing the result either back into one of the inputs (`X`) or into a separate `Destination` buffer:

```csharp
using System;
using System.Numerics.Tensors;
using System.Runtime.Intrinsics;
using System.Runtime.InteropServices;
using BenchmarkDotNet.Attributes;

namespace Benchmarks;

public class OldBitArrayVsTensorPrimitivesBenchmarks
{
    [Params(16_000, 32_000, 64_000, 128_000)]
    public int BufferLength;

    public int[] X;
    public int[] Y;
    public int[] Destination;

    [GlobalSetup]
    public void GlobalSetup()
    {
        X = new int[BufferLength];
        Y = new int[BufferLength];
        Destination = new int[BufferLength];

        for (int i = 0; i < BufferLength; i++)
        {
            X[i] = 1;
            Y[i] = 2;
        }
    }

    [Benchmark(Baseline = true)]
    public void ClassicBitArrayToX()
    {
        ClassicBitArray(X, Y, X.AsSpan());
    }

    [Benchmark]
    public void ClassicBitArrayToDest()
    {
        ClassicBitArray(X, Y, Destination.AsSpan());
    }

    [Benchmark]
    public void TensorPrimitivesToX()
    {
        TensorPrimitives.BitwiseOr(X, Y, X.AsSpan());
    }

    [Benchmark]
    public void TensorPrimitivesToDest()
    {
        TensorPrimitives.BitwiseOr(X, Y, Destination.AsSpan());
    }

    public static void ClassicBitArray(ReadOnlySpan<int> x, ReadOnlySpan<int> y, Span<int> destination)
    {
        uint count = (uint)x.Length;
        if (x.Length != y.Length || y.Length > destination.Length)
            throw new ArgumentException();

        // Lengths 0-7 are handled by an unrolled scalar fallthrough; longer inputs skip past the switch.
        switch (count)
        {
            case 7: destination[6] = x[6] | y[6]; goto case 6;
            case 6: destination[5] = x[5] | y[5]; goto case 5;
            case 5: destination[4] = x[4] | y[4]; goto case 4;
            case 4: destination[3] = x[3] | y[3]; goto case 3;
            case 3: destination[2] = x[2] | y[2]; goto case 2;
            case 2: destination[1] = x[1] | y[1]; goto case 1;
            case 1: destination[0] = x[0] | y[0]; return;
            case 0: return;
        }

        uint i = 0;
        ref int left = ref MemoryMarshal.GetReference(x);
        ref int right = ref MemoryMarshal.GetReference(y);
        ref int dest = ref MemoryMarshal.GetReference(destination);

        // Vectorize with the widest supported width; any remainder is handled by the scalar loop below.
        if (Vector512.IsHardwareAccelerated && count >= Vector512<int>.Count)
        {
            for (; i < count - (Vector512<int>.Count - 1u); i += (uint)Vector512<int>.Count)
            {
                Vector512<int> result = Vector512.LoadUnsafe(ref left, i) | Vector512.LoadUnsafe(ref right, i);
                result.StoreUnsafe(ref dest, i);
            }
        }
        else if (Vector256.IsHardwareAccelerated && count >= Vector256<int>.Count)
        {
            for (; i < count - (Vector256<int>.Count - 1u); i += (uint)Vector256<int>.Count)
            {
                Vector256<int> result = Vector256.LoadUnsafe(ref left, i) | Vector256.LoadUnsafe(ref right, i);
                result.StoreUnsafe(ref dest, i);
            }
        }
        else if (Vector128.IsHardwareAccelerated && count >= Vector128<int>.Count)
        {
            for (; i < count - (Vector128<int>.Count - 1u); i += (uint)Vector128<int>.Count)
            {
                Vector128<int> result = Vector128.LoadUnsafe(ref left, i) | Vector128.LoadUnsafe(ref right, i);
                result.StoreUnsafe(ref dest, i);
            }
        }

        for (; i < count; i++)
            destination[(int)i] = x[(int)i] | y[(int)i];
    }
}
```

For large spans, the `TensorPrimitives` performance regresses when the destination is one of the inputs. The source code contains a comment about non-temporal stores that could help to explain the phenomenon.
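For completeness, these benchmarks can be driven by a standard BenchmarkDotNet entry point; the `Program` wrapper below is an assumed addition for illustration (not part of the original post), and the project needs to be built in Release to get meaningful numbers.

```csharp
using BenchmarkDotNet.Running;

namespace Benchmarks;

public static class Program
{
    public static void Main()
    {
        // Discovers the [Benchmark] methods above and runs them for every [Params] value.
        BenchmarkRunner.Run<OldBitArrayVsTensorPrimitivesBenchmarks>();
    }
}
```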
I am thankful for all answers and everything new that I am able to learn 🙏
Replies: 1 comment 1 reply
At a certain point (256KB right now, so anything more than 64k `int`/`uint` values) we start using non-temporal stores because it is more beneficial to real world scenarios. However, this comes with a tradeoff in that it can make microbenchmarks look worse. The source code comment explains the reasoning:

```csharp
/// A non-temporal store is one that allows the CPU to bypass the cache when writing to memory.
///
/// This can be beneficial when working with large amounts of memory where the writes would otherwise
/// cause large amounts of repeated updates and evictions. The hardware optimization manuals recommend
/// the threshold to be roughly half the size of the last level of on-die cache -- that is, if you have approximately
/// 4MB of L3 cache per core, you'd want this to be approx. 1-2MB, depending on if hyperthreading was enabled.
///
/// However, actually computing the amount of L3 cache per core can be tricky or error prone. Native memcpy
/// algorithms use a constant threshold that is typically around 256KB and we match that here for simplicity. This
/// threshold accounts for most processors in the last 10-15 years that had approx. 1MB L3 per core and support
/// hyperthreading, giving a per core last level cache of approx. 512KB.
```

The general consideration is that microbenchmarks are effectively doing nothing but running the same code over and over, so they benefit greatly from caching and branch prediction, often giving results that may not line up with real world usage. A typical real world application will be running on a machine alongside many other processes and services at the "same time". It will likely be running other application logic, touching a wider range of data, etc. This means you typically won't get the "optimal" results where the branch predictor is perfectly trained for the loop, and you are more likely to see regular cache misses because other data needs to fit into the cache, causing your data to be evicted.

Non-temporal stores are used (and recommended by the architecture manuals) when you know you're working with data that is roughly larger than 50% of the available L3 cache space per core. This is because you're touching "enough" data that you're going to effectively evict all other data from the cache and pessimize the rest of the system. So the tradeoff is to be a "good" citizen and keep the entire system running smoothly by allowing your own code path to run a little slower (which in effect makes the end-to-end app run faster).
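To connect that threshold to the benchmark sizes above: the buffers hold 4-byte `int` values, so only the largest `[Params]` value clearly exceeds 256KB. A quick sketch of the arithmetic, assuming the 256KB figure from the comment as the cutoff (the exact comparison TensorPrimitives performs internally may differ):

```csharp
using System;

const long NonTemporalThresholdBytes = 256 * 1024; // 262,144 bytes, per the comment above

foreach (int bufferLength in new[] { 16_000, 32_000, 64_000, 128_000 })
{
    long bytes = (long)bufferLength * sizeof(int); // 4 bytes per int element
    Console.WriteLine($"{bufferLength,7:N0} ints = {bytes,9:N0} bytes -> over threshold: {bytes > NonTemporalThresholdBytes}");
}
// Only the 128,000-element case (512,000 bytes) is clearly past the ~256KB cutoff,
// which matches the observation that the regression shows up for the large spans.
```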
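As a standalone illustration of what "bypassing the cache when writing to memory" looks like at the instruction level, here is a minimal sketch using the public x86 intrinsics. This is not how TensorPrimitives issues its non-temporal stores internally; it is only meant to show the kind of store the comment describes. It assumes an SSE2-capable machine and a project with `AllowUnsafeBlocks` enabled, and falls back to a normal copy otherwise.

```csharp
using System;
using System.Runtime.InteropServices;
using System.Runtime.Intrinsics;
using System.Runtime.Intrinsics.X86;

unsafe
{
    const int Count = 1024; // multiple of Vector128<int>.Count (4)
    nuint byteCount = (nuint)(Count * sizeof(int));

    // MOVNTDQ, the non-temporal store used below, requires a 16-byte aligned destination.
    int* source = (int*)NativeMemory.AlignedAlloc(byteCount, 16);
    int* dest = (int*)NativeMemory.AlignedAlloc(byteCount, 16);

    for (int i = 0; i < Count; i++)
        source[i] = i;

    if (Sse2.IsSupported)
    {
        for (int i = 0; i < Count; i += Vector128<int>.Count)
        {
            Vector128<int> v = Sse2.LoadAlignedVector128(source + i);
            Sse2.StoreAlignedNonTemporal(dest + i, v); // write to memory, bypassing the cache hierarchy
        }
        Sse.StoreFence(); // SFENCE: make the non-temporal writes visible before reading them back
    }
    else
    {
        new ReadOnlySpan<int>(source, Count).CopyTo(new Span<int>(dest, Count));
    }

    Console.WriteLine(dest[Count - 1]); // 1023

    NativeMemory.AlignedFree(source);
    NativeMemory.AlignedFree(dest);
}
```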