Why does TensorPrimitives performance regress for large spans if the destination is one of the inputs #119695
I ran benchmarks to compare the following hand-vectorized bitwise-or implementation against `TensorPrimitives.BitwiseOr`, writing the result either back into one of the inputs (`X`) or into a separate `Destination` buffer:

```csharp
using System;
using System.Numerics.Tensors;
using System.Runtime.Intrinsics;
using System.Runtime.InteropServices;
using BenchmarkDotNet.Attributes;

namespace Benchmarks;

public class OldBitArrayVsTensorPrimitivesBenchmarks
{
    [Params(16_000, 32_000, 64_000, 128_000)]
    public int BufferLength;

    public int[] X;
    public int[] Y;
    public int[] Destination;

    [GlobalSetup]
    public void GlobalSetup()
    {
        X = new int[BufferLength];
        Y = new int[BufferLength];
        Destination = new int[BufferLength];

        for (int i = 0; i < BufferLength; i++)
        {
            X[i] = 1;
            Y[i] = 2;
        }
    }

    [Benchmark(Baseline = true)]
    public void ClassicBitArrayToX()
    {
        ClassicBitArray(X, Y, X.AsSpan());
    }

    [Benchmark]
    public void ClassicBitArrayToDest()
    {
        ClassicBitArray(X, Y, Destination.AsSpan());
    }

    [Benchmark]
    public void TensorPrimitivesToX()
    {
        TensorPrimitives.BitwiseOr(X, Y, X.AsSpan());
    }

    [Benchmark]
    public void TensorPrimitivesToDest()
    {
        TensorPrimitives.BitwiseOr(X, Y, Destination.AsSpan());
    }

    public static void ClassicBitArray(ReadOnlySpan<int> x, ReadOnlySpan<int> y, Span<int> destination)
    {
        uint count = (uint)x.Length;
        if (x.Length != y.Length || y.Length > destination.Length)
            throw new ArgumentException();

        // Lengths 0-7 are handled by an unrolled scalar fallthrough; longer inputs skip past the switch.
        switch (count)
        {
            case 7: destination[6] = x[6] | y[6]; goto case 6;
            case 6: destination[5] = x[5] | y[5]; goto case 5;
            case 5: destination[4] = x[4] | y[4]; goto case 4;
            case 4: destination[3] = x[3] | y[3]; goto case 3;
            case 3: destination[2] = x[2] | y[2]; goto case 2;
            case 2: destination[1] = x[1] | y[1]; goto case 1;
            case 1: destination[0] = x[0] | y[0]; return;
            case 0: return;
        }

        uint i = 0;
        ref int left = ref MemoryMarshal.GetReference(x);
        ref int right = ref MemoryMarshal.GetReference(y);
        ref int dest = ref MemoryMarshal.GetReference(destination);

        // Vectorize with the widest supported width; any remainder is handled by the scalar loop below.
        if (Vector512.IsHardwareAccelerated && count >= Vector512<int>.Count)
        {
            for (; i < count - (Vector512<int>.Count - 1u); i += (uint)Vector512<int>.Count)
            {
                Vector512<int> result = Vector512.LoadUnsafe(ref left, i) | Vector512.LoadUnsafe(ref right, i);
                result.StoreUnsafe(ref dest, i);
            }
        }
        else if (Vector256.IsHardwareAccelerated && count >= Vector256<int>.Count)
        {
            for (; i < count - (Vector256<int>.Count - 1u); i += (uint)Vector256<int>.Count)
            {
                Vector256<int> result = Vector256.LoadUnsafe(ref left, i) | Vector256.LoadUnsafe(ref right, i);
                result.StoreUnsafe(ref dest, i);
            }
        }
        else if (Vector128.IsHardwareAccelerated && count >= Vector128<int>.Count)
        {
            for (; i < count - (Vector128<int>.Count - 1u); i += (uint)Vector128<int>.Count)
            {
                Vector128<int> result = Vector128.LoadUnsafe(ref left, i) | Vector128.LoadUnsafe(ref right, i);
                result.StoreUnsafe(ref dest, i);
            }
        }

        for (; i < count; i++)
            destination[(int)i] = x[(int)i] | y[(int)i];
    }
}
```

For large spans, the `TensorPrimitives` performance regresses when the destination is one of the inputs. The source code contains a comment about non-temporal stores that could help to explain the phenomenon.
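For completeness, these benchmarks can be driven by a standard BenchmarkDotNet entry point; the `Program` wrapper below is an assumed addition for illustration (not part of the original post), and the project needs to be built in Release to get meaningful numbers.

```csharp
using BenchmarkDotNet.Running;

namespace Benchmarks;

public static class Program
{
    public static void Main()
    {
        // Discovers the [Benchmark] methods above and runs them for every [Params] value.
        BenchmarkRunner.Run<OldBitArrayVsTensorPrimitivesBenchmarks>();
    }
}
```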
I am thankful for all answers and everything new that I am able to learn 🙏
Replies: 1 comment 1 reply
At a certain point (256KB right now, so anything more than 64k `int`/`uint` values) we start using non-temporal stores because it is more beneficial to real world scenarios. However, this comes with a tradeoff in that it can make microbenchmarks look worse. The source code comment explains the reasoning:

```csharp
/// A non-temporal store is one that allows the CPU to bypass the cache when writing to memory.
///
/// This can be beneficial when working with large amounts of memory where the writes would otherwise
/// cause large amounts of repeated updates and evictions. The hardware optimization manuals recommend
/// the threshold to be roughly half the size of the last level of on-die cache -- that is, if you have approximately
/// 4MB of L3 cache per core, you'd want this to be approx. 1-2MB, depending on if hyperthreading was enabled.
///
/// However, actually computing the amount of L3 cache per core can be tricky or error prone. Native memcpy
/// algorithms use a constant threshold that is typically around 256KB and we match that here for simplicity. This
/// threshold accounts for most processors in the last 10-15 years that had approx. 1MB L3 per core and support
/// hyperthreading, giving a per core last level cache of approx. 512KB.
```

The general consideration is that microbenchmarks are effectively doing nothing but running the same code over and over, so they benefit greatly from caching and branch prediction, often giving results that may not line up with real world usage. A typical real world application will be running on a machine alongside many other processes and services at the "same time". It will likely be running other application logic, touching a wider range of data, etc. This means you typically won't get the "optimal" results where the branch predictor is perfectly trained for the loop, and you are more likely to see regular cache misses because other data needs to fit into the cache, causing your data to be evicted.

Non-temporal stores are used (and recommended by the architecture manuals) when you know you're working with data that is roughly larger than 50% of the available L3 cache space per core. This is because you're touching "enough" data that you're going to effectively evict all other data from the cache and pessimize the rest of the system. So the tradeoff is to be a "good" citizen and keep the entire system running smoothly by allowing your own code path to run a little slower (which in effect makes the end-to-end app run faster).
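To connect that threshold to the benchmark sizes above: the buffers hold 4-byte `int` values, so only the largest `[Params]` value clearly exceeds 256KB. A quick sketch of the arithmetic, assuming the 256KB figure from the comment as the cutoff (the exact comparison TensorPrimitives performs internally may differ):

```csharp
using System;

const long NonTemporalThresholdBytes = 256 * 1024; // 262,144 bytes, per the comment above

foreach (int bufferLength in new[] { 16_000, 32_000, 64_000, 128_000 })
{
    long bytes = (long)bufferLength * sizeof(int); // 4 bytes per int element
    Console.WriteLine($"{bufferLength,7:N0} ints = {bytes,9:N0} bytes -> over threshold: {bytes > NonTemporalThresholdBytes}");
}
// Only the 128,000-element case (512,000 bytes) is clearly past the ~256KB cutoff,
// which matches the observation that the regression shows up for the large spans.
```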
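As a standalone illustration of what "bypassing the cache when writing to memory" looks like at the instruction level, here is a minimal sketch using the public x86 intrinsics. This is not how TensorPrimitives issues its non-temporal stores internally; it is only meant to show the kind of store the comment describes. It assumes an SSE2-capable machine and a project with `AllowUnsafeBlocks` enabled, and falls back to a normal copy otherwise.

```csharp
using System;
using System.Runtime.InteropServices;
using System.Runtime.Intrinsics;
using System.Runtime.Intrinsics.X86;

unsafe
{
    const int Count = 1024; // multiple of Vector128<int>.Count (4)
    nuint byteCount = (nuint)(Count * sizeof(int));

    // MOVNTDQ, the non-temporal store used below, requires a 16-byte aligned destination.
    int* source = (int*)NativeMemory.AlignedAlloc(byteCount, 16);
    int* dest = (int*)NativeMemory.AlignedAlloc(byteCount, 16);

    for (int i = 0; i < Count; i++)
        source[i] = i;

    if (Sse2.IsSupported)
    {
        for (int i = 0; i < Count; i += Vector128<int>.Count)
        {
            Vector128<int> v = Sse2.LoadAlignedVector128(source + i);
            Sse2.StoreAlignedNonTemporal(dest + i, v); // write to memory, bypassing the cache hierarchy
        }
        Sse.StoreFence(); // SFENCE: make the non-temporal writes visible before reading them back
    }
    else
    {
        new ReadOnlySpan<int>(source, Count).CopyTo(new Span<int>(dest, Count));
    }

    Console.WriteLine(dest[Count - 1]); // 1023

    NativeMemory.AlignedFree(source);
    NativeMemory.AlignedFree(dest);
}
```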