-
-
Notifications
You must be signed in to change notification settings - Fork 200
String Hashing
For MSBuild BinaryLogger we decided to go with this one, which is very good and super simple: https://github.com/dotnet/msbuild/blob/e1de0c476fac967bd1f1d4c17e5c8a70513cfdee/src/Build/Logging/BinaryLogger/BuildEventArgsWriter.cs#L1218-L1246
Microsoft.NET.StringTools.dll (part of MSBuild) chose djb2: https://github.com/dotnet/msbuild/blob/e1de0c476fac967bd1f1d4c17e5c8a70513cfdee/src/StringTools/InternableString.cs#L310-L397
Looking at deduplicating strings in an MSBuild binlog. Can't hold on to strings to catch collisions so need to rely on hash function not having collisions. Datasets include binlogs with up to 2.7 million strings. Different binlogs have different distributions on string sizes, so hash function behavior will be very different on similar-sized binlogs.
I ran the tests on .NET Framework 64-bit. MSBuild is mostly 32-bit but I couldn't fit the large binlogs in a 32-bit process so had to run the test in 64-bit. The micro-benchmarks ran in .NET Framework 32-bit. You can run on .NET 5 and I'm sure you'll get different numbers but I needed this for MSBuild.
The binlog dataset is not publicly available, only the micro-benchmark is public.
Spoiler alert: I'm going with FNV-1a 64-bit hash size and an "optimization": https://github.com/KirillOsenkov/Benchmarks/blob/9cf36b1bd72a6768b5cb4d3a7838571e88bd1e4b/Benchmarks/Tests/StringHash.Fnv.cs#L54-L72
- benchmarks and source code: https://github.com/KirillOsenkov/Benchmarks/blob/master/Benchmarks/Tests/StringHash.cs
- https://github.com/KirillOsenkov/Benchmarks/blob/master/Benchmarks/Tests/StringHash.Fnv.cs
- https://github.com/KirillOsenkov/Benchmarks/blob/master/Benchmarks/Tests/StringHash.Marvin.cs (Thanks to https://twitter.com/ben_a_adams)
- https://github.com/KirillOsenkov/Benchmarks/blob/master/Benchmarks/Tests/StringHash.djb2.cs
- https://github.com/uranium62/xxHash/tree/master/src/Standart.Hash.xxHash
I've evaluated 32-bit and 64-bit hash sizes. All functions tested are roughly equivalent in terms of collisions and provide a good even distribution. The Birthday Paradox roughly estimates the number of collisions per dataset size. I've found that 32-bit hash size is not enough for the datasets. All hash functions start having collisions between 30,000-100,000 strings. For largest binlogs at 2.7 million strings all 32-bit functions have ~800 collisions which is too much. Having a collision would mean the user would be shown a completely unrelated random string in a random location and this is very confusing.
Good news is that 64-bit hash size had 0 collisions for all binlogs I tested. I don't know how much larger the datasets have to get to start hitting 64-bit collisions. But good news is that collisions are unlikely (but possible!) for the current size.
Going with a 64-bit hash size.
The results vary heavily based on the binlog. xxHash32 and Marvin compete for the first place on large binlogs. FNV and others behaved well. I'd say xxHash is the overall winner in terms of perf.
Results for the largest binlog:
Strings: 2,744,490
Fnv1a64 Collisions: 0 Elapsed: 00:00:10.7489566
Fnv1a64Fast Collisions: 0 Elapsed: 00:00:11.0249801
xxHash64 Collisions: 0 Elapsed: 00:00:11.5102876
Murmur3-128 Collisions: 0 Elapsed: 00:00:12.8930213
Fnv1a32Fast Collisions: 836 Elapsed: 00:00:08.2846016
xxHash32 Collisions: 913 Elapsed: 00:00:09.5995559
Murmur3-32 Collisions: 905 Elapsed: 00:00:09.6880563
djb2 Collisions: 943 Elapsed: 00:00:10.3829440
Fnv1a32 Collisions: 920 Elapsed: 00:00:11.8845191
GetHashCode Collisions: 880 Elapsed: 00:00:12.1517814
Marvin Collisions: 874 Elapsed: 00:00:12.7311613
When hashing a string the canonical FNV-1a dictates that each octet needs to be mixed in separately. So for a .NET char you need to do it twice:
char ch = text[i];
byte b = (byte)ch;
hash ^= b;
hash *= Prime;
b = (byte)(ch >> 8);
hash ^= b;
hash *= Prime;
However I decided to see if I can get away with mixing in the entire char in one fell swoop:
char ch = text[i];
hash = (hash ^ ch) * Prime;
and the 64-bit hash still had no collisions and performance improved roughly 2x, so going to go with this optimization. This optimization for 32-bit did result in more collisions than the canonical version in some situations:
Fnv1a32Fast Collisions: 15 Elapsed: 00:00:00.9699976
Fnv1a32 Collisions: 8 Elapsed: 00:00:01.7843502
but not in others:
Fnv1a32Fast Collisions: 9 Elapsed: 00:00:00.2443658
Fnv1a32 Collisions: 17 Elapsed: 00:00:00.4131953
roughly the collisions stayed on par but performance improved 2x.
Strings: 5,755
xxHash64 Collisions: 0 Elapsed: 00:00:00.0026576
GetHashCode Collisions: 0 Elapsed: 00:00:00.0032591
Murmur3-128 Collisions: 0 Elapsed: 00:00:00.0036303
Murmur3-32 Collisions: 0 Elapsed: 00:00:00.0045689
djb2 Collisions: 0 Elapsed: 00:00:00.0057463
Fnv1a64Fast Collisions: 0 Elapsed: 00:00:00.0061315
Fnv1a64 Collisions: 0 Elapsed: 00:00:00.0106836
Fnv1a32 Collisions: 0 Elapsed: 00:00:00.0108679
Marvin Collisions: 0 Elapsed: 00:00:00.0121878
xxHash32 Collisions: 0 Elapsed: 00:00:00.0131451
Fnv1a32Fast Collisions: 0 Elapsed: 00:00:00.0214134
Strings: 8,978
Murmur3-128 Collisions: 0 Elapsed: 00:00:00.0014915
GetHashCode Collisions: 0 Elapsed: 00:00:00.0016946
xxHash32 Collisions: 0 Elapsed: 00:00:00.0017028
xxHash64 Collisions: 0 Elapsed: 00:00:00.0018599
djb2 Collisions: 0 Elapsed: 00:00:00.0022298
Murmur3-32 Collisions: 0 Elapsed: 00:00:00.0022457
Fnv1a32Fast Collisions: 0 Elapsed: 00:00:00.0025141
Fnv1a32 Collisions: 0 Elapsed: 00:00:00.0042188
Fnv1a64 Collisions: 0 Elapsed: 00:00:00.0050270
Fnv1a64Fast Collisions: 0 Elapsed: 00:00:00.0051942
Marvin Collisions: 0 Elapsed: 00:00:00.0064653
Strings: 6,170
xxHash64 Collisions: 0 Elapsed: 00:00:00.0010432
Murmur3-128 Collisions: 0 Elapsed: 00:00:00.0011453
GetHashCode Collisions: 0 Elapsed: 00:00:00.0015364
Murmur3-32 Collisions: 0 Elapsed: 00:00:00.0015985
Marvin Collisions: 0 Elapsed: 00:00:00.0018468
Fnv1a64Fast Collisions: 0 Elapsed: 00:00:00.0023387
djb2 Collisions: 0 Elapsed: 00:00:00.0023605
Fnv1a32Fast Collisions: 0 Elapsed: 00:00:00.0023911
xxHash32 Collisions: 0 Elapsed: 00:00:00.0025984
Fnv1a32 Collisions: 0 Elapsed: 00:00:00.0041617
Fnv1a64 Collisions: 0 Elapsed: 00:00:00.0041768
Strings: 5,767
xxHash64 Collisions: 0 Elapsed: 00:00:00.0035053
xxHash32 Collisions: 0 Elapsed: 00:00:00.0046727
Murmur3-128 Collisions: 0 Elapsed: 00:00:00.0052897
GetHashCode Collisions: 0 Elapsed: 00:00:00.0074348
Marvin Collisions: 0 Elapsed: 00:00:00.0132039
djb2 Collisions: 0 Elapsed: 00:00:00.0140372
Fnv1a32Fast Collisions: 0 Elapsed: 00:00:00.0140615
Fnv1a64Fast Collisions: 0 Elapsed: 00:00:00.0141157
Fnv1a32 Collisions: 0 Elapsed: 00:00:00.0270089
Fnv1a64 Collisions: 0 Elapsed: 00:00:00.0271505
Murmur3-32 Collisions: 1 Elapsed: 00:00:00.0089661
Strings: 10,577
xxHash64 Collisions: 0 Elapsed: 00:00:00.0043155
xxHash32 Collisions: 0 Elapsed: 00:00:00.0055293
Murmur3-128 Collisions: 0 Elapsed: 00:00:00.0061946
GetHashCode Collisions: 0 Elapsed: 00:00:00.0073197
Marvin Collisions: 0 Elapsed: 00:00:00.0106193
Fnv1a32Fast Collisions: 0 Elapsed: 00:00:00.0132216
Fnv1a64Fast Collisions: 0 Elapsed: 00:00:00.0135098
djb2 Collisions: 0 Elapsed: 00:00:00.0136523
Murmur3-32 Collisions: 0 Elapsed: 00:00:00.0143501
Fnv1a32 Collisions: 0 Elapsed: 00:00:00.0250287
Fnv1a64 Collisions: 0 Elapsed: 00:00:00.0288256
Strings: 13,144
xxHash32 Collisions: 0 Elapsed: 00:00:00.0068207
Murmur3-128 Collisions: 0 Elapsed: 00:00:00.0073917
GetHashCode Collisions: 0 Elapsed: 00:00:00.0091019
xxHash64 Collisions: 0 Elapsed: 00:00:00.0091676
Murmur3-32 Collisions: 0 Elapsed: 00:00:00.0110823
Fnv1a64Fast Collisions: 0 Elapsed: 00:00:00.0167608
Fnv1a32Fast Collisions: 0 Elapsed: 00:00:00.0168389
djb2 Collisions: 0 Elapsed: 00:00:00.0171546
Fnv1a64 Collisions: 0 Elapsed: 00:00:00.0321957
Fnv1a32 Collisions: 0 Elapsed: 00:00:00.0352745
Marvin Collisions: 1 Elapsed: 00:00:00.0129518
Strings: 22,963
xxHash64 Collisions: 0 Elapsed: 00:00:00.0052700
Murmur3-128 Collisions: 0 Elapsed: 00:00:00.0060257
GetHashCode Collisions: 0 Elapsed: 00:00:00.0061548
Murmur3-32 Collisions: 0 Elapsed: 00:00:00.0077195
Fnv1a64Fast Collisions: 0 Elapsed: 00:00:00.0099857
djb2 Collisions: 0 Elapsed: 00:00:00.0106575
xxHash32 Collisions: 0 Elapsed: 00:00:00.0122694
Fnv1a32Fast Collisions: 0 Elapsed: 00:00:00.0124988
Marvin Collisions: 0 Elapsed: 00:00:00.0132141
Fnv1a64 Collisions: 0 Elapsed: 00:00:00.0175389
Fnv1a32 Collisions: 0 Elapsed: 00:00:00.0216942
Strings: 33,245
Murmur3-128 Collisions: 0 Elapsed: 00:00:00.0055987
Fnv1a32Fast Collisions: 0 Elapsed: 00:00:00.0083299
xxHash64 Collisions: 0 Elapsed: 00:00:00.0112533
Fnv1a64Fast Collisions: 0 Elapsed: 00:00:00.0129992
Fnv1a32 Collisions: 0 Elapsed: 00:00:00.0139026
GetHashCode Collisions: 0 Elapsed: 00:00:00.0142404
Murmur3-32 Collisions: 0 Elapsed: 00:00:00.0143126
xxHash32 Collisions: 0 Elapsed: 00:00:00.0151984
Fnv1a64 Collisions: 0 Elapsed: 00:00:00.0180864
Marvin Collisions: 0 Elapsed: 00:00:00.0205346
djb2 Collisions: 1 Elapsed: 00:00:00.0092903
Strings: 40,004
xxHash64 Collisions: 0 Elapsed: 00:00:00.0611167
xxHash32 Collisions: 0 Elapsed: 00:00:00.0808315
Murmur3-128 Collisions: 0 Elapsed: 00:00:00.1196355
GetHashCode Collisions: 0 Elapsed: 00:00:00.1333144
Murmur3-32 Collisions: 0 Elapsed: 00:00:00.1646332
Marvin Collisions: 0 Elapsed: 00:00:00.1926077
Fnv1a64Fast Collisions: 0 Elapsed: 00:00:00.2671009
Fnv1a32Fast Collisions: 0 Elapsed: 00:00:00.2682030
djb2 Collisions: 0 Elapsed: 00:00:00.2741916
Fnv1a32 Collisions: 0 Elapsed: 00:00:00.5120110
Fnv1a64 Collisions: 0 Elapsed: 00:00:00.5264215
Strings: 95,200
Murmur3-128 Collisions: 0 Elapsed: 00:00:00.0522342
xxHash64 Collisions: 0 Elapsed: 00:00:00.0596649
Murmur3-32 Collisions: 0 Elapsed: 00:00:00.0954755
Fnv1a64Fast Collisions: 0 Elapsed: 00:00:00.1185492
Fnv1a64 Collisions: 0 Elapsed: 00:00:00.1993123
xxHash32 Collisions: 1 Elapsed: 00:00:00.0498529
GetHashCode Collisions: 1 Elapsed: 00:00:00.0617051
Marvin Collisions: 3 Elapsed: 00:00:00.0875014
Fnv1a32Fast Collisions: 2 Elapsed: 00:00:00.1141006
djb2 Collisions: 2 Elapsed: 00:00:00.1203841
Fnv1a32 Collisions: 2 Elapsed: 00:00:00.2019409
Strings: 288,134
Murmur3-128 Collisions: 0 Elapsed: 00:00:00.1702985
xxHash64 Collisions: 0 Elapsed: 00:00:00.1863770
Fnv1a64Fast Collisions: 0 Elapsed: 00:00:00.2373387
Fnv1a64 Collisions: 0 Elapsed: 00:00:00.3931002
xxHash32 Collisions: 9 Elapsed: 00:00:00.1286781
GetHashCode Collisions: 9 Elapsed: 00:00:00.1491944
Murmur3-32 Collisions: 9 Elapsed: 00:00:00.1840540
Marvin Collisions: 12 Elapsed: 00:00:00.2290242
djb2 Collisions: 17 Elapsed: 00:00:00.2416959
Fnv1a32Fast Collisions: 9 Elapsed: 00:00:00.2426100
Fnv1a32 Collisions: 17 Elapsed: 00:00:00.3948087
Strings: 220,177
xxHash64 Collisions: 0 Elapsed: 00:00:00.1724258
Murmur3-128 Collisions: 0 Elapsed: 00:00:00.2341446
Fnv1a64Fast Collisions: 0 Elapsed: 00:00:00.5152870
Fnv1a64 Collisions: 0 Elapsed: 00:00:00.9254929
xxHash32 Collisions: 1 Elapsed: 00:00:00.2250230
GetHashCode Collisions: 8 Elapsed: 00:00:00.2954302
Murmur3-32 Collisions: 6 Elapsed: 00:00:00.3391128
Marvin Collisions: 6 Elapsed: 00:00:00.4150414
djb2 Collisions: 12 Elapsed: 00:00:00.5085072
Fnv1a32Fast Collisions: 4 Elapsed: 00:00:00.5168778
Fnv1a32 Collisions: 4 Elapsed: 00:00:00.9458197
Strings: 302,685
xxHash64 Collisions: 0 Elapsed: 00:00:00.2679292
Murmur3-128 Collisions: 0 Elapsed: 00:00:00.4355118
Fnv1a64Fast Collisions: 0 Elapsed: 00:00:00.9426571
Fnv1a64 Collisions: 0 Elapsed: 00:00:01.7275618
xxHash32 Collisions: 11 Elapsed: 00:00:00.3675883
GetHashCode Collisions: 16 Elapsed: 00:00:00.5649634
Murmur3-32 Collisions: 11 Elapsed: 00:00:00.5997929
Marvin Collisions: 15 Elapsed: 00:00:00.6960864
djb2 Collisions: 9 Elapsed: 00:00:00.9451107
Fnv1a32Fast Collisions: 15 Elapsed: 00:00:00.9599598
Fnv1a32 Collisions: 8 Elapsed: 00:00:01.7636629
Strings: 681,346
xxHash64 Collisions: 0 Elapsed: 00:00:04.3879496
Murmur3-128 Collisions: 0 Elapsed: 00:00:06.9035179
Fnv1a64 Collisions: 0 Elapsed: 00:00:08.8463328
Fnv1a64Fast Collisions: 0 Elapsed: 00:00:15.8771657
Marvin Collisions: 71 Elapsed: 00:00:03.3497778
Murmur3-32 Collisions: 52 Elapsed: 00:00:06.7033923
djb2 Collisions: 57 Elapsed: 00:00:06.9822773
xxHash32 Collisions: 54 Elapsed: 00:00:07.6723920
Fnv1a32 Collisions: 67 Elapsed: 00:00:12.8489714
GetHashCode Collisions: 54 Elapsed: 00:00:18.8672386
Fnv1a32Fast Collisions: 56 Elapsed: 00:00:18.9755531
Strings: 2,744,490
Fnv1a64 Collisions: 0 Elapsed: 00:00:10.7489566
Fnv1a64Fast Collisions: 0 Elapsed: 00:00:11.0249801
xxHash64 Collisions: 0 Elapsed: 00:00:11.5102876
Murmur3-128 Collisions: 0 Elapsed: 00:00:12.8930213
Fnv1a32Fast Collisions: 836 Elapsed: 00:00:08.2846016
xxHash32 Collisions: 913 Elapsed: 00:00:09.5995559
Murmur3-32 Collisions: 905 Elapsed: 00:00:09.6880563
djb2 Collisions: 943 Elapsed: 00:00:10.3829440
Fnv1a32 Collisions: 920 Elapsed: 00:00:11.8845191
GetHashCode Collisions: 880 Elapsed: 00:00:12.1517814
Marvin Collisions: 874 Elapsed: 00:00:12.7311613
Strings: 2,711,714
Murmur3-128 Collisions: 0 Elapsed: 00:00:10.1355745
xxHash64 Collisions: 0 Elapsed: 00:00:11.5539888
Fnv1a64Fast Collisions: 0 Elapsed: 00:00:12.1999304
Fnv1a64 Collisions: 0 Elapsed: 00:00:13.8865074
Marvin Collisions: 861 Elapsed: 00:00:10.0699179
Fnv1a32Fast Collisions: 888 Elapsed: 00:00:11.0205952
GetHashCode Collisions: 817 Elapsed: 00:00:11.0887121
xxHash32 Collisions: 845 Elapsed: 00:00:11.8177209
djb2 Collisions: 842 Elapsed: 00:00:12.3049021
Murmur3-32 Collisions: 855 Elapsed: 00:00:12.3816100
Fnv1a32 Collisions: 859 Elapsed: 00:00:14.2399122
Strings: 746,697
xxHash64 Collisions: 0 Elapsed: 00:00:00.7954374
Murmur3-128 Collisions: 0 Elapsed: 00:00:01.1594128
Fnv1a64Fast Collisions: 0 Elapsed: 00:00:02.4248640
Fnv1a64 Collisions: 0 Elapsed: 00:00:04.6697070
xxHash32 Collisions: 61 Elapsed: 00:00:01.2658189
Murmur3-32 Collisions: 74 Elapsed: 00:00:01.6817723
Marvin Collisions: 69 Elapsed: 00:00:01.8497144
GetHashCode Collisions: 67 Elapsed: 00:00:02.3598280
Fnv1a32Fast Collisions: 60 Elapsed: 00:00:02.4199466
djb2 Collisions: 67 Elapsed: 00:00:02.5917555
Fnv1a32 Collisions: 71 Elapsed: 00:00:04.5650691
Strings: 1,633,734
xxHash64 Collisions: 0 Elapsed: 00:00:01.7773656
Murmur3-128 Collisions: 0 Elapsed: 00:00:01.9570083
Fnv1a64Fast Collisions: 0 Elapsed: 00:00:04.7777651
Fnv1a64 Collisions: 0 Elapsed: 00:00:05.8021424
xxHash32 Collisions: 308 Elapsed: 00:00:01.8537316
Murmur3-32 Collisions: 303 Elapsed: 00:00:02.4810003
Marvin Collisions: 315 Elapsed: 00:00:02.7430470
djb2 Collisions: 289 Elapsed: 00:00:03.3194793
Fnv1a32 Collisions: 330 Elapsed: 00:00:06.1967731
Fnv1a32Fast Collisions: 312 Elapsed: 00:00:12.0185989
GetHashCode Collisions: 324 Elapsed: 00:00:16.0363182