-
-
Notifications
You must be signed in to change notification settings - Fork 207
String Hashing
Looking at deduplicating strings in an MSBuild binlog. Can't hold on to strings to catch collisions so need to rely on hash function not having collisions. Datasets include binlogs with up to 2.7 million strings. Different binlogs have different distributions on string sizes, so hash function behavior will be very different on similar-sized binlogs.
- benchmarks and source code: https://github.com/KirillOsenkov/Benchmarks/blob/master/Benchmarks/Tests/StringHash.cs
- https://github.com/KirillOsenkov/Benchmarks/blob/master/Benchmarks/Tests/StringHash.Fnv.cs
- https://github.com/KirillOsenkov/Benchmarks/blob/master/Benchmarks/Tests/StringHash.Marvin.cs (Thanks to https://twitter.com/ben_a_adams)
- https://github.com/KirillOsenkov/Benchmarks/blob/master/Benchmarks/Tests/StringHash.djb2.cs
- https://github.com/uranium62/xxHash/tree/master/src/Standart.Hash.xxHash
I've evaluated 32-bit and 64-bit hash sizes. All functions tested are roughly equivalent in terms of collisions and provide a good even distribution. The Birthday Paradox roughly estimates the number of collisions per dataset size. I've found that 32-bit hash size is not enough for the datasets. All hash functions start having collisions between 30,000-100,000 strings. For largest binlogs at 2.7 million strings all 32-bit functions have ~800 collisions which is too much. Having a collision would mean the user would be shown a completely unrelated random string in a random location and this is very confusing.
Good news is that 64-bit hash size reliably had 0 collisions for all binlogs I tested. I don't know how much larger the datasets have to get to start hitting 64-bit collisions.
Going with a 64-bit hash size.
The results vary heavily based on the binlog. xxHash (both 32-bit and 64-bit) stood out as a winner usually. FNV behaved well.
Results for the largest binlog:
Strings: 2,744,490
Fnv1a64Fast Collisions: 0 Elapsed: 00:00:11.4512528
xxHash64 Collisions: 0 Elapsed: 00:00:12.4725302
Fnv1a64 Collisions: 0 Elapsed: 00:00:20.6213482
Marvin Collisions: 848 Elapsed: 00:00:10.1477506
xxHash32 Collisions: 913 Elapsed: 00:00:11.7118423
GetHashCode Collisions: 880 Elapsed: 00:00:12.6015684
Fnv1a32Fast Collisions: 836 Elapsed: 00:00:12.9828404
Fnv1a32 Collisions: 920 Elapsed: 00:00:14.3969376
djb2 Collisions: 943 Elapsed: 00:00:14.6716832
When hashing a string the canonical FNV-1a dictates that each octet needs to be mixed in separately. So for a .NET char you need to do it twice:
char ch = text[i];
byte b = (byte)ch;
hash ^= b;
hash *= Prime;
b = (byte)(ch >> 8);
hash ^= b;
hash *= Prime;
However I decided to see if I can get away with mixing in the entire char in one fell swoop:
char ch = text[i];
hash = (hash ^ ch) * Prime;
and the 64-bit hash still had no collisions and performance improved roughly 2x, so going to go with this optimization. This optimization for 32-bit did result in more collisions than the canonical version in some situations:
Fnv1a32Fast Collisions: 15 Elapsed: 00:00:00.9699976
Fnv1a32 Collisions: 8 Elapsed: 00:00:01.7843502
but not in others:
Fnv1a32Fast Collisions: 9 Elapsed: 00:00:00.2443658
Fnv1a32 Collisions: 17 Elapsed: 00:00:00.4131953
roughly the collisions stayed on par but performance improved 2x.
Strings: 5,755
xxHash64 Collisions: 0 Elapsed: 00:00:00.0026741
GetHashCode Collisions: 0 Elapsed: 00:00:00.0031527
djb2 Collisions: 0 Elapsed: 00:00:00.0057164
Fnv1a64Fast Collisions: 0 Elapsed: 00:00:00.0057362
Fnv1a32Fast Collisions: 0 Elapsed: 00:00:00.0063472
Fnv1a32 Collisions: 0 Elapsed: 00:00:00.0103995
Fnv1a64 Collisions: 0 Elapsed: 00:00:00.0112836
Marvin Collisions: 0 Elapsed: 00:00:00.0121214
xxHash32 Collisions: 0 Elapsed: 00:00:00.0130681
Strings: 8,978
xxHash32 Collisions: 0 Elapsed: 00:00:00.0014757
xxHash64 Collisions: 0 Elapsed: 00:00:00.0015259
GetHashCode Collisions: 0 Elapsed: 00:00:00.0016200
Fnv1a32Fast Collisions: 0 Elapsed: 00:00:00.0025707
djb2 Collisions: 0 Elapsed: 00:00:00.0036661
Fnv1a64 Collisions: 0 Elapsed: 00:00:00.0041440
Fnv1a32 Collisions: 0 Elapsed: 00:00:00.0047749
Fnv1a64Fast Collisions: 0 Elapsed: 00:00:00.0053239
Marvin Collisions: 0 Elapsed: 00:00:00.0056475
Strings: 5,767
xxHash64 Collisions: 0 Elapsed: 00:00:00.0033523
xxHash32 Collisions: 0 Elapsed: 00:00:00.0048384
GetHashCode Collisions: 0 Elapsed: 00:00:00.0073650
Marvin Collisions: 0 Elapsed: 00:00:00.0128889
djb2 Collisions: 0 Elapsed: 00:00:00.0138964
Fnv1a32Fast Collisions: 0 Elapsed: 00:00:00.0140239
Fnv1a64Fast Collisions: 0 Elapsed: 00:00:00.0140474
Fnv1a32 Collisions: 0 Elapsed: 00:00:00.0267951
Fnv1a64 Collisions: 0 Elapsed: 00:00:00.0268189
Strings: 6,170
xxHash64 Collisions: 0 Elapsed: 00:00:00.0013467
GetHashCode Collisions: 0 Elapsed: 00:00:00.0014064
Marvin Collisions: 0 Elapsed: 00:00:00.0021564
Fnv1a32Fast Collisions: 0 Elapsed: 00:00:00.0022922
djb2 Collisions: 0 Elapsed: 00:00:00.0023367
Fnv1a64Fast Collisions: 0 Elapsed: 00:00:00.0025382
Fnv1a64 Collisions: 0 Elapsed: 00:00:00.0041534
xxHash32 Collisions: 0 Elapsed: 00:00:00.0041642
Fnv1a32 Collisions: 0 Elapsed: 00:00:00.0042534
Strings: 10,577
xxHash64 Collisions: 0 Elapsed: 00:00:00.0042240
xxHash32 Collisions: 0 Elapsed: 00:00:00.0051933
GetHashCode Collisions: 0 Elapsed: 00:00:00.0070581
Marvin Collisions: 0 Elapsed: 00:00:00.0101968
Fnv1a32Fast Collisions: 0 Elapsed: 00:00:00.0130789
Fnv1a64Fast Collisions: 0 Elapsed: 00:00:00.0131046
djb2 Collisions: 0 Elapsed: 00:00:00.0132978
Fnv1a32 Collisions: 0 Elapsed: 00:00:00.0243679
Fnv1a64 Collisions: 0 Elapsed: 00:00:00.0276124
Strings: 13,144
xxHash32 Collisions: 0 Elapsed: 00:00:00.0069490
GetHashCode Collisions: 0 Elapsed: 00:00:00.0087925
xxHash64 Collisions: 0 Elapsed: 00:00:00.0098605
Marvin Collisions: 0 Elapsed: 00:00:00.0134604
Fnv1a64Fast Collisions: 0 Elapsed: 00:00:00.0171348
djb2 Collisions: 0 Elapsed: 00:00:00.0174824
Fnv1a32Fast Collisions: 0 Elapsed: 00:00:00.0176932
Fnv1a64 Collisions: 0 Elapsed: 00:00:00.0321770
Fnv1a32 Collisions: 0 Elapsed: 00:00:00.0352590
Strings: 22,963
xxHash64 Collisions: 0 Elapsed: 00:00:00.0053455
GetHashCode Collisions: 0 Elapsed: 00:00:00.0066013
Fnv1a64Fast Collisions: 0 Elapsed: 00:00:00.0101986
djb2 Collisions: 0 Elapsed: 00:00:00.0106974
Fnv1a32Fast Collisions: 0 Elapsed: 00:00:00.0128123
xxHash32 Collisions: 0 Elapsed: 00:00:00.0132219
Marvin Collisions: 0 Elapsed: 00:00:00.0139921
Fnv1a64 Collisions: 0 Elapsed: 00:00:00.0178608
Fnv1a32 Collisions: 0 Elapsed: 00:00:00.0218732
Strings: 33,245
Fnv1a32Fast Collisions: 0 Elapsed: 00:00:00.0085631
djb2 Collisions: 1 Elapsed: 00:00:00.0094651
xxHash64 Collisions: 0 Elapsed: 00:00:00.0124513
Fnv1a64Fast Collisions: 0 Elapsed: 00:00:00.0126791
Fnv1a32 Collisions: 0 Elapsed: 00:00:00.0141275
GetHashCode Collisions: 0 Elapsed: 00:00:00.0148957
xxHash32 Collisions: 0 Elapsed: 00:00:00.0152184
Fnv1a64 Collisions: 0 Elapsed: 00:00:00.0188528
Marvin Collisions: 0 Elapsed: 00:00:00.0276599
Strings: 40,004
xxHash64 Collisions: 0 Elapsed: 00:00:00.0703521
xxHash32 Collisions: 0 Elapsed: 00:00:00.0835046
GetHashCode Collisions: 0 Elapsed: 00:00:00.1443608
Marvin Collisions: 0 Elapsed: 00:00:00.1970208
djb2 Collisions: 0 Elapsed: 00:00:00.2701778
Fnv1a32Fast Collisions: 0 Elapsed: 00:00:00.2723283
Fnv1a64Fast Collisions: 0 Elapsed: 00:00:00.2756372
Fnv1a64 Collisions: 0 Elapsed: 00:00:00.5228534
Fnv1a32 Collisions: 0 Elapsed: 00:00:00.5255990
Strings: 95,200
xxHash64 Collisions: 0 Elapsed: 00:00:00.0591592
Fnv1a64Fast Collisions: 0 Elapsed: 00:00:00.1143691
Fnv1a64 Collisions: 0 Elapsed: 00:00:00.1989501
xxHash32 Collisions: 1 Elapsed: 00:00:00.0499008
GetHashCode Collisions: 1 Elapsed: 00:00:00.0622880
Marvin Collisions: 1 Elapsed: 00:00:00.0859473
Fnv1a32Fast Collisions: 2 Elapsed: 00:00:00.1122614
djb2 Collisions: 2 Elapsed: 00:00:00.1145779
Fnv1a32 Collisions: 2 Elapsed: 00:00:00.2019418
Strings: 220,177
xxHash64 Collisions: 0 Elapsed: 00:00:00.1688599
Fnv1a64Fast Collisions: 0 Elapsed: 00:00:00.5063542
Fnv1a64 Collisions: 0 Elapsed: 00:00:00.9296767
xxHash32 Collisions: 1 Elapsed: 00:00:00.2017256
GetHashCode Collisions: 8 Elapsed: 00:00:00.2838376
Marvin Collisions: 6 Elapsed: 00:00:00.3951592
Fnv1a32Fast Collisions: 4 Elapsed: 00:00:00.4974939
djb2 Collisions: 12 Elapsed: 00:00:00.5339619
Fnv1a32 Collisions: 4 Elapsed: 00:00:00.9415688
Strings: 288,134
xxHash64 Collisions: 0 Elapsed: 00:00:00.1169812
Fnv1a64Fast Collisions: 0 Elapsed: 00:00:00.2529665
Fnv1a64 Collisions: 0 Elapsed: 00:00:00.3995170
xxHash32 Collisions: 9 Elapsed: 00:00:00.1363369
GetHashCode Collisions: 9 Elapsed: 00:00:00.1457226
Marvin Collisions: 11 Elapsed: 00:00:00.1868881
djb2 Collisions: 17 Elapsed: 00:00:00.2395169
Fnv1a32Fast Collisions: 9 Elapsed: 00:00:00.2443658
Fnv1a32 Collisions: 17 Elapsed: 00:00:00.4131953
Strings: 302,685
xxHash64 Collisions: 0 Elapsed: 00:00:00.2654320
Fnv1a64Fast Collisions: 0 Elapsed: 00:00:00.9521811
Fnv1a64 Collisions: 0 Elapsed: 00:00:01.7610004
xxHash32 Collisions: 11 Elapsed: 00:00:00.4000200
GetHashCode Collisions: 16 Elapsed: 00:00:00.5674189
Marvin Collisions: 16 Elapsed: 00:00:00.7259080
djb2 Collisions: 9 Elapsed: 00:00:00.9613453
Fnv1a32Fast Collisions: 15 Elapsed: 00:00:00.9699976
Fnv1a32 Collisions: 8 Elapsed: 00:00:01.7843502
Strings: 681,346
xxHash64 Collisions: 0 Elapsed: 00:00:14.2564708
Fnv1a64 Collisions: 0 Elapsed: 00:00:15.7363333
Fnv1a64Fast Collisions: 0 Elapsed: 00:00:17.6527663
xxHash32 Collisions: 54 Elapsed: 00:00:13.1265490
Fnv1a32Fast Collisions: 56 Elapsed: 00:00:16.1640268
djb2 Collisions: 57 Elapsed: 00:00:16.9421441
Marvin Collisions: 50 Elapsed: 00:00:17.4443557
GetHashCode Collisions: 54 Elapsed: 00:00:19.2028944
Fnv1a32 Collisions: 67 Elapsed: 00:00:20.4274275
Strings: 746,697
xxHash64 Collisions: 0 Elapsed: 00:00:01.0335703
Fnv1a64Fast Collisions: 0 Elapsed: 00:00:03.0520370
Fnv1a64 Collisions: 0 Elapsed: 00:00:05.7027973
xxHash32 Collisions: 61 Elapsed: 00:00:01.3509968
Marvin Collisions: 74 Elapsed: 00:00:02.1738263
djb2 Collisions: 67 Elapsed: 00:00:02.8742604
Fnv1a32Fast Collisions: 60 Elapsed: 00:00:03.8859331
Fnv1a32 Collisions: 71 Elapsed: 00:00:05.3893214
GetHashCode Collisions: 67 Elapsed: 00:00:11.3145967
Strings: 1,633,734
xxHash64 Collisions: 0 Elapsed: 00:00:01.4111786
Fnv1a64 Collisions: 0 Elapsed: 00:00:06.9580604
Fnv1a64Fast Collisions: 0 Elapsed: 00:00:13.3455648
xxHash32 Collisions: 308 Elapsed: 00:00:01.9220260
Marvin Collisions: 323 Elapsed: 00:00:03.0514329
djb2 Collisions: 289 Elapsed: 00:00:03.3404543
Fnv1a32 Collisions: 330 Elapsed: 00:00:11.9088878
Fnv1a32Fast Collisions: 312 Elapsed: 00:00:15.3062299
GetHashCode Collisions: 324 Elapsed: 00:00:16.1924285
Strings: 2,711,714
xxHash64 Collisions: 0 Elapsed: 00:00:07.9579064
Fnv1a64Fast Collisions: 0 Elapsed: 00:00:08.2236800
Fnv1a64 Collisions: 0 Elapsed: 00:00:20.7533625
Marvin Collisions: 858 Elapsed: 00:00:05.8786134
xxHash32 Collisions: 845 Elapsed: 00:00:11.5977205
GetHashCode Collisions: 817 Elapsed: 00:00:11.7662504
Fnv1a32Fast Collisions: 888 Elapsed: 00:00:13.0753472
djb2 Collisions: 842 Elapsed: 00:00:14.4005871
Fnv1a32 Collisions: 859 Elapsed: 00:00:15.3333468
Strings: 2,744,490
Fnv1a64Fast Collisions: 0 Elapsed: 00:00:11.4512528
xxHash64 Collisions: 0 Elapsed: 00:00:12.4725302
Fnv1a64 Collisions: 0 Elapsed: 00:00:20.6213482
Marvin Collisions: 848 Elapsed: 00:00:10.1477506
xxHash32 Collisions: 913 Elapsed: 00:00:11.7118423
GetHashCode Collisions: 880 Elapsed: 00:00:12.6015684
Fnv1a32Fast Collisions: 836 Elapsed: 00:00:12.9828404
Fnv1a32 Collisions: 920 Elapsed: 00:00:14.3969376
djb2 Collisions: 943 Elapsed: 00:00:14.6716832