Skip to content

String Hashing

Kirill Osenkov edited this page Jan 7, 2021 · 12 revisions

Evaluating non-cryptographic zero-allocation string hash functions

Looking at deduplicating strings in an MSBuild binlog. Can't hold on to strings to catch collisions so need to rely on hash function not having collisions. Datasets include binlogs with up to 2.7 million strings. Different binlogs have different distributions on string sizes, so hash function behavior will be very different on similar-sized binlogs.

Links:

Collisions

I've evaluated 32-bit and 64-bit hash sizes. All functions tested are roughly equivalent in terms of collisions and provide a good even distribution. The Birthday Paradox roughly estimates the number of collisions per dataset size. I've found that 32-bit hash size is not enough for the datasets. All hash functions start having collisions between 30,000-100,000 strings. For largest binlogs at 2.7 million strings all 32-bit functions have ~800 collisions which is too much. Having a collision would mean the user would be shown a completely unrelated random string in a random location and this is very confusing.

Good news is that 64-bit hash size reliably had 0 collisions for all binlogs I tested. I don't know how much larger the datasets have to get to start hitting 64-bit collisions.

Going with a 64-bit hash size.

Perf

The results vary heavily based on the binlog. xxHash (both 32-bit and 64-bit) stood out as a winner usually. FNV behaved well.

Results for the largest binlog:

Strings: 2,744,490
  Fnv1a64Fast   Collisions: 0 Elapsed: 00:00:11.4512528
  xxHash64      Collisions: 0 Elapsed: 00:00:12.4725302
  Fnv1a64       Collisions: 0 Elapsed: 00:00:20.6213482
  Marvin        Collisions: 848 Elapsed: 00:00:10.1477506
  xxHash32      Collisions: 913 Elapsed: 00:00:11.7118423
  GetHashCode   Collisions: 880 Elapsed: 00:00:12.6015684
  Fnv1a32Fast   Collisions: 836 Elapsed: 00:00:12.9828404
  Fnv1a32       Collisions: 920 Elapsed: 00:00:14.3969376
  djb2          Collisions: 943 Elapsed: 00:00:14.6716832

Faster FNV-1a

When hashing a string the canonical FNV-1a dictates that each octet needs to be mixed in separately. So for a .NET char you need to do it twice:

    char ch = text[i];
    byte b = (byte)ch;
    hash ^= b;
    hash *= Prime;

    b = (byte)(ch >> 8);
    hash ^= b;
    hash *= Prime;

However I decided to see if I can get away with mixing in the entire char in one fell swoop:

    char ch = text[i];
    hash = (hash ^ ch) * Prime;

and the 64-bit hash still had no collisions and performance improved roughly 2x, so going to go with this optimization. This optimization for 32-bit did result in more collisions than the canonical version in some situations:

  Fnv1a32Fast  Collisions: 15  Elapsed: 00:00:00.9699976
  Fnv1a32      Collisions: 8   Elapsed: 00:00:01.7843502

but not in others:

  Fnv1a32Fast  Collisions: 9   Elapsed: 00:00:00.2443658
  Fnv1a32      Collisions: 17  Elapsed: 00:00:00.4131953

roughly the collisions stayed on par but performance improved 2x.

Full results

Strings: 5,755
  xxHash64		Collisions: 0 Elapsed: 00:00:00.0026741
  GetHashCode	Collisions: 0 Elapsed: 00:00:00.0031527
  djb2			Collisions: 0 Elapsed: 00:00:00.0057164
  Fnv1a64Fast	Collisions: 0 Elapsed: 00:00:00.0057362
  Fnv1a32Fast	Collisions: 0 Elapsed: 00:00:00.0063472
  Fnv1a32		Collisions: 0 Elapsed: 00:00:00.0103995
  Fnv1a64		Collisions: 0 Elapsed: 00:00:00.0112836
  Marvin		Collisions: 0 Elapsed: 00:00:00.0121214
  xxHash32		Collisions: 0 Elapsed: 00:00:00.0130681

Strings: 8,978
  xxHash32		Collisions: 0 Elapsed: 00:00:00.0014757
  xxHash64		Collisions: 0 Elapsed: 00:00:00.0015259
  GetHashCode	Collisions: 0 Elapsed: 00:00:00.0016200
  Fnv1a32Fast	Collisions: 0 Elapsed: 00:00:00.0025707
  djb2			Collisions: 0 Elapsed: 00:00:00.0036661
  Fnv1a64		Collisions: 0 Elapsed: 00:00:00.0041440
  Fnv1a32		Collisions: 0 Elapsed: 00:00:00.0047749
  Fnv1a64Fast	Collisions: 0 Elapsed: 00:00:00.0053239
  Marvin		Collisions: 0 Elapsed: 00:00:00.0056475

Strings: 5,767
  xxHash64		Collisions: 0 Elapsed: 00:00:00.0033523
  xxHash32		Collisions: 0 Elapsed: 00:00:00.0048384
  GetHashCode	Collisions: 0 Elapsed: 00:00:00.0073650
  Marvin		Collisions: 0 Elapsed: 00:00:00.0128889
  djb2			Collisions: 0 Elapsed: 00:00:00.0138964
  Fnv1a32Fast	Collisions: 0 Elapsed: 00:00:00.0140239
  Fnv1a64Fast	Collisions: 0 Elapsed: 00:00:00.0140474
  Fnv1a32		Collisions: 0 Elapsed: 00:00:00.0267951
  Fnv1a64		Collisions: 0 Elapsed: 00:00:00.0268189

Strings: 6,170
  xxHash64		Collisions: 0 Elapsed: 00:00:00.0013467
  GetHashCode	Collisions: 0 Elapsed: 00:00:00.0014064
  Marvin		Collisions: 0 Elapsed: 00:00:00.0021564
  Fnv1a32Fast	Collisions: 0 Elapsed: 00:00:00.0022922
  djb2			Collisions: 0 Elapsed: 00:00:00.0023367
  Fnv1a64Fast	Collisions: 0 Elapsed: 00:00:00.0025382
  Fnv1a64		Collisions: 0 Elapsed: 00:00:00.0041534
  xxHash32		Collisions: 0 Elapsed: 00:00:00.0041642
  Fnv1a32		Collisions: 0 Elapsed: 00:00:00.0042534

Strings: 10,577
  xxHash64		Collisions: 0 Elapsed: 00:00:00.0042240
  xxHash32		Collisions: 0 Elapsed: 00:00:00.0051933
  GetHashCode	Collisions: 0 Elapsed: 00:00:00.0070581
  Marvin		Collisions: 0 Elapsed: 00:00:00.0101968
  Fnv1a32Fast	Collisions: 0 Elapsed: 00:00:00.0130789
  Fnv1a64Fast	Collisions: 0 Elapsed: 00:00:00.0131046
  djb2			Collisions: 0 Elapsed: 00:00:00.0132978
  Fnv1a32		Collisions: 0 Elapsed: 00:00:00.0243679
  Fnv1a64		Collisions: 0 Elapsed: 00:00:00.0276124

Strings: 13,144
  xxHash32		Collisions: 0 Elapsed: 00:00:00.0069490
  GetHashCode	Collisions: 0 Elapsed: 00:00:00.0087925
  xxHash64		Collisions: 0 Elapsed: 00:00:00.0098605
  Marvin		Collisions: 0 Elapsed: 00:00:00.0134604
  Fnv1a64Fast	Collisions: 0 Elapsed: 00:00:00.0171348
  djb2			Collisions: 0 Elapsed: 00:00:00.0174824
  Fnv1a32Fast	Collisions: 0 Elapsed: 00:00:00.0176932
  Fnv1a64		Collisions: 0 Elapsed: 00:00:00.0321770
  Fnv1a32		Collisions: 0 Elapsed: 00:00:00.0352590

Strings: 22,963
  xxHash64		Collisions: 0 Elapsed: 00:00:00.0053455
  GetHashCode	Collisions: 0 Elapsed: 00:00:00.0066013
  Fnv1a64Fast	Collisions: 0 Elapsed: 00:00:00.0101986
  djb2			Collisions: 0 Elapsed: 00:00:00.0106974
  Fnv1a32Fast	Collisions: 0 Elapsed: 00:00:00.0128123
  xxHash32		Collisions: 0 Elapsed: 00:00:00.0132219
  Marvin		Collisions: 0 Elapsed: 00:00:00.0139921
  Fnv1a64		Collisions: 0 Elapsed: 00:00:00.0178608
  Fnv1a32		Collisions: 0 Elapsed: 00:00:00.0218732

Strings: 33,245
  Fnv1a32Fast	Collisions: 0 Elapsed: 00:00:00.0085631
  djb2			Collisions: 1 Elapsed: 00:00:00.0094651
  xxHash64		Collisions: 0 Elapsed: 00:00:00.0124513
  Fnv1a64Fast	Collisions: 0 Elapsed: 00:00:00.0126791
  Fnv1a32		Collisions: 0 Elapsed: 00:00:00.0141275
  GetHashCode	Collisions: 0 Elapsed: 00:00:00.0148957
  xxHash32		Collisions: 0 Elapsed: 00:00:00.0152184
  Fnv1a64		Collisions: 0 Elapsed: 00:00:00.0188528
  Marvin		Collisions: 0 Elapsed: 00:00:00.0276599

Strings: 40,004
  xxHash64		Collisions: 0 Elapsed: 00:00:00.0703521
  xxHash32		Collisions: 0 Elapsed: 00:00:00.0835046
  GetHashCode	Collisions: 0 Elapsed: 00:00:00.1443608
  Marvin		Collisions: 0 Elapsed: 00:00:00.1970208
  djb2			Collisions: 0 Elapsed: 00:00:00.2701778
  Fnv1a32Fast	Collisions: 0 Elapsed: 00:00:00.2723283
  Fnv1a64Fast	Collisions: 0 Elapsed: 00:00:00.2756372
  Fnv1a64		Collisions: 0 Elapsed: 00:00:00.5228534
  Fnv1a32		Collisions: 0 Elapsed: 00:00:00.5255990

Strings: 95,200
  xxHash64		Collisions: 0 Elapsed: 00:00:00.0591592
  Fnv1a64Fast	Collisions: 0 Elapsed: 00:00:00.1143691
  Fnv1a64		Collisions: 0 Elapsed: 00:00:00.1989501
  xxHash32		Collisions: 1 Elapsed: 00:00:00.0499008
  GetHashCode	Collisions: 1 Elapsed: 00:00:00.0622880
  Marvin		Collisions: 1 Elapsed: 00:00:00.0859473
  Fnv1a32Fast	Collisions: 2 Elapsed: 00:00:00.1122614
  djb2			Collisions: 2 Elapsed: 00:00:00.1145779
  Fnv1a32		Collisions: 2 Elapsed: 00:00:00.2019418

Strings: 220,177
  xxHash64		Collisions: 0	Elapsed: 00:00:00.1688599
  Fnv1a64Fast	Collisions: 0	Elapsed: 00:00:00.5063542
  Fnv1a64		Collisions: 0	Elapsed: 00:00:00.9296767
  xxHash32		Collisions: 1	Elapsed: 00:00:00.2017256
  GetHashCode	Collisions: 8	Elapsed: 00:00:00.2838376
  Marvin		Collisions: 6	Elapsed: 00:00:00.3951592
  Fnv1a32Fast	Collisions: 4	Elapsed: 00:00:00.4974939
  djb2			Collisions: 12	Elapsed: 00:00:00.5339619
  Fnv1a32		Collisions: 4	Elapsed: 00:00:00.9415688

Strings: 288,134
  xxHash64		Collisions: 0	Elapsed: 00:00:00.1169812
  Fnv1a64Fast	Collisions: 0	Elapsed: 00:00:00.2529665
  Fnv1a64		Collisions: 0	Elapsed: 00:00:00.3995170
  xxHash32		Collisions: 9	Elapsed: 00:00:00.1363369
  GetHashCode	Collisions: 9	Elapsed: 00:00:00.1457226
  Marvin		Collisions: 11	Elapsed: 00:00:00.1868881
  djb2			Collisions: 17	Elapsed: 00:00:00.2395169
  Fnv1a32Fast	Collisions: 9	Elapsed: 00:00:00.2443658
  Fnv1a32		Collisions: 17	Elapsed: 00:00:00.4131953

Strings: 302,685
  xxHash64		Collisions: 0 Elapsed: 00:00:00.2654320
  Fnv1a64Fast	Collisions: 0 Elapsed: 00:00:00.9521811
  Fnv1a64		Collisions: 0 Elapsed: 00:00:01.7610004
  xxHash32		Collisions: 11 Elapsed: 00:00:00.4000200
  GetHashCode	Collisions: 16 Elapsed: 00:00:00.5674189
  Marvin		Collisions: 16 Elapsed: 00:00:00.7259080
  djb2			Collisions: 9 Elapsed: 00:00:00.9613453
  Fnv1a32Fast	Collisions: 15 Elapsed: 00:00:00.9699976
  Fnv1a32		Collisions: 8 Elapsed: 00:00:01.7843502

Strings: 681,346
  xxHash64		Collisions: 0 Elapsed: 00:00:14.2564708
  Fnv1a64		Collisions: 0 Elapsed: 00:00:15.7363333
  Fnv1a64Fast	Collisions: 0 Elapsed: 00:00:17.6527663
  xxHash32		Collisions: 54 Elapsed: 00:00:13.1265490
  Fnv1a32Fast	Collisions: 56 Elapsed: 00:00:16.1640268
  djb2			Collisions: 57 Elapsed: 00:00:16.9421441
  Marvin		Collisions: 50 Elapsed: 00:00:17.4443557
  GetHashCode	Collisions: 54 Elapsed: 00:00:19.2028944
  Fnv1a32		Collisions: 67 Elapsed: 00:00:20.4274275

Strings: 746,697
  xxHash64		Collisions: 0 Elapsed: 00:00:01.0335703
  Fnv1a64Fast	Collisions: 0 Elapsed: 00:00:03.0520370
  Fnv1a64		Collisions: 0 Elapsed: 00:00:05.7027973
  xxHash32		Collisions: 61 Elapsed: 00:00:01.3509968
  Marvin		Collisions: 74 Elapsed: 00:00:02.1738263
  djb2			Collisions: 67 Elapsed: 00:00:02.8742604
  Fnv1a32Fast	Collisions: 60 Elapsed: 00:00:03.8859331
  Fnv1a32		Collisions: 71 Elapsed: 00:00:05.3893214
  GetHashCode	Collisions: 67 Elapsed: 00:00:11.3145967

Strings: 1,633,734
  xxHash64		Collisions: 0 Elapsed: 00:00:01.4111786
  Fnv1a64		Collisions: 0 Elapsed: 00:00:06.9580604
  Fnv1a64Fast	Collisions: 0 Elapsed: 00:00:13.3455648
  xxHash32		Collisions: 308 Elapsed: 00:00:01.9220260
  Marvin		Collisions: 323 Elapsed: 00:00:03.0514329
  djb2			Collisions: 289 Elapsed: 00:00:03.3404543
  Fnv1a32		Collisions: 330 Elapsed: 00:00:11.9088878
  Fnv1a32Fast	Collisions: 312 Elapsed: 00:00:15.3062299
  GetHashCode	Collisions: 324 Elapsed: 00:00:16.1924285

Strings: 2,711,714
  xxHash64		Collisions: 0 Elapsed: 00:00:07.9579064
  Fnv1a64Fast	Collisions: 0 Elapsed: 00:00:08.2236800
  Fnv1a64		Collisions: 0 Elapsed: 00:00:20.7533625
  Marvin		Collisions: 858 Elapsed: 00:00:05.8786134
  xxHash32		Collisions: 845 Elapsed: 00:00:11.5977205
  GetHashCode	Collisions: 817 Elapsed: 00:00:11.7662504
  Fnv1a32Fast	Collisions: 888 Elapsed: 00:00:13.0753472
  djb2			Collisions: 842 Elapsed: 00:00:14.4005871
  Fnv1a32		Collisions: 859 Elapsed: 00:00:15.3333468

Strings: 2,744,490
  Fnv1a64Fast	Collisions: 0 Elapsed: 00:00:11.4512528
  xxHash64		Collisions: 0 Elapsed: 00:00:12.4725302
  Fnv1a64		Collisions: 0 Elapsed: 00:00:20.6213482
  Marvin		Collisions: 848 Elapsed: 00:00:10.1477506
  xxHash32		Collisions: 913 Elapsed: 00:00:11.7118423
  GetHashCode	Collisions: 880 Elapsed: 00:00:12.6015684
  Fnv1a32Fast	Collisions: 836 Elapsed: 00:00:12.9828404
  Fnv1a32		Collisions: 920 Elapsed: 00:00:14.3969376
  djb2			Collisions: 943 Elapsed: 00:00:14.6716832
Clone this wiki locally