base: add a new FlatHashMap based on Swiss Table concepts #4205
Conversation
Note: this CL does *not* change any production code to actually use this. All it's doing is adding the implementation, benchmarks and tests. A future CL will actually do the switch.

Our current FlatHashMap implementation has served us well. It was originally designed just to replace std::unordered_map inside TraceProcessor's StringPool, but due to its nice perf characteristics it's come to be used throughout the Perfetto codebase. Over time, though, I've realised that it has some flaws which make it not suitable as a truly general-purpose hash table:

* Very sensitive to tombstones: we've made some modifications to rehash if there are tombstones, but in heavy insert/erase workloads that's not enough.
* Very sensitive to collisions: if there is a collision, the default is quadratic probing, which can cause some quite sub-optimal memory access patterns. And using linear probing doesn't work for high-load-factor hash maps.
* Suboptimal memory layout: putting the ctrl/keys/values in different arrays means that we're jumping around memory instead of being in a single place, which is especially important for small maps.

All in all, these are things that absl::flat_hash_map and Swiss tables have solved. There's no reason for us not to take advantage of at least their memory layout and organisation ideas even if we cannot use them directly. If they're good enough for much of the C++ world (and definitely all of Google) they should be good enough for us too!

So this CL adds a new FlatHashMapV2 which is based on Swiss Table concepts. Specifically:

* The idea of storing the control bytes, keys and values all in the same array.
* The logic to probe using triangular numbers and do group-based probing.
* Using SIMD to access 8/16 elements at a time and do comparisons of hashes.
* The idea of storing 7 bits of the hash in the control byte to do fast checks and reduce the probability of a false match to 1/128 before even checking keys.
* Using SWAR for non-SIMD platforms (including ARM64, though we might optimize this with NEON in the future depending on what we see).

(A minimal sketch of these ideas follows the benchmark results below.)

The benchmarks are quite good for general cases.

For TraceStrings:

BM_HashMap_InsertTraceStrings_AppendOnly                    554254588 ns   553651249 ns  1  409.798M/s  226.885M  1.17868M
BM_HashMap_InsertTraceStrings<Ours_PreHashed>               698876584 ns   698092350 ns  1  325.008M/s  226.885M  1.17868M
BM_HashMap_InsertTraceStrings<OursV2_PreHashed>             662225759 ns   661307563 ns  1  343.086M/s  226.885M  1.17868M
BM_HashMap_InsertTraceStrings<StdUnorderedMap_PreHashed>   2376921965 ns  2374903076 ns  1  95.5346M/s  226.885M  1.17868M
BM_HashMap_InsertTraceStrings<AbslFlatHashMap_PreHashed>    631832647 ns   631347455 ns  1  359.367M/s  226.885M  1.17868M

This is the only case where the old implementation wins (which makes sense because it was specifically designed for it!). But I don't think a loss here justifies the major wins elsewhere (especially because in the "normal" case, cache misses would drown out the small differences here anyway, given how widespread interning is in practice). It also only wins with AppendOnly, which is rarely used.
For TraceTids:

BM_HashMap_TraceTids<Ours_Tid>               741954852 ns   741389092 ns  1  183.063M/s  16.099k
BM_HashMap_TraceTids<OursV2_Tid>             504387549 ns   503475605 ns  1  269.568M/s  16.099k
BM_HashMap_TraceTids<StdUnorderedMap_Tid>   1401707453 ns  1400083835 ns  1  96.9378M/s  16.099k
BM_HashMap_TraceTids<AbslFlatHashMap_Tid>    510576964 ns   510105976 ns  1  266.064M/s  16.099k

We're significantly better than V1 here: tids are exactly the sort of "general behaviour" which is very common to see in Perfetto and where V1 behaves less well than I'd like.

For LookupRandInts:

BM_HashMap_LookupRandInts<Ours_Default>                574181027 ns  573859405 ns  1  17.4259M/s  10.7378P
BM_HashMap_LookupRandInts<OursV2_Default>              350956969 ns  350711465 ns  2  28.5135M/s  10.7378P
BM_HashMap_LookupRandInts<StdUnorderedMap_Default>     571851786 ns  571499561 ns  1  17.4978M/s  10.7378P
BM_HashMap_LookupRandInts<AbslFlatHashMap_Default>     366032708 ns  365819315 ns  2  27.3359M/s  10.7378P
BM_HashMap_LookupRandInts<StdUnorderedMap_Murmur>      611111313 ns  610638342 ns  1  16.3763M/s  10.7378P
BM_HashMap_LookupRandInts<AbslFlatHashMap_Murmur>      363292801 ns  362867428 ns  2  27.5583M/s  10.7378P
BM_HashMap_LookupRandInts<Ours_PreHashed>              713060262 ns  712048102 ns  1  14.044M/s   10.7378P
BM_HashMap_LookupRandInts<OursV2_PreHashed>            329546786 ns  329117625 ns  2  30.3843M/s  10.7378P
BM_HashMap_LookupRandInts<StdUnorderedMap_PreHashed>   580294840 ns  579547604 ns  1  17.2548M/s  10.7378P
BM_HashMap_LookupRandInts<AbslFlatHashMap_PreHashed>   340158501 ns  339918114 ns  2  29.4188M/s  10.7378P

Another handy win, which appears regardless of whether the hash function is good or not. Notice how in the V1 case pre-hashing causes performance to degrade, while V2 actually gets *faster*.

For InsertCollidingInts:

BM_HashMap_InsertCollidingInts<Ours_Default>             701989492 ns  701578722 ns  1  14.2536M/s
BM_HashMap_InsertCollidingInts<OursV2_Default>           427725299 ns  427302016 ns  2  23.4027M/s
BM_HashMap_InsertCollidingInts<StdUnorderedMap_Default>  896047311 ns  895384570 ns  1  11.1684M/s
BM_HashMap_InsertCollidingInts<AbslFlatHashMap_Default>  420245635 ns  420006692 ns  2  23.8091M/s

Another big win for V2, and another example similar to what we'd see in practice in Perfetto.

For InsertDupeInts:

BM_HashMap_InsertDupeInts<Ours_Default>             63926278 ns  63874785 ns  11  156.556M/s
BM_HashMap_InsertDupeInts<OursV2_Default>           40291231 ns  40268380 ns  17  248.334M/s
BM_HashMap_InsertDupeInts<StdUnorderedMap_Default>  37554643 ns  37527602 ns  19  266.471M/s
BM_HashMap_InsertDupeInts<AbslFlatHashMap_Default>  29026714 ns  29008928 ns  24  344.721M/s
BM_HashMap_InsertDupeInts<AbslFlatHashMap_Murmur>   38883999 ns  38857011 ns  18  257.354M/s

Same as above, another common case. Surprisingly, std does very well here. And we seem to be bottlenecked on the hash function, given how much better absl does with its own hash vs murmur (its hash is slightly lower quality but also faster). This is the only benchmark where this happens though; others don't show this behaviour.
For *String:

BM_HashMap_HeterogeneousLookup_String<Ours_String>               575812136 ns   575383946 ns  1  17.3797M/s
BM_HashMap_HeterogeneousLookup_String<OursV2_String>             420531571 ns   420120838 ns  2  23.8027M/s
BM_HashMap_RegularLookup_String<Ours_String>                     820167931 ns   819494830 ns  1  12.2026M/s
BM_HashMap_RegularLookup_String<OursV2_String>                   405975726 ns   405482422 ns  2  24.662M/s
BM_HashMap_RegularLookup_String<StdUnorderedMap_String>         1577833963 ns  1576585547 ns  1  6.34282M/s
BM_HashMap_RegularLookup_String<AbslFlatHashMap_String>          491132758 ns   490802459 ns  2  20.3748M/s
BM_HashMap_RegularLookup_String<AbslFlatHashMap_String_Murmur>   571520036 ns   571134272 ns  1  17.509M/s

All clear wins for us. We're actually doing much better than even absl here!

InsertVaryingSizes:

BM_HashMap_InsertVaryingSize<Ours_Default>/100                       754 ns       754 ns  932167  132.677M/s
BM_HashMap_InsertVaryingSize<Ours_Default>/10000                  200216 ns    200096 ns    3531  49.9759M/s
BM_HashMap_InsertVaryingSize<Ours_Default>/1000000              54744741 ns  54674135 ns      13  18.2902M/s
BM_HashMap_InsertVaryingSize<OursV2_Default>/100                    1381 ns      1380 ns  505094  72.4606M/s
BM_HashMap_InsertVaryingSize<OursV2_Default>/10000                164971 ns    164851 ns    4250  60.6608M/s
BM_HashMap_InsertVaryingSize<OursV2_Default>/1000000            38351821 ns  38310617 ns      18  26.1024M/s
BM_HashMap_InsertVaryingSize<AbslFlatHashMap_Default>/100           1268 ns      1267 ns  551576  78.9371M/s
BM_HashMap_InsertVaryingSize<AbslFlatHashMap_Default>/10000       137933 ns    137845 ns    5065  72.5454M/s
BM_HashMap_InsertVaryingSize<AbslFlatHashMap_Default>/1000000   39980643 ns  39943874 ns      18  25.0351M/s

Note how V1 does better for very small maps but drops off as maps get bigger. I think this is probably the effect of everything fitting in cache even with the separate key/value/ctrl arrays, plus SIMD not being as useful with such small maps.

LookupWithMisses:

BM_HashMap_LookupWithMisses<Ours_Default>/0               359571712 ns  359090314 ns  2  27.8481M/s
BM_HashMap_LookupWithMisses<Ours_Default>/50              426399559 ns  425828348 ns  2  23.4836M/s
BM_HashMap_LookupWithMisses<Ours_Default>/100             308777908 ns  308628319 ns  2  32.4014M/s
BM_HashMap_LookupWithMisses<OursV2_Default>/0             195633544 ns  195481786 ns  4  51.1557M/s
BM_HashMap_LookupWithMisses<OursV2_Default>/50            221380263 ns  221180357 ns  3  45.212M/s
BM_HashMap_LookupWithMisses<OursV2_Default>/100           139350634 ns  139236152 ns  5  71.8204M/s
BM_HashMap_LookupWithMisses<AbslFlatHashMap_Default>/0    201567396 ns  201412429 ns  4  49.6494M/s
BM_HashMap_LookupWithMisses<AbslFlatHashMap_Default>/50   206378283 ns  206259753 ns  3  48.4826M/s
BM_HashMap_LookupWithMisses<AbslFlatHashMap_Default>/100  123735273 ns  123601106 ns  6  80.9054M/s

Notice how much, much better we do when the miss rate is very low or very high. This is indicative of the branch predictor doing a much better job of dealing with our use cases. But OTOH even at 50% we are far superior to V1.

InsertSequential:

BM_HashMap_InsertSequentialInts<Ours_Default>             781732694 ns  781402467 ns  1  12.7975M/s
BM_HashMap_InsertSequentialInts<OursV2_Default>           423644733 ns  423098276 ns  2  23.6352M/s
BM_HashMap_InsertSequentialInts<StdUnorderedMap_Default>  473308044 ns  472810618 ns  2  21.1501M/s
BM_HashMap_InsertSequentialInts<AbslFlatHashMap_Default>  420184095 ns  419919524 ns  2  23.8141M/s

Huge win for what is quite a common case in trace processor. Even std does better than our V1 here, which is quite surprising!
LookupSequential:

BM_HashMap_LookupSequentialInts<Ours_Default>             586390209 ns  585999772 ns   1  17.0649M/s
BM_HashMap_LookupSequentialInts<OursV2_Default>           344488692 ns  344312148 ns   2  29.0434M/s
BM_HashMap_LookupSequentialInts<StdUnorderedMap_Default>   49338334 ns   49299886 ns  15  202.84M/s
BM_HashMap_LookupSequentialInts<AbslFlatHashMap_Default>  230591313 ns  230421841 ns   3  43.3987M/s

Std destroys everyone, which I find highly surprising and don't have a good explanation for. Maybe it's because it basically has a perfect hash function (since by default std::hash just returns the int?). Regardless, we're doing better than V1.
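As promised above, here is a minimal, self-contained sketch of the Swiss-table ideas this CL adopts: the H1/H2 hash split, SWAR group matching, and triangular-number probing. All names and constants here (kGroupWidth, H1, H2, MatchH2, ProbeSeq) are illustrative placeholders, not the identifiers actually used in the patch.

#include <cstddef>
#include <cstdint>
#include <cstring>

constexpr size_t kGroupWidth = 8;  // One SWAR group = 8 control bytes.

// H1 picks the starting group; H2 is the 7 bits stored in the control byte.
// A full slot has control byte 0b0xxxxxxx, so an H2 match is a false positive
// only ~1/128 of the time before keys are even compared.
inline size_t H1(size_t hash) { return hash >> 7; }
inline uint8_t H2(size_t hash) { return static_cast<uint8_t>(hash & 0x7f); }

// SWAR match: bit 7 of each byte in the result is set where ctrl byte == h2.
// Like the portable (non-SIMD) fallback in Abseil, this can report rare false
// positives; they only cost one extra key comparison, so correctness holds.
inline uint64_t MatchH2(const uint8_t* group, uint8_t h2) {
  uint64_t ctrl;
  std::memcpy(&ctrl, group, sizeof(ctrl));
  constexpr uint64_t kLsbs = 0x0101010101010101ull;
  constexpr uint64_t kMsbs = 0x8080808080808080ull;
  uint64_t x = ctrl ^ (kLsbs * h2);  // Bytes equal to h2 become 0x00.
  return (x - kLsbs) & ~x & kMsbs;   // Classic SWAR zero-byte detection.
}

// Triangular-number probing over groups: visit group offsets 0, 1, 3, 6, 10,
// ... from the home group. With a power-of-two group count this sequence
// touches every group exactly once.
struct ProbeSeq {
  size_t mask;    // num_groups - 1 (num_groups must be a power of two).
  size_t offset;  // Current group index.
  size_t step = 0;
  ProbeSeq(size_t hash, size_t num_groups)
      : mask(num_groups - 1), offset(H1(hash) & mask) {}
  void Next() {
    ++step;
    offset = (offset + step) & mask;  // +1, +2, +3, ... => triangular numbers.
  }
};

A lookup in this scheme walks ProbeSeq group by group, calls MatchH2 on each group's control bytes, and compares actual keys only for the set bits in the returned mask, stopping when the group contains a free slot.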
🎨 Perfetto UI Build: ✅ UI build is ready: https://storage.googleapis.com/perfetto-ci-artifacts/gh-20718088702-1-ui/ui/index.html
This reverts commit 44c8b9a.
primiano left a comment
LG with various comments
| size_t key_hash = Hasher{}(key);
| uint8_t h2 = H2(key_hash);
| FindResult res = FindInternal<true>(key, key_hash, h2);
| if (PERFETTO_UNLIKELY(res.needs_insert)) {
I'd drop the UNLIKELY here. If anything, this is the likely case, but I'd just leave it without any hint.
Also a bit more readable if you first get out of the way the unlikely case, which is the opposite:

if (PERFETTO_UNLIKELY(!res.needs_insert)) {
  return {&slots_[res.idx].value, false};
}
// here you have the actual insertion logic
| Iterator GetIterator() { return Iterator(this); }
| const Iterator GetIterator() const { return Iterator(this); }
| void Clear() {
|   if (PERFETTO_UNLIKELY(capacity_ == 0)) {
I'd retain the original comment we had in v1 as it's quite useful:
// Avoid trivial heap operations on zero-capacity std::move()-d objects.
| // Non-templated base class to hold helpers for FlatHashMap.
| struct FlatHashMapBase {
| class FlatHashMapV2Base {
What's the benefit of this class? There is no .cc file, so no type erasure is needed. It feels like this holds only some constants that could just be put in the internal namespace.
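For illustration, the alternative being suggested would look roughly like this; the constant names match the quoted diff, but the values are placeholders and not necessarily what the CL uses:

#include <cstddef>
#include <cstdint>

// Sketch only: constants in an internal namespace instead of a
// non-templated base class. Values are illustrative.
namespace perfetto::base::internal {
inline constexpr uint8_t kFreeSlot = 0x80;   // Example sentinel value.
inline constexpr uint8_t kTombstone = 0xFE;  // Example sentinel value.
inline constexpr size_t kNumClones = 15;     // Example: group width - 1.
}  // namespace perfetto::base::internal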
| for (; idx_ < map_->capacity_; idx_++) {
|   if (tags[idx_] != kFreeSlot && (AppendOnly || tags[idx_] != kTombstone))
| for (; ctrl_ != ctrl_end_; ++ctrl_, ++slots_) {
|   if (*ctrl_ != kFreeSlot && *ctrl_ != kTombstone)
I'm never sure how much the compiler can optimize around pointer aliasing, especially when using uint8_t (which has special rules).
Imho a bit better if you do:

const uint8_t cur_ctl = *ctrl_;
if (cur_ctl != kFreeSlot && cur_ctl != kTombstone)
  return;
| size_t idx_ = 0;
| const uint8_t* ctrl_ = nullptr;
| const uint8_t* ctrl_end_ = nullptr;
| Slot* slots_ = nullptr;
s/slots_/slot_/: not sure why it should be plural; I found it a bit confusing as I was expecting (as it is) just a pointer to the current slot.
| // Iterator for sparse 64-bit mask
| struct Iterator {
|  public:
|   PERFETTO_ALWAYS_INLINE Iterator(uint64_t mask) : mask_(mask) {}
+explicit
| keys_ = AlignedAllocTyped<Key[]>(n);  // Deliberately not 0-initialized.
| values_ =
|     AlignedAllocTyped<Value[]>(n);  // Deliberately not 0-initialized.
| size_t slots_offset =
Add some comment here to explain the memory layout. Initially I thought there was a bug because storage_ itself is NOT aligned, but there isn't.
The deal here is that the beginning of storage_ doesn't need to be aligned because the control bytes can be accessed at arbitrary indexes, but the slots must be aligned. So the only alignment that matters is in the middle, where the slots start.
Without this explanation, this 100% looks like an alignment bug masked by the fact that the allocator always gives you 16-byte alignment even if you don't ask for it.
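A hedged sketch of the layout computation being described here, using placeholder names (Slot, kNumClones, SlotsOffset, StorageBytes) rather than the CL's actual identifiers; the point is that only the start of the slot region needs rounding up to alignof(Slot):

#include <cstddef>
#include <cstdint>

struct Slot { uint64_t key; uint64_t value; };  // Placeholder slot type.
constexpr size_t kNumClones = 15;               // Mirrored ctrl bytes.

// Buffer layout: [ctrl: capacity bytes][clones: kNumClones bytes][pad][slots].
// The buffer start needs no particular alignment because control bytes are
// read at arbitrary byte offsets; only the slot region must be aligned.
inline size_t SlotsOffset(size_t capacity) {
  size_t ctrl_bytes = capacity + kNumClones;
  return (ctrl_bytes + alignof(Slot) - 1) & ~(alignof(Slot) - 1);
}

inline size_t StorageBytes(size_t capacity) {
  return SlotsOffset(capacity) + capacity * sizeof(Slot);
}

// Usage: given a uint8_t* storage of StorageBytes(capacity) bytes,
//   uint8_t* ctrl  = storage;
//   Slot*    slots = reinterpret_cast<Slot*>(storage + SlotsOffset(capacity));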
| PERFETTO_ALWAYS_INLINE void SetCtrl(size_t i, uint8_t h) {
|   ctrl_[i] = h;
|   // Update clone if this is one of the first kNumClones entries
|   if (i < kNumClones) {
PERFETTO_UNLIKELY
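For context on why the mirroring (and the suggested PERFETTO_UNLIKELY) makes sense: a group load that starts near the end of the control array reads past the last real control byte, and the cloned bytes make that read equivalent to wrapping around to the start. A self-contained sketch with illustrative names, not the CL's implementation:

#include <cstddef>
#include <cstdint>
#include <vector>

struct CtrlArray {
  static constexpr size_t kNumClones = 15;  // Group width - 1, illustrative.
  size_t capacity;                          // Power of two in practice.
  std::vector<uint8_t> ctrl;                // capacity + kNumClones bytes.

  explicit CtrlArray(size_t cap) : capacity(cap), ctrl(cap + kNumClones) {}

  // Mirroring ctrl[0..kNumClones) at ctrl[capacity..) lets a group load of
  // kNumClones + 1 bytes start at *any* index < capacity without wrapping.
  // Only the first kNumClones indices hit the mirror, hence the cold branch.
  void SetCtrl(size_t i, uint8_t h) {
    ctrl[i] = h;
    if (i < kNumClones)  // Rare: candidate for PERFETTO_UNLIKELY.
      ctrl[capacity + i] = h;
  }
};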
| AlignedUniquePtr<Value[]> values_;
|
| // Slots remaining + has_deleted flag
| GrowthInfo growth_info_{0, 0};
Out of curiosity, is putting the "has tombstones" flag in a bit of the same word something that ABSL does?
It feels a bit odd, especially considering that it requires a bitmask on every access.
(But if they did, I guess they had good reasons for it.)
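For reference, the packed representation being questioned looks roughly like this. It is a sketch of the general idea only, not the exact encoding used in this CL or in Abseil:

#include <cstddef>
#include <cstdint>

// Sketch: the top bit flags "table has tombstones", the remaining bits count
// how many inserts are allowed before a rehash/resize. Encoding illustrative.
class GrowthInfo {
 public:
  GrowthInfo(size_t growth_left, bool has_tombstones)
      : rep_(growth_left | (has_tombstones ? kTombstoneBit : 0)) {}

  size_t GrowthLeft() const { return rep_ & ~kTombstoneBit; }
  bool HasTombstones() const { return (rep_ & kTombstoneBit) != 0; }

  // Decrementing is safe while GrowthLeft() > 0, so the flag bit survives.
  void OnInsert() { --rep_; }
  void OnErase() { rep_ |= kTombstoneBit; }

 private:
  static constexpr size_t kTombstoneBit = size_t{1}
                                          << (sizeof(size_t) * 8 - 1);
  size_t rep_;
};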
|   }
| }
|
| // Owns the actual memory: [ctrl[capacity]][clones[15]][slots[capacity]]
Imho better if you say:

// Memory layout:
//
// [Control bytes]
//   |capacity| bytes for control bytes.
//   15 bytes for control byte clones (*) (add a comment in appendix)
//   No alignment required (accessed at random byte offsets).
//
// [Padding for Slot alignment]
//
// [Slots]
//   capacity * sizeof(Slot) bytes: contains key-value pairs.
//   Must be aligned to alignof(Slot).