base: add a new FlatHashMap based on Swiss Table concepts #4205
Conversation
Note: this CL does *not* change any production code to actually use this. All it's doing is adding the implementation, benchmarks and tests. A future CL will actually do the switch.

Our current FlatHashMap implementation has served us well. It was originally designed just to replace std::unordered_map inside TraceProcessor's StringPool, but due to its nice perf characteristics it's come to be used throughout the Perfetto codebase. Over time, though, I've realised that it has some flaws which make it not suitable as a truly general-purpose hash table:

* Very sensitive to tombstones: we've made some modifications to rehash if there are tombstones, but in heavy insert/erase workloads that's not enough.
* Very sensitive to collisions: if there is a collision, the default is quadratic probing, which can cause some quite sub-optimal memory access patterns. And using linear probing doesn't work for high-load-factor hash maps.
* Suboptimal memory layout: putting the ctrl/keys/values in different arrays means that we're jumping around memory instead of being in a single place, which is especially important for small maps.

All in all, these are things that absl::flat_hash_map and Swiss tables have solved. There's no reason for us not to take advantage of at least their memory layout and organisation ideas even if we cannot use them directly. If they're good enough for much of the C++ world (and definitely all of Google) they should be good enough for us too!

So this CL adds a new FlatHashMapV2 which is based on Swiss Table concepts. Specifically:

* The idea of storing the control bytes, keys and values all in the same array.
* The logic to probe using triangular numbers and do group-based probing.
* Using SIMD to access 8/16 elements at a time and do comparisons of hashes.
* The idea of storing 7 bits of the hash in the control byte to do fast checks and reduce the probability of a false match to 1/128 before even checking keys.
* Using SWAR for non-SIMD platforms (including ARM64, though we might optimize this with NEON in the future depending on what we see).

(A minimal sketch of these ideas follows the benchmark results below.)

The benchmarks are quite good for general cases.

For TraceStrings:

BM_HashMap_InsertTraceStrings_AppendOnly                    554254588 ns   553651249 ns  1  409.798M/s  226.885M  1.17868M
BM_HashMap_InsertTraceStrings<Ours_PreHashed>               698876584 ns   698092350 ns  1  325.008M/s  226.885M  1.17868M
BM_HashMap_InsertTraceStrings<OursV2_PreHashed>             662225759 ns   661307563 ns  1  343.086M/s  226.885M  1.17868M
BM_HashMap_InsertTraceStrings<StdUnorderedMap_PreHashed>   2376921965 ns  2374903076 ns  1  95.5346M/s  226.885M  1.17868M
BM_HashMap_InsertTraceStrings<AbslFlatHashMap_PreHashed>    631832647 ns   631347455 ns  1  359.367M/s  226.885M  1.17868M

This is the only case where the old implementation wins (which makes sense because it was specifically designed for it!). But I don't think a loss here justifies the major wins elsewhere (especially because in the "normal" case, cache misses would drown out the small differences here anyway, given how widespread interning is in practice). It also only wins with AppendOnly, which is rarely used.
For TraceTids:

BM_HashMap_TraceTids<Ours_Tid>               741954852 ns   741389092 ns  1  183.063M/s  16.099k
BM_HashMap_TraceTids<OursV2_Tid>             504387549 ns   503475605 ns  1  269.568M/s  16.099k
BM_HashMap_TraceTids<StdUnorderedMap_Tid>   1401707453 ns  1400083835 ns  1  96.9378M/s  16.099k
BM_HashMap_TraceTids<AbslFlatHashMap_Tid>    510576964 ns   510105976 ns  1  266.064M/s  16.099k

We're significantly better than V1 here: tids are exactly the sort of "general behaviour" which is very common to see in Perfetto and where V1 behaves less well than I'd like.

For LookupRandInts:

BM_HashMap_LookupRandInts<Ours_Default>                574181027 ns  573859405 ns  1  17.4259M/s  10.7378P
BM_HashMap_LookupRandInts<OursV2_Default>              350956969 ns  350711465 ns  2  28.5135M/s  10.7378P
BM_HashMap_LookupRandInts<StdUnorderedMap_Default>     571851786 ns  571499561 ns  1  17.4978M/s  10.7378P
BM_HashMap_LookupRandInts<AbslFlatHashMap_Default>     366032708 ns  365819315 ns  2  27.3359M/s  10.7378P
BM_HashMap_LookupRandInts<StdUnorderedMap_Murmur>      611111313 ns  610638342 ns  1  16.3763M/s  10.7378P
BM_HashMap_LookupRandInts<AbslFlatHashMap_Murmur>      363292801 ns  362867428 ns  2  27.5583M/s  10.7378P
BM_HashMap_LookupRandInts<Ours_PreHashed>              713060262 ns  712048102 ns  1  14.044M/s   10.7378P
BM_HashMap_LookupRandInts<OursV2_PreHashed>            329546786 ns  329117625 ns  2  30.3843M/s  10.7378P
BM_HashMap_LookupRandInts<StdUnorderedMap_PreHashed>   580294840 ns  579547604 ns  1  17.2548M/s  10.7378P
BM_HashMap_LookupRandInts<AbslFlatHashMap_PreHashed>   340158501 ns  339918114 ns  2  29.4188M/s  10.7378P

Another handy win, which appears regardless of whether the hash function is good or not. Notice how in the V1 case pre-hashing causes performance to degrade, while V2 actually gets *faster*.

For InsertCollidingInts:

BM_HashMap_InsertCollidingInts<Ours_Default>             701989492 ns  701578722 ns  1  14.2536M/s
BM_HashMap_InsertCollidingInts<OursV2_Default>           427725299 ns  427302016 ns  2  23.4027M/s
BM_HashMap_InsertCollidingInts<StdUnorderedMap_Default>  896047311 ns  895384570 ns  1  11.1684M/s
BM_HashMap_InsertCollidingInts<AbslFlatHashMap_Default>  420245635 ns  420006692 ns  2  23.8091M/s

Another big win for V2, and another example similar to what we'd see in practice in Perfetto.

For InsertDupeInts:

BM_HashMap_InsertDupeInts<Ours_Default>             63926278 ns  63874785 ns  11  156.556M/s
BM_HashMap_InsertDupeInts<OursV2_Default>           40291231 ns  40268380 ns  17  248.334M/s
BM_HashMap_InsertDupeInts<StdUnorderedMap_Default>  37554643 ns  37527602 ns  19  266.471M/s
BM_HashMap_InsertDupeInts<AbslFlatHashMap_Default>  29026714 ns  29008928 ns  24  344.721M/s
BM_HashMap_InsertDupeInts<AbslFlatHashMap_Murmur>   38883999 ns  38857011 ns  18  257.354M/s

Same as above, another common case. Surprisingly, std does very well here. And we seem to be bottlenecked on the hash function, given how much better absl does with its own hash vs murmur (its hash is slightly lower quality but also faster). This is the only benchmark where this happens though; others don't show this behaviour.
For *String:

BM_HashMap_HeterogeneousLookup_String<Ours_String>               575812136 ns   575383946 ns  1  17.3797M/s
BM_HashMap_HeterogeneousLookup_String<OursV2_String>             420531571 ns   420120838 ns  2  23.8027M/s
BM_HashMap_RegularLookup_String<Ours_String>                     820167931 ns   819494830 ns  1  12.2026M/s
BM_HashMap_RegularLookup_String<OursV2_String>                   405975726 ns   405482422 ns  2  24.662M/s
BM_HashMap_RegularLookup_String<StdUnorderedMap_String>         1577833963 ns  1576585547 ns  1  6.34282M/s
BM_HashMap_RegularLookup_String<AbslFlatHashMap_String>          491132758 ns   490802459 ns  2  20.3748M/s
BM_HashMap_RegularLookup_String<AbslFlatHashMap_String_Murmur>   571520036 ns   571134272 ns  1  17.509M/s

All clear wins for us. We're actually doing much better than even absl here!

InsertVaryingSizes:

BM_HashMap_InsertVaryingSize<Ours_Default>/100                       754 ns       754 ns  932167  132.677M/s
BM_HashMap_InsertVaryingSize<Ours_Default>/10000                  200216 ns    200096 ns    3531  49.9759M/s
BM_HashMap_InsertVaryingSize<Ours_Default>/1000000              54744741 ns  54674135 ns      13  18.2902M/s
BM_HashMap_InsertVaryingSize<OursV2_Default>/100                    1381 ns      1380 ns  505094  72.4606M/s
BM_HashMap_InsertVaryingSize<OursV2_Default>/10000                164971 ns    164851 ns    4250  60.6608M/s
BM_HashMap_InsertVaryingSize<OursV2_Default>/1000000            38351821 ns  38310617 ns      18  26.1024M/s
BM_HashMap_InsertVaryingSize<AbslFlatHashMap_Default>/100           1268 ns      1267 ns  551576  78.9371M/s
BM_HashMap_InsertVaryingSize<AbslFlatHashMap_Default>/10000       137933 ns    137845 ns    5065  72.5454M/s
BM_HashMap_InsertVaryingSize<AbslFlatHashMap_Default>/1000000   39980643 ns  39943874 ns      18  25.0351M/s

Note how V1 does better for very small maps but drops off as maps get bigger. I think this is probably the effect of everything fitting in cache even with the separate key/value/ctrl arrays, plus SIMD not being as useful with such small maps.

LookupWithMisses:

BM_HashMap_LookupWithMisses<Ours_Default>/0               359571712 ns  359090314 ns  2  27.8481M/s
BM_HashMap_LookupWithMisses<Ours_Default>/50              426399559 ns  425828348 ns  2  23.4836M/s
BM_HashMap_LookupWithMisses<Ours_Default>/100             308777908 ns  308628319 ns  2  32.4014M/s
BM_HashMap_LookupWithMisses<OursV2_Default>/0             195633544 ns  195481786 ns  4  51.1557M/s
BM_HashMap_LookupWithMisses<OursV2_Default>/50            221380263 ns  221180357 ns  3  45.212M/s
BM_HashMap_LookupWithMisses<OursV2_Default>/100           139350634 ns  139236152 ns  5  71.8204M/s
BM_HashMap_LookupWithMisses<AbslFlatHashMap_Default>/0    201567396 ns  201412429 ns  4  49.6494M/s
BM_HashMap_LookupWithMisses<AbslFlatHashMap_Default>/50   206378283 ns  206259753 ns  3  48.4826M/s
BM_HashMap_LookupWithMisses<AbslFlatHashMap_Default>/100  123735273 ns  123601106 ns  6  80.9054M/s

Notice how much, much better we do when the miss rate is very low or very high. This is indicative of the branch predictor doing a much better job of dealing with our use cases. But OTOH even at 50% we are far superior to V1.

InsertSequential:

BM_HashMap_InsertSequentialInts<Ours_Default>             781732694 ns  781402467 ns  1  12.7975M/s
BM_HashMap_InsertSequentialInts<OursV2_Default>           423644733 ns  423098276 ns  2  23.6352M/s
BM_HashMap_InsertSequentialInts<StdUnorderedMap_Default>  473308044 ns  472810618 ns  2  21.1501M/s
BM_HashMap_InsertSequentialInts<AbslFlatHashMap_Default>  420184095 ns  419919524 ns  2  23.8141M/s

Huge win for what is quite a common case in trace processor. Even std does better than our V1 here, which is quite surprising!
LookupSequential:

BM_HashMap_LookupSequentialInts<Ours_Default>             586390209 ns  585999772 ns   1  17.0649M/s
BM_HashMap_LookupSequentialInts<OursV2_Default>           344488692 ns  344312148 ns   2  29.0434M/s
BM_HashMap_LookupSequentialInts<StdUnorderedMap_Default>   49338334 ns   49299886 ns  15  202.84M/s
BM_HashMap_LookupSequentialInts<AbslFlatHashMap_Default>  230591313 ns  230421841 ns   3  43.3987M/s

Std destroys everyone, which I find highly surprising and don't have a good explanation for. Maybe it's because it basically has a perfect hash function (since by default std::hash just returns the int?). Regardless, we're doing better than V1.
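As promised above, here is a minimal, self-contained sketch of the Swiss-table ideas this CL adopts: the H1/H2 hash split, SWAR group matching, and triangular-number probing. All names and constants here (kGroupWidth, H1, H2, MatchH2, ProbeSeq) are illustrative placeholders, not the identifiers actually used in the patch.

#include <cstddef>
#include <cstdint>
#include <cstring>

constexpr size_t kGroupWidth = 8;  // One SWAR group = 8 control bytes.

// H1 picks the starting group; H2 is the 7 bits stored in the control byte.
// A full slot has control byte 0b0xxxxxxx, so an H2 match is a false positive
// only ~1/128 of the time before keys are even compared.
inline size_t H1(size_t hash) { return hash >> 7; }
inline uint8_t H2(size_t hash) { return static_cast<uint8_t>(hash & 0x7f); }

// SWAR match: bit 7 of each byte in the result is set where ctrl byte == h2.
// Like the portable (non-SIMD) fallback in Abseil, this can report rare false
// positives; they only cost one extra key comparison, so correctness holds.
inline uint64_t MatchH2(const uint8_t* group, uint8_t h2) {
  uint64_t ctrl;
  std::memcpy(&ctrl, group, sizeof(ctrl));
  constexpr uint64_t kLsbs = 0x0101010101010101ull;
  constexpr uint64_t kMsbs = 0x8080808080808080ull;
  uint64_t x = ctrl ^ (kLsbs * h2);  // Bytes equal to h2 become 0x00.
  return (x - kLsbs) & ~x & kMsbs;   // Classic SWAR zero-byte detection.
}

// Triangular-number probing over groups: visit group offsets 0, 1, 3, 6, 10,
// ... from the home group. With a power-of-two group count this sequence
// touches every group exactly once.
struct ProbeSeq {
  size_t mask;    // num_groups - 1 (num_groups must be a power of two).
  size_t offset;  // Current group index.
  size_t step = 0;
  ProbeSeq(size_t hash, size_t num_groups)
      : mask(num_groups - 1), offset(H1(hash) & mask) {}
  void Next() {
    ++step;
    offset = (offset + step) & mask;  // +1, +2, +3, ... => triangular numbers.
  }
};

A lookup in this scheme walks ProbeSeq group by group, calls MatchH2 on each group's control bytes, and compares actual keys only for the set bits in the returned mask, stopping when the group contains a free slot.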
🎨 Perfetto UI Build: ✅ UI build is ready: https://storage.googleapis.com/perfetto-ci-artifacts/gh-20718088702-1-ui/ui/index.html
This reverts commit 44c8b9a.
primiano left a comment
LG with various comments
| size_t key_hash = Hasher{}(key);
| uint8_t h2 = H2(key_hash);
| FindResult res = FindInternal<true>(key, key_hash, h2);
| if (PERFETTO_UNLIKELY(res.needs_insert)) {
I'd drop the UNLIKELY here. If anything, this is the likely case, but I'd just leave it without any hint.
Also a bit more readable if you first get out of the way the unlikely case, which is the opposite:

if (PERFETTO_UNLIKELY(!res.needs_insert)) {
  return {&slots_[res.idx].value, false};
}
// here you have the actual insertion logic
| Iterator GetIterator() { return Iterator(this); }
| const Iterator GetIterator() const { return Iterator(this); }
| void Clear() {
|   if (PERFETTO_UNLIKELY(capacity_ == 0)) {
I'd retain the original comment we had in v1 as it's quite useful:
// Avoid trivial heap operations on zero-capacity std::move()-d objects.
| // Non-templated base class to hold helpers for FlatHashMap.
| struct FlatHashMapBase {
| class FlatHashMapV2Base {
What's the benefit of this class? There is no .cc file, so no type erasure is needed. It feels like this holds only some constants that could just be put in the internal namespace.
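For illustration, the alternative being suggested would look roughly like this; the constant names match the quoted diff, but the values are placeholders and not necessarily what the CL uses:

#include <cstddef>
#include <cstdint>

// Sketch only: constants in an internal namespace instead of a
// non-templated base class. Values are illustrative.
namespace perfetto::base::internal {
inline constexpr uint8_t kFreeSlot = 0x80;   // Example sentinel value.
inline constexpr uint8_t kTombstone = 0xFE;  // Example sentinel value.
inline constexpr size_t kNumClones = 15;     // Example: group width - 1.
}  // namespace perfetto::base::internal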
| for (; idx_ < map_->capacity_; idx_++) {
|   if (tags[idx_] != kFreeSlot && (AppendOnly || tags[idx_] != kTombstone))
| for (; ctrl_ != ctrl_end_; ++ctrl_, ++slots_) {
|   if (*ctrl_ != kFreeSlot && *ctrl_ != kTombstone)
I'm never sure how much the compiler can optimize around pointer aliasing, especially when using uint8_t (which has special rules).
Imho a bit better if you do:

const uint8_t cur_ctl = *ctrl_;
if (cur_ctl != kFreeSlot && cur_ctl != kTombstone)
  return;
| size_t idx_ = 0;
| const uint8_t* ctrl_ = nullptr;
| const uint8_t* ctrl_end_ = nullptr;
| Slot* slots_ = nullptr;
s/slots_/slot_/: not sure why it should be plural; I found it a bit confusing as I was expecting (as it is) just a pointer to the current slot.
| // Iterator for sparse 64-bit mask
| struct Iterator {
|  public:
|   PERFETTO_ALWAYS_INLINE Iterator(uint64_t mask) : mask_(mask) {}
+explicit
| keys_ = AlignedAllocTyped<Key[]>(n);  // Deliberately not 0-initialized.
| values_ =
|     AlignedAllocTyped<Value[]>(n);  // Deliberately not 0-initialized.
| size_t slots_offset =
Add some comment here to explain the memory layout. Initially I thought there was a bug because storage_ itself is NOT aligned, but there isn't.
The deal here is that the beginning of storage_ doesn't need to be aligned because the control bytes can be accessed at arbitrary indexes, but the slots must be aligned. So the only alignment that matters is in the middle, where the slots start.
Without this explanation, this 100% looks like an alignment bug masked by the fact that the allocator always gives you 16-byte alignment even if you don't ask for it.
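A hedged sketch of the layout computation being described here, using placeholder names (Slot, kNumClones, SlotsOffset, StorageBytes) rather than the CL's actual identifiers; the point is that only the start of the slot region needs rounding up to alignof(Slot):

#include <cstddef>
#include <cstdint>

struct Slot { uint64_t key; uint64_t value; };  // Placeholder slot type.
constexpr size_t kNumClones = 15;               // Mirrored ctrl bytes.

// Buffer layout: [ctrl: capacity bytes][clones: kNumClones bytes][pad][slots].
// The buffer start needs no particular alignment because control bytes are
// read at arbitrary byte offsets; only the slot region must be aligned.
inline size_t SlotsOffset(size_t capacity) {
  size_t ctrl_bytes = capacity + kNumClones;
  return (ctrl_bytes + alignof(Slot) - 1) & ~(alignof(Slot) - 1);
}

inline size_t StorageBytes(size_t capacity) {
  return SlotsOffset(capacity) + capacity * sizeof(Slot);
}

// Usage: given a uint8_t* storage of StorageBytes(capacity) bytes,
//   uint8_t* ctrl  = storage;
//   Slot*    slots = reinterpret_cast<Slot*>(storage + SlotsOffset(capacity));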
| PERFETTO_ALWAYS_INLINE void SetCtrl(size_t i, uint8_t h) {
|   ctrl_[i] = h;
|   // Update clone if this is one of the first kNumClones entries
|   if (i < kNumClones) {
PERFETTO_UNLIKELY
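For context on why the mirroring (and the suggested PERFETTO_UNLIKELY) makes sense: a group load that starts near the end of the control array reads past the last real control byte, and the cloned bytes make that read equivalent to wrapping around to the start. A self-contained sketch with illustrative names, not the CL's implementation:

#include <cstddef>
#include <cstdint>
#include <vector>

struct CtrlArray {
  static constexpr size_t kNumClones = 15;  // Group width - 1, illustrative.
  size_t capacity;                          // Power of two in practice.
  std::vector<uint8_t> ctrl;                // capacity + kNumClones bytes.

  explicit CtrlArray(size_t cap) : capacity(cap), ctrl(cap + kNumClones) {}

  // Mirroring ctrl[0..kNumClones) at ctrl[capacity..) lets a group load of
  // kNumClones + 1 bytes start at *any* index < capacity without wrapping.
  // Only the first kNumClones indices hit the mirror, hence the cold branch.
  void SetCtrl(size_t i, uint8_t h) {
    ctrl[i] = h;
    if (i < kNumClones)  // Rare: candidate for PERFETTO_UNLIKELY.
      ctrl[capacity + i] = h;
  }
};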
| AlignedUniquePtr<Value[]> values_;
|
| // Slots remaining + has_deleted flag
| GrowthInfo growth_info_{0, 0};
Out of curiosity, is putting the "has tombstones" flag in a bit of the same word something that ABSL does?
It feels a bit odd, especially considering that it requires a bitmask on every access.
(But if they did, I guess they had good reasons for it.)
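For reference, the packed representation being questioned looks roughly like this. It is a sketch of the general idea only, not the exact encoding used in this CL or in Abseil:

#include <cstddef>
#include <cstdint>

// Sketch: the top bit flags "table has tombstones", the remaining bits count
// how many inserts are allowed before a rehash/resize. Encoding illustrative.
class GrowthInfo {
 public:
  GrowthInfo(size_t growth_left, bool has_tombstones)
      : rep_(growth_left | (has_tombstones ? kTombstoneBit : 0)) {}

  size_t GrowthLeft() const { return rep_ & ~kTombstoneBit; }
  bool HasTombstones() const { return (rep_ & kTombstoneBit) != 0; }

  // Decrementing is safe while GrowthLeft() > 0, so the flag bit survives.
  void OnInsert() { --rep_; }
  void OnErase() { rep_ |= kTombstoneBit; }

 private:
  static constexpr size_t kTombstoneBit = size_t{1}
                                          << (sizeof(size_t) * 8 - 1);
  size_t rep_;
};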
|   }
| }
|
| // Owns the actual memory: [ctrl[capacity]][clones[15]][slots[capacity]]
Imho better if you say:

// Memory layout:
//
// [Control bytes]
//   |capacity| bytes for control bytes.
//   15 bytes for control byte clones (*) (add a comment in appendix)
//   No alignment required (accessed at random byte offsets).
//
// [Padding for Slot alignment]
//
// [Slots]
//   capacity * sizeof(Slot) bytes: contains key-value pairs.
//   Must be aligned to alignof(Slot).