-
Notifications
You must be signed in to change notification settings - Fork 347
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add an implementation for mutable bloom filters #673
base: develop
Are you sure you want to change the base?
Conversation
Build are failing because of an old ruby version in travis. Here is a fix #674. |
Thanks for the PR. I will take a look at it this weekend. One thing I have been thinking about is copying this code: https://github.com/typelevel/cats-collections/blob/master/core/src/main/scala/cats/collections/BitSet.scala into the repo since that bitset is competitive with mutable (faster in some cases) but still immutable. Mutable data structures are really a last resort in this library. Even the priority queue case should be solved if you ask me since I just didn’t know about good immutable priority queues when I did the mutable package to add one. |
This implementation of BitSet looks very clever. I liked the idea and would go though it to understand the code as well. Thanks for sharing this. |
Codecov Report
@@ Coverage Diff @@
## develop #673 +/- ##
==========================================
+ Coverage 89.45% 89.5% +0.05%
==========================================
Files 113 114 +1
Lines 8944 9027 +83
Branches 490 496 +6
==========================================
+ Hits 8001 8080 +79
- Misses 943 947 +4
Continue to review full report at Codecov.
|
Hello @johnynek I tried a snapshot version of cats-collections here. I might be missing a few things here. It does improve compared to Cats Immutable BitSet with original BFHasher
I also noticed that This shows a significant improvement for fold / aggregate operations.
The create benchmarks are lower because it converts from BFSparse to BFInstance after adding quite a few elements in the |
Thanks for adding the benchmark with cats. One thing I'm thinking: maybe we don't need to use EWAH actually. With the tree-based cats-collection immutable bitset we may also not need BFSparse either, since the bitset is already internally sparse. |
Yes, I thought the same and wanted to remove EWAH, and I did bench mark that today morning when I was trying out cats. It provides a lot of improvement and a quick dirty way to check that was this (original code was this), which meant that when adding elements via foldLeft / aggregate, it would quickly convert from BFSparse => BFInstance. The above benchmark I posted is with and without this dirty change. Here is the benchmark for only Cats with original Hasher.
So moving from Scala Immutable + EWAH => Cats provides a lot of improvements, and is nearly same as Cats + EWAH with an early conversion to BFInstance from BFSparse (which is obvious as the underlying BitSet changes. |
Hi @johnynek I was having another idea about this. The BF Aggregator may internally use the mutable BloomFilter and in the This would take away a lot of complexity. In the current implementation we have mutability at the |
Hello @johnynek Were you able to take a look at this? |
👍 on the mutable backed BF & immutable |
This somehow fell off my radar, i'm sorry. I assume you have an internal implementation that you are happy with.
|
Adds an implementation of mutable Bloom Filters to
com.twitter.algebird.mutable
.This uses
java.util.BitSet
underneath, and the+=
and++=
operators mutatesthis
Bloom Filter when adding elements.This is significantly faster than the current BloomFilter implementation, and is useful when working with large BloomFIlters (> 1M elements) in one step (before we start reading from the filter). Since this is a mutable monoid, it doesn't have to copy the underlying bitset and can be more than 100 times faster. Scala BitSets are slower with loops for copying vs Java BitSet which uses
System.arraycopy
and hence I started experimenting with Java BitSet.This implementation is also suitable for cases where there are heavy queries. I've noticed improvements in query performance as well.
Hash
The hash functions used in this is of the form
h1 + i*h2
described in this paper. This also contributes to speeding up the creation and query of this filter compared to the immutable filter. This also makes the immutable and mutable filters incompatible, since their hash functions are different.Benchmarks
I've used a superset of
numberOfElements
andfpProb
for benchmarking these implementations:I've added a few other benchmarks via #672
Create
Immutable
Mutable BloomFilter with new Hash (the one in this PR)
I also tried to benchmark with old
BFHash
(I tried to see if it would be a good idea to make the two filters compatible)I don't think it is a good idea, because we are adding this as a new Bloom Filter and we don't want to deserialise immutable bloom filters into mutable filters. The performance improvements in these bench marks make sense to keep the new hashing functions.
Query
Immutable
Mutable (Because of new Hash functions)
I will try add some documentation for this as well, in a few days.