Talk about Metadata for BloomFilter #1701

mapleFU · 2023-08-25T06:26:37Z

mapleFU
Aug 25, 2023
Collaborator

In brief, we plan to support RedisBloom[1]. We plan to use BlockSplitBloomFilter as the underlying BloomFilter[2], because it is widely used by database systems such as RocksDB[3], Impala, and Kudu. It has already been merged into kvrocks.

The original RedisBloom implementation has two levels:

SBChain: a chain of in-memory Bloom filters
Bloom: the BloomFilter implementation

We can easily support this; however, we need to design our metadata carefully.

Draft1

Currently, the draft is:

ChainMetadata: (num_filters, scaling, expansion)
BFMetadata: (num_distinct, size, false-positive-rate)

For ChainMetadata:

num_filters is the number of underlying BFMetadata entries. It is initialized to 1 and can grow when "scaling" is enabled.
scaling and expansion: Adding an element to a Bloom filter never fails due to the data structure "filling up". Instead the error rate starts to grow. To keep the error close to the one set on filter initialisation - the bloom filter will auto-scale, meaning when capacity is reached an additional sub-filter will be created. The size of the new sub-filter is the size of the last sub-filter multiplied by EXPANSION. The default expansion value is 2.

For BFMetadata:

num_distinct: the number of distinct values in this Bloom filter
size: the size of the Bloom filter in bytes
false-positive-rate: the false-positive probability (fpp) of the filter
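
As a rough illustration of Draft1, the two records could look like the sketch below. The field names come from the draft; the Python types, the defaults, and the `new_chain` helper are assumptions made for illustration and are not the kvrocks encoding.

```python
from dataclasses import dataclass


@dataclass
class ChainMetadata:
    num_filters: int = 1   # starts at 1, grows only when scaling is enabled
    scaling: bool = True   # assumed default, not stated in the draft
    expansion: int = 2     # default expansion value quoted above


@dataclass
class BFMetadata:
    num_distinct: int            # distinct values stored in this sub-filter
    size: int                    # size of the sub-filter in bytes
    false_positive_rate: float   # fpp of this sub-filter


def new_chain(size, fpp):
    """Hypothetical helper: a fresh chain with a single, empty sub-filter."""
    return ChainMetadata(), [BFMetadata(num_distinct=0, size=size, false_positive_rate=fpp)]
```

With this layout every "add" touches one BFMetadata record and, when a new sub-filter is allocated, the ChainMetadata record as well.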

During reading:

Get ChainMetadata
Get all BFMetadata
Search the filters from largest to smallest (this matches RedisBloom; personally, I think we could also search from smallest to largest)
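
The read steps above can be sketched as follows. This is only an illustration: plain Python sets stand in for the sub-filters (so there are no false positives here), and the function name `chain_might_contain` is hypothetical.

```python
def chain_might_contain(filters, item):
    """filters is ordered oldest (smallest) to newest (largest)."""
    # Probe the largest sub-filter first, as described for RedisBloom.
    for sub_filter in reversed(filters):
        if item in sub_filter:
            return True
    return False
```

The search order does not change the answer, only which sub-filter is probed first.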

During writing:

Get ChainMetadata
Get all BFMetadata
Search the filters from largest to smallest (this matches RedisBloom; personally, I think we could also search from smallest to largest). If the element exists, do nothing.
Otherwise, try to insert into the largest (last) filter. If the last filter is not full, insert into it; otherwise, allocate a new filter and insert there.
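
A minimal sketch of this write path, under stated assumptions: `SubFilter` is a hypothetical stand-in that uses a set instead of a bit array, "full" means the number of stored items has reached `capacity`, and raising an error when scaling is disabled is an assumption that the draft does not specify.

```python
from dataclasses import dataclass, field


@dataclass
class SubFilter:
    capacity: int
    items: set = field(default_factory=set)  # stand-in for the real bit array

    def is_full(self):
        return len(self.items) >= self.capacity


def chain_add(filters, item, expansion=2, scaling=True):
    """Return True if item was newly added, False if it already appears present."""
    # Step 1: search from largest (last) to smallest; if found, do nothing.
    if any(item in f.items for f in reversed(filters)):
        return False
    # Step 2: insert into the last filter, allocating a larger one if it is full.
    last = filters[-1]
    if last.is_full():
        if not scaling:
            # Assumed behaviour for a non-scaling chain.
            raise OverflowError("filter is full and scaling is disabled")
        last = SubFilter(capacity=last.capacity * expansion)
        filters.append(last)
    last.items.add(item)
    return True
```

In Draft1 the final step would also have to persist the updated BFMetadata and, when a sub-filter is appended, the ChainMetadata.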

Cons

Needs two kinds of metadata, which might be a bit complex. During "add", both must be updated.

Pros

Flexible: we can keep the BF metadata and the BF-chain metadata separate.

Draft2

ChainMetadata: (num_filters, scaling, expansion, base_size, sum_num_distinct, false-positive-rate)

The other fields are the same as in the previous draft. However, we can deduce the size of each filter from base_size and (scaling, expansion). This avoids a separate BFMetadata.
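
To illustrate the deduction, here is a small sketch that assumes the i-th sub-filter has size base_size * expansion^i, with i starting at 0. The function name `sub_filter_sizes` is hypothetical.

```python
def sub_filter_sizes(base_size, expansion, num_filters):
    """Sizes of all sub-filters, oldest first, derived from chain metadata alone."""
    return [base_size * expansion ** i for i in range(num_filters)]
```

For example, `sub_filter_sizes(1024, 2, 3)` gives `[1024, 2048, 4096]`, so no per-filter size needs to be stored.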

Cons

Harder to evolve the underlying BF format.

Pros

Only one metadata record, which is convenient.

References

[1] https://redis.io/docs/stack/bloom/
[2] https://github.com/apache/parquet-format/blob/master/BloomFilter.md
[3] https://github.com/facebook/rocksdb/wiki/RocksDB-Bloom-Filter#full-filters-new-format

mapleFU · 2023-08-25T06:27:00Z

mapleFU
Aug 25, 2023
Collaborator Author

cc @git-hulk @zncleon @PragmaTwice Would you mind taking a look?

2 replies

git-hulk Aug 25, 2023
Collaborator

Hi @mapleFU

I'm not quite sure I understand your proposal correctly (please correct me if I'm wrong):

BFMetadata stores the metadata of each BloomFilter layer. That is, there will be only one BFMetadata at the beginning, representing layer-0's metadata. There will then be two BFMetadata if layer-0 is not large enough to maintain the specified false-positive rate, and so on.
ChainMetadata only records whether the BloomFilter should scale, its expansion, and num_filters, which represents how many BloomFilter slices are in this key.

If I understand your proposal correctly, then I prefer solution 2, since its metadata won't be large and it saves the time needed to fetch the BFMetadata.

mapleFU Aug 25, 2023
Collaborator Author

Yes, you're right. Solution 2 means we can deduce the size of each Bloom filter, since the sizes scale as size -> size * scale -> size * scale^2 and so on. The false-positive rate is the same for all BloomFilters.

I prefer solution 2 too here

zncleon · 2023-08-25T10:48:15Z

zncleon
Aug 25, 2023

I also prefer solution 2, since it's not clear which column family BFMetadata should go in, and having so many bf_meta_key lookups in the code would be annoying.

0 replies

mapleFU · 2023-08-27T07:59:50Z

mapleFU
Aug 27, 2023
Collaborator Author

Closing, as we chose solution 2 here.

0 replies
