Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

questions on bottom-sketch #30

Open
jianshu93 opened this issue Sep 21, 2023 · 0 comments
Open

questions on bottom-sketch #30

jianshu93 opened this issue Sep 21, 2023 · 0 comments

Comments

@jianshu93
Copy link

Dear Anndrii,

I am writing to ask about the actually implementation of MinHash, there are 3 different implementations, k-min sketch, bottom-k sketch, which is:

a ‘bottom sketch’ strategy, originally proposed by Broder (1997), is commonly used to implement the MinHash algorithm, Instead of using m hash functions, all elements from a given set A are passed through a single hash function and the smallest m hash values (instead of elements) are stored in a sorted sketch S_b (A) of size m. The probability that sketch S_b(A), S_b(B) share a hash value represents the probability of random sampling a shared element from the union of set A and B. So, the resemblance of set A, B can be quickly estimated by counting the matched values between S_b(A) and S_b(B). The idea is mentioned here: http://www.cohenwang.org/edith/Surveys/minhash.pdf

This idea will increase the variance of the estimator right because only one hash function is used. More hash functions used, it will be more accurate. What to you think.

Thanks,

Jianshu

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

1 participant