Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

support custom Q or R or P of BetaMinhash #14

Open
yuhongye opened this issue Feb 24, 2020 · 2 comments
Open

support custom Q or R or P of BetaMinhash #14

yuhongye opened this issue Feb 24, 2020 · 2 comments

Comments

@yuhongye
Copy link

Is there any plan to support custom Q or R or P of BetaMinhash?

@sherifnada
Copy link
Contributor

Hi @yuhongye

There are currently no plans to support custom Q/R/P values for BetaMinhash. For that usecase, we recommend using HyperMinHash. Would you mind sharing a little bit about any usecases you have for which you'd like this feature?

Best,
Shrif

@yuhongye
Copy link
Author

yuhongye commented Feb 25, 2020

We want to take the intersection of multiple sets and be as accurate as possible, so we need a larger p.
I tested the accuracy of HyperMinHash and BetaMinHash in calculating the jaccard index, and found that BetaMinHash is better in the case p=14, q=6, r=10. In my knowledge, the accuracy of these two algorithms should be the same, so I went to the source code to find the difference between them, and found four different implementations:

  1. calculate the position of the left-most 1-bit, BetaMinhash code is:
private static short getLeftmostOneBitPosition(boolean[] bits, int p, int q) {
    int _2toTheQ = (1 << q);

    // Notice: I think offset should be p, not p + 1, becaulse index start at 0
    int offset = p + 1; 
    for (int i = offset; i < _2toTheQ + offset; i++) {
      if (bits[i]) {
        return (short) (i + 1 - offset);
      }
    }
    return (short) (_2toTheQ + 1);
  }

and hyperminhash is

   final long zeroSearchSpace = (hllHash << p) | (long) (1 << (p - 1));
   final int leftmostOnePosition = Long.numberOfLeadingZeros(zeroSearchSpace) + 1;

I think BetaMinhash has a bug, I marked it in the comments of the code.

  1. calculate the r bits, BetaMinhash get the rightmost r bit of hash[1], and HyperMinHash get the leftmost r bits of hash[1]

  2. when calculate c in the algorithm 2.1.4 in the paper, BetaMinhash compare the whole register, but HyperMinhash only compare the mantissa.

// BetaMinhash
 for (BetaMinHash sketch : sketches) {
   itemInIntersection = itemInIntersection &&
      firstSketch.registers[i] == sketch.registers[i];
}

// HyperMinhash
for (HyperMinHash sketch : sketches) {
  itemInIntersection = itemInIntersection &&
      firstSketch.registers.getMantissaAtRegister(i) == sketch.registers.getMantissaAtRegister(i);
}
  1. BetaMinhash use the expectedCollision algorithm in the paper when calculate similarity, but HyperMinHash does not.

I want to know the impact of these differences, Thank you! (Hope you can understand my english expression.)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants