support custom Q or R or P of BetaMinhash #14

yuhongye · 2020-02-24T12:10:34Z

Is there any plan to support custom Q, R, or P values for BetaMinHash?

sherifnada · 2020-02-24T17:29:27Z

There are currently no plans to support custom Q/R/P values for BetaMinHash. For that use case, we recommend using HyperMinHash. Would you mind sharing a bit about the use cases for which you'd like this feature?

Best,
Sherif

yuhongye · 2020-02-25T09:47:45Z

We want to take the intersection of multiple sets as accurately as possible, so we need a larger p.
I tested the accuracy of HyperMinHash and BetaMinHash when estimating the Jaccard index and found that BetaMinHash is more accurate with p=14, q=6, r=10. As far as I know, the two algorithms should be equally accurate, so I compared their source code and found four differences in the implementations:

To calculate the position of the leftmost 1-bit, the BetaMinHash code is:

private static short getLeftmostOneBitPosition(boolean[] bits, int p, int q) {
  int _2toTheQ = (1 << q);

  // Note: I think offset should be p, not p + 1, because indices start at 0
  int offset = p + 1;
  for (int i = offset; i < _2toTheQ + offset; i++) {
    if (bits[i]) {
      return (short) (i + 1 - offset);
    }
  }
  return (short) (_2toTheQ + 1);
}

and the HyperMinHash code is:

   final long zeroSearchSpace = (hllHash << p) | (long) (1 << (p - 1));
   final int leftmostOnePosition = Long.numberOfLeadingZeros(zeroSearchSpace) + 1;

I think BetaMinHash has a bug here; I marked it in a comment in the code above.
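
To make the suspected off-by-one concrete, here is a minimal sketch in which the start of the scan window is a parameter. The class and method names are hypothetical and are not part of the library; whether index p really is the first bit after the bucket bits depends on how the bit array is built, which is not shown here.

```java
// Hypothetical helper, not the library's code: same scan as above,
// but the window start (offset) is passed in so both choices can be compared.
final class LeftmostOneBitSketch {
    /**
     * Returns the 1-based position of the first set bit in
     * bits[offset, offset + 2^q), or 2^q + 1 if no bit is set there.
     */
    static int leftmostOneBitPosition(boolean[] bits, int offset, int q) {
        int window = 1 << q;
        for (int i = offset; i < offset + window; i++) {
            if (bits[i]) {
                return i + 1 - offset;
            }
        }
        return window + 1;
    }

    public static void main(String[] args) {
        int p = 4, q = 3; // small illustrative values, not the library defaults
        boolean[] bits = new boolean[64];
        bits[p] = true;     // assumed to be the first bit after the p bucket bits
        bits[p + 3] = true;

        // Window starting at index p sees the bit at index p.
        System.out.println(leftmostOneBitPosition(bits, p, q));     // prints 1
        // Window starting at index p + 1 skips it and finds the next set bit.
        System.out.println(leftmostOneBitPosition(bits, p + 1, q)); // prints 3
    }
}
```

If the bucket bits occupy indices 0 to p - 1, the second call ignores one bit of the hash, which is the behavior I am asking about.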

To calculate the r bits, BetaMinHash takes the rightmost r bits of hash[1], while HyperMinHash takes the leftmost r bits of hash[1].
When calculating c in Algorithm 2.1.4 of the paper, BetaMinHash compares the whole register, but HyperMinHash compares only the mantissa.

// BetaMinhash
for (BetaMinHash sketch : sketches) {
  itemInIntersection = itemInIntersection &&
      firstSketch.registers[i] == sketch.registers[i];
}

// HyperMinhash
for (HyperMinHash sketch : sketches) {
  itemInIntersection = itemInIntersection &&
      firstSketch.registers.getMantissaAtRegister(i) == sketch.registers.getMantissaAtRegister(i);
}
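
As a rough illustration of the two differences above (which r bits of the hash are kept, and what is compared per register), here is a minimal sketch. All names are hypothetical, and the packed register layout (exponent in the high bits, r-bit mantissa in the low bits) is an assumption made only for this example, not a description of either library class.

```java
// Hypothetical illustration; none of these helpers exist in the library.
final class MantissaSketch {
    /** Leftmost (most significant) r bits of a 64-bit hash; assumes 1 <= r <= 63. */
    static long leftmostBits(long hash, int r) {
        return hash >>> (Long.SIZE - r);
    }

    /** Rightmost (least significant) r bits of a 64-bit hash; assumes 1 <= r <= 63. */
    static long rightmostBits(long hash, int r) {
        return hash & ((1L << r) - 1);
    }

    /** Assumed layout: exponent in the high bits, r-bit mantissa in the low bits. */
    static int pack(int exponent, int mantissa, int r) {
        return (exponent << r) | mantissa;
    }

    /** Extracts the low r bits (the mantissa under the assumed layout). */
    static int mantissaOf(int register, int r) {
        return register & ((1 << r) - 1);
    }

    public static void main(String[] args) {
        int r = 10;
        long hash = 0xABCDEF0123456789L;
        // The same hash yields different mantissas depending on which end is used.
        System.out.println(Long.toHexString(leftmostBits(hash, r)));  // prints 2af
        System.out.println(Long.toHexString(rightmostBits(hash, r))); // prints 389

        // Two registers with the same mantissa but different exponents.
        int a = pack(3, 0x155, r);
        int b = pack(5, 0x155, r);
        System.out.println(a == b);                               // whole register: false
        System.out.println(mantissaOf(a, r) == mantissaOf(b, r)); // mantissa only: true
    }
}
```

Under this assumed layout, a mantissa-only comparison counts a register as matching even when the exponents differ, so it can report more matches than a whole-register comparison.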

BetaMinHash uses the expectedCollision algorithm from the paper when calculating similarity, but HyperMinHash does not.

I would like to know the impact of these differences. Thank you! (I hope my English is understandable.)
