Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

better HLL estimator #17

Open
jianshu93 opened this issue Dec 11, 2023 · 8 comments
Open

better HLL estimator #17

jianshu93 opened this issue Dec 11, 2023 · 8 comments

Comments

@jianshu93
Copy link

Hello Team,

It seems that new cardinality estimator paper has two new estimators, the improved estimator and MLE estimator, better than the one originally proposed, see paper here: https://arxiv.org/pdf/1706.07290.pdf

Just wondering whether it worth trying, C++ version can be found in the paper.

Thank you,

Jianshu

@LucaCappelletti94
Copy link

@jianshu93 FYI, I implemented that estimator in my hyperloglog - it is still slow AF and only statistically better for what entails the union estimation. Maybe with better optimizers, one may get better performance, but when there is a need to compute logarithms in a core loop things can only go that fast.

@jianshu93
Copy link
Author

Hi @LucaCappelletti94,

Thanks for letting me know! Did you benchmarked against the MLE and improved estimator in Ertl's paper, as implemented in SourMash (https://docs.rs/sourmash/0.15.0/sourmash/sketch/hyperloglog/estimators/index.html)? This is a direct import of the C++ version from the original paper. It would be nice to have a benchmark and I would be happy to play with your implementation.

Best,

Jianshu

@LucaCappelletti94
Copy link

I am running several benchmarks, I can add also that one in. Thanks for pointing that one out.

@jianshu93
Copy link
Author

And also this one, hypertwobits(https://github.com/axiomhq/hypertwobits/tree/main), a very new one, paper was published just last month. I am a little bit concerned about it since the benchmarks showed very large variation even than the original hyperloglog. Let me know your thoughts on it also.

Jianshu

@LucaCappelletti94
Copy link

Hi @LucaCappelletti94,

Thanks for letting me know! Did you benchmarked against the MLE and improved estimator in Ertl's paper, as implemented in SourMash (https://docs.rs/sourmash/0.15.0/sourmash/sketch/hyperloglog/estimators/index.html)? This is a direct import of the C++ version from the original paper. It would be nice to have a benchmark and I would be happy to play with your implementation.

Best,

Jianshu

FYI, I have just read their joint estimation algo and it does not match the one from Otmar. I am quite sure of the correctness of my implementation since I bothered Otmar several times to check it together, so the only thing that remains to verify is which one is more accurate and their speeds.

@LucaCappelletti94
Copy link

And also this one, hypertwobits(https://github.com/axiomhq/hypertwobits/tree/main), a very new one, paper was published just last month. I am a little bit concerned about it since the benchmarks showed very large variation even than the original hyperloglog. Let me know your thoughts on it also.

Jianshu

Yeah, those results seem sus, especially for a 2-bit register thinghy. We'll see in a bit, currently adding to lineup.

@jianshu93
Copy link
Author

jianshu93 commented Aug 15, 2024

I would also be interested in UltraLogLog (https://dl.acm.org/doi/abs/10.14778/3654621.3654632) and ExaLogLog (https://arxiv.org/abs/2402.13726). But no implementation is available in Rust. All Ertl's implementation is in the hash4j java library, which I do not like. Since you have so many benchmarks there, I see it as a paper describing and available HLL-like implementations.

Jianshu

@LucaCappelletti94
Copy link

I would also be interested in UltraLogLog (https://dl.acm.org/doi/abs/10.14778/3654621.3654632) and ExaLogLog (https://arxiv.org/abs/2402.13726). But no implementation is available in Rust. All Ertl's implementation is in the hash4j java library, which I do not like.

Jianshu

Samesies on Java. A bit hard to do a fair comparison there and I do not have the time to re-implement it, at least, not now.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants