arashbm/hyperloglog

hll::hyperloglog: HyperLogLog++ with C++14

HyperLogLog is a probabilistic data structure that estimates the cardinality of very large multisets with a pre-determined accuracy using constant space.

Typical cardinality calculation methods, say using Python's collections.Counter or C++'s std::unordered_map, need O(n) space to count n unique elements. HyperLogLog, on the other hand, requires a constant O(1) amount of space to estimate the cardinality of multisets with billions of items within a pre-determined range of accuracy. For example, you can use HyperLogLog to estimate the number of unique IP addresses that connect to your web server, or the number of unique words in a book, to within a percent of the actual value, all with a few kilobytes of memory.
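
For contrast, the exact approach mentioned above can be sketched as follows. This is a minimal illustration using std::unordered_set; the helper name exact_cardinality is hypothetical and not part of this library. It has to store every distinct element, which is where the O(n) space comes from.

```cpp
#include <cstddef>
#include <unordered_set>
#include <vector>

// Hypothetical helper (not part of hll): exact distinct count.
// Every unique element is kept in the set, so memory grows as O(n)
// in the number of distinct items.
template <typename T>
std::size_t exact_cardinality(const std::vector<T>& items) {
  std::unordered_set<T> seen(items.begin(), items.end());
  return seen.size();
}
```

Calling exact_cardinality on a vector holding 1, 2, 2, 3, 3, 3 returns 3, at the cost of storing all three distinct values.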

The tradeoff here is between the amount of constant space you allocate to the HyperLogLog data structure and the accuracy, as determined by the relative error. The more registers you use, the more accurate your estimates are going to be. Each register is represented here by a uint8_t, although at most 6 bits of each register are ever used. A HyperLogLog data structure with 2^m registers has a relative error of 1.04/√(2^m). This means that a HyperLogLog data structure with m=12 uses 2^12 = 4096 registers (~4kB) and has a relative error of about 1.6% for large multisets.
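
The arithmetic behind these numbers can be sketched in a few lines. The helper names below (register_count, register_bytes, relative_error) are hypothetical and only illustrate the formulas in the paragraph above, assuming one uint8_t per register; they are not part of the library's API.

```cpp
#include <cmath>
#include <cstddef>
#include <cstdint>

// Hypothetical helpers (not part of hll) for the space/accuracy tradeoff.

// Number of registers for precision m: 2^m.
constexpr std::size_t register_count(unsigned m) {
  return std::size_t{1} << m;
}

// Memory taken by the registers when each one is stored as a uint8_t.
constexpr std::size_t register_bytes(unsigned m) {
  return register_count(m) * sizeof(std::uint8_t);
}

// Relative error for large multisets: 1.04 / sqrt(2^m).
inline double relative_error(unsigned m) {
  return 1.04 / std::sqrt(static_cast<double>(register_count(m)));
}
```

For m=12 this gives 4096 registers, 4096 bytes, and a relative error of 0.01625 (about 1.6%). Each increment of m doubles the memory and shrinks the error by a factor of √2.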

Installation

This package relies on CMake. Make sure a moderately recent version (3.14 or newer) is already installed. You should also have a C++ compiler with C++14 support.

Here is how to build and run the tests:

$ git clone https://github.com/arashbm/hyperloglog.git
$ cd hyperloglog
$ mkdir build  # make directory to build in
$ cd build
$ cmake ..
$ cmake --build . --target hyperloglog_tests  # build the tests
$ ./hyperloglog_tests

At this point you should see "All tests passed" if all the steps are successful. You can continue by installing the library on your system:

$ cmake --build . --target install

This final step might require admin privileges.

Example

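// in "example.cpp"

#include <cstddef>   // std::size_t
#include <iostream>
#include <hll/hyperloglog.hpp>

int main() {
  // Counter over std::size_t values, configured with the template
  // arguments 18 and 25 used throughout this example.
  hll::hyperloglog<std::size_t, 18, 25> h;

  // Insert ten million distinct values.
  for (std::size_t i = 1; i <= 10'000'000ul; i++)
    h.insert(i);

  // Print the estimated number of distinct values inserted.
  std::cout << "Estimate is " << h.estimate() << std::endl;
  return 0;
}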

Assuming you have installed the library as instructed above, you can compile and run example.cpp with:

$ g++ -std=c++14 -o example example.cpp -lhyperloglog
$ time ./example
Estimate is 1.00296e+07

real    0m0.669s
user    0m0.661s
sys     0m0.000s

On a relatively recent commodity CPU, hll::hyperloglog enables you to insert ten million items and calculate their cardinality in less than a second.
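
To put the estimate printed above in perspective, its deviation from the true count of ten million can be computed directly. The helper below is a hypothetical illustration, not part of the library.

```cpp
#include <cmath>

// Hypothetical helper (not part of hll): relative deviation of an
// estimate from the known true cardinality.
inline double observed_relative_error(double estimate, double truth) {
  return std::fabs(estimate - truth) / truth;
}
```

For the run shown above, observed_relative_error(1.00296e7, 1e7) is about 0.00296, so the estimate is off by roughly 0.3%.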

See more examples of hll::hyperloglog in the tests located at src/tests.cpp.