Commit e54656b

0.0.4 (#12)
* more documentation
* easier query type setting to count-mean and count-mean-min sketches
1 parent e61d529 commit e54656b

9 files changed: +387 -94 lines changed
CHANGELOG.md

Lines changed: 2 additions & 0 deletions
@@ -5,6 +5,8 @@
 * Bloom Filter
 * Bloom Filter (on disk)
 * Count-Min Sketch
+* Count-Mean Sketch
+* Count-Mean-Min Sketch
 * Heavy Hitters
 * Stream Threshold
 * Import and export of each

README.rst

Lines changed: 11 additions & 6 deletions
@@ -5,6 +5,7 @@ PyProbables
 is to provide the developer with a pure-python implementation of common
 probabilistic data-structures to use in their work.
 
+
 Installation
 ------------------
 
@@ -25,14 +26,14 @@ To install `pyprobables`, simply clone the `repository on GitHub
 
 `pyprobables` supports python versions 2.7 and 3.3 - 3.6
 
+
 API Documentation
 ---------------------
 
-Documentation is currently under development. The documentation of
-the latest release will be hosted on
-`readthedocs.io <http://pyprobables.readthedocs.io/en/stable/?>`__
+The documentation is hosted on
+`readthedocs.io <http://pyprobables.readthedocs.io/en/latest/code.html#api>`__
 
-Once completed, you can build the documentation yourself by running:
+You can build the documentation yourself by running:
 
 ::
 
@@ -53,6 +54,7 @@ downloaded folder:
     $ python setup.py test
 
 
+
 Quickstart
 ------------------
 
@@ -76,8 +78,11 @@ Import pyprobables and setup a Count-Min Sketch:
     >>> cms.add('google.com') # should return 1
     >>> cms.add('facebook.com', 25) # insert 25 at once; should return 25
 
-See the documentation for other data structures available and for more
-examples!
+See the `API documentation <http://pyprobables.readthedocs.io/en/latest/code.html#api>`__
+for other data structures available and the
+`quickstart page <http://pyprobables.readthedocs.io/en/latest/quickstart.html#quickstart>`__
+for more examples!
+
 
 Changelog
 ------------------

docs/source/code.rst

Lines changed: 35 additions & 5 deletions
@@ -4,12 +4,12 @@ pyprobables API
 ***************
 
 Here you can find the full developer API for the pyprobables project.
+pyprobables provides a suite of probabilistic data-structures to be used
+in data analytics and data science projects.
 
-Contents:
-=========
 
 .. toctree::
-    :maxdepth: 3
+    :maxdepth: 4
 
     code
 
@@ -20,8 +20,10 @@ Data Structures and Classes
 Bloom Filters
 -------------
 
-Bloom Filters are a class of probabilistic data structures that guarantee a
-zero percent false negative rate and a predetermined false positive rate.
+Bloom Filters are a class of probabilistic data structures used for set
+operations. Bloom Filters guarantee a zero percent false negative rate
+and a predetermined false positive rate.
+
 
 BloomFilter
 +++++++++++++++++++++++++++++++
@@ -41,33 +43,61 @@ BloomFilterOnDisk
 Count-Min Sketches
 ------------------
 
+Count-Min Sketches are a class of probabilistic data structures designed to
+count the number of occurrences of data elements in data streams.
+
+
 CountMinSketch
 +++++++++++++++++++++++++++++++
 
 .. autoclass:: probables.CountMinSketch
     :members:
 
 
+CountMeanSketch
++++++++++++++++++++++++++++++++
+
+.. autoclass:: probables.CountMeanSketch
+    :members:
+
+
+CountMeanMinSketch
++++++++++++++++++++++++++++++++
+
+.. autoclass:: probables.CountMeanMinSketch
+    :members:
+
+
 HeavyHitters
 +++++++++++++++++++++++++++++++
 
 .. autoclass:: probables.HeavyHitters
     :members:
     :inherited-members:
 
+
 StreamThreshold
 +++++++++++++++++++++++++++++++
 
 .. autoclass:: probables.StreamThreshold
     :members:
     :inherited-members:
 
+
 Exceptions
 ===============================
 
 .. automodule:: probables.exceptions
     :members:
 
+
+Hashing Functions
+===============================
+
+.. automodule:: probables.hashes
+    :members:
+
+
 Indices and tables
 ==================
 

docs/source/index.rst

Lines changed: 1 addition & 0 deletions
@@ -5,6 +5,7 @@ Read More
 ==================
 
 * :ref:`api`
+* :ref:`quickstart`
 * :ref:`genindex`
 * :ref:`modindex`
 * :ref:`search`

docs/source/quickstart.rst

Lines changed: 177 additions & 4 deletions
@@ -1,14 +1,187 @@
 .. _quickstart:
 
 pyprobables Quickstart
-======================
+######################
 
 .. toctree::
-    :maxdepth: 2
-    :caption: Quickstart:
+    :maxdepth: 5
 
     quickstart
 
 
 Install
-^^^^^^^
+**************************
+
+The easiest method of installing pyprobables is by using the pip package
+manager:
+
+Pip Installation:
+
+::
+
+    $ pip install pyprobables
+
+
+API Documentation
+**************************
+
+The full API documentation for the pyprobables package: :ref:`api`
+
+Example Usage
+**************************
+
+Bloom Filters
+==========================
+
+Bloom Filters provide set operations on large datasets while being small in
+memory footprint. They provide a zero percent false negative rate and a
+predetermined, or desired, false positive rate.
+`more information <https://en.wikipedia.org/wiki/Bloom_filter>`__
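
To illustrate why a Bloom Filter can promise zero false negatives at a chosen false positive rate, here is a minimal Python 3 sketch. It is not the pyprobables implementation: the class name `ToyBloomFilter`, the salted SHA-256 hashing, and the attribute names are assumptions made for illustration. Only the `est_elements`, `false_positive_rate`, `add`, and `check` names mirror the quickstart. The sizing uses the standard textbook formulas for the number of bits and hash functions.

```python
import hashlib
import math


class ToyBloomFilter:
    """Illustrative Bloom filter; hypothetical, not the pyprobables class."""

    def __init__(self, est_elements, false_positive_rate):
        n, p = est_elements, false_positive_rate
        # Textbook sizing: m = -n * ln(p) / (ln 2)^2 bits, k = (m / n) * ln 2 hashes.
        self.num_bits = int(math.ceil(-n * math.log(p) / (math.log(2) ** 2)))
        self.num_hashes = max(1, int(round(self.num_bits / n * math.log(2))))
        self.bits = bytearray((self.num_bits + 7) // 8)

    def _positions(self, key):
        # Assumed hashing scheme: salt SHA-256 with the hash index.
        for i in range(self.num_hashes):
            digest = hashlib.sha256(('%d:%s' % (i, key)).encode('utf-8')).digest()
            yield int.from_bytes(digest[:8], 'big') % self.num_bits

    def add(self, key):
        # Set every bit the key maps to; bits are never cleared.
        for pos in self._positions(key):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def check(self, key):
        # True only if all of the key's bits are set.
        return all(self.bits[pos // 8] & (1 << (pos % 8))
                   for pos in self._positions(key))


toy = ToyBloomFilter(est_elements=1000, false_positive_rate=0.05)
for word in ['step', 'borzoi', 'rain']:
    toy.add(word)
```

Because bits are only ever set, every added key is always reported as present. A key that was never added can still be reported as present when other keys happen to set all of its bits, which is the false positive case.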
+
+Import, Initialize, and Train
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+.. code:: python
+
+    >>> from probables import (BloomFilter)
+    >>> blm = BloomFilter(est_elements=1000000, false_positive_rate=0.05)
+    >>> with open('war_and_peace.txt', 'r') as fp:
+    ...     for line in fp:
+    ...         for word in line.split():
+    ...             blm.add(word.lower())  # add each word to the bloom filter!
+    >>> # end reading in the file
+
+Query the Bloom Filter
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+.. code:: python
+
+    >>> words_to_check = ['step', 'borzoi', 'diametrically', 'fleches', 'rain']
+    >>> for word in words_to_check:
+    ...     blm.check(word)
+
+
+Export the Bloom Filter
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+.. code:: python
+
+    >>> blm.export('war_and_peace_bloom.blm')
+
+
+Import a Bloom Filter
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+.. code:: python
+
+    >>> blm2 = BloomFilter(filepath='war_and_peace_bloom.blm')
+    >>> print(blm2.check('sutler'))
+
+
+Other Bloom Filters
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+Bloom Filter on Disk
+"""""""""""""""""""""""""""""""""""""""""""""""
+
+The **Bloom Filter on Disk** is a specialized version of the standard
+Bloom Filter that is run directly off of disk instead of in memory. This
+can be useful for very large Bloom Filters or when needing to access many
+Blooms that are exported to file.
+
+
+Counting Bloom Filter
+"""""""""""""""""""""""""""""""""""""""""""""""
+
+**Counting Bloom Filters** are another specialized version of the standard
+Bloom Filter. Instead of using a bit array to track added elements, a
+Counting Bloom uses integers to track the number of times the element has
+been added. **currently not supported; planned**
+
+
+Count-Min Sketch
+==========================
+
+Count-Min Sketches, and their derivatives, are good for counting the number of
+occurrences of an element in streaming data while not needing to retain all the
+data elements. The result is a probabilistic count of elements inserted into
+the data structure. It will always provide a **maximum** number of times
+encountered. Notice that the result may be **more** than the true number
+of times it was inserted, but never fewer.
+`more information <https://en.wikipedia.org/wiki/Count%E2%80%93min_sketch>`__
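
The "more, but never fewer" property can be seen in a small Python 3 sketch. This is an illustrative toy, not the pyprobables implementation: `ToyCountMinSketch`, the salted SHA-256 hashing, and the `count` parameter name are assumptions. Only `width`, `depth`, `add`, and `check` mirror the quickstart. The table is kept deliberately tiny so that collisions happen.

```python
import hashlib


class ToyCountMinSketch:
    """Illustrative count-min sketch; hypothetical, not the pyprobables class."""

    def __init__(self, width, depth):
        self.width = width
        self.depth = depth
        # One row of counters per hash function.
        self.rows = [[0] * width for _ in range(depth)]

    def _columns(self, key):
        # Assumed hashing scheme: salt SHA-256 with the row index.
        for row in range(self.depth):
            digest = hashlib.sha256(('%d:%s' % (row, key)).encode('utf-8')).digest()
            yield int.from_bytes(digest[:8], 'big') % self.width

    def add(self, key, count=1):
        # Increment one counter per row, then report the new estimate.
        for row, col in enumerate(self._columns(key)):
            self.rows[row][col] += count
        return self.check(key)

    def check(self, key):
        # Colliding keys can only inflate a counter, so the minimum across
        # rows is the tightest estimate and never falls below the true count.
        return min(self.rows[row][col]
                   for row, col in enumerate(self._columns(key)))


sketch = ToyCountMinSketch(width=8, depth=3)
true_counts = {}
for word in 'the quick brown fox jumps over the lazy dog the end'.split():
    sketch.add(word)
    true_counts[word] = true_counts.get(word, 0) + 1
```

Comparing `sketch.check(word)` with `true_counts[word]` for each word shows estimates that are equal to or above the exact counts, depending on how the keys collide.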
+
+
+Import, Initialize, and Train
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+.. code:: python
+
+    >>> from probables import (CountMinSketch)
+    >>> cms = CountMinSketch(width=100000, depth=5)
+    >>> with open('war_and_peace.txt', 'r') as fp:
+    ...     for line in fp:
+    ...         for word in line.split():
+    ...             cms.add(word.lower())  # add each to the count-min sketch!
+
+
+Query the Count-Min Sketch
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+.. code:: python
+
+    >>> words_to_check = ['step', 'borzoi', 'diametrically', 'fleches', 'rain']
+    >>> for word in words_to_check:
+    ...     print(cms.check(word))  # prints: 80, 17, 1, 20, 25
+
+
+Export Count-Min Sketch
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+.. code:: python
+
+    >>> cms.export('war_and_peace.cms')
+
+
+Import a Count-Min Sketch
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+.. code:: python
+
+    >>> cms2 = CountMinSketch(filepath='war_and_peace.cms')
+    >>> print(cms2.check('fleches'))  # prints 20
+
+
+Other Count-Min Sketches
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+Count-Mean Sketch and Count-Mean-Min Sketch
+"""""""""""""""""""""""""""""""""""""""""""""""
+
+**Count-Mean Sketch** and **Count-Mean-Min Sketch** are identical to the
+Count-Min Sketch for the data structure but both differ in the method of
+calculating the number of times an element has been inserted. These are
+currently supported by specifying at query time which method is desired
+or by initializing to the desired class: CountMeanSketch or CountMeanMinSketch.
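
Since the three sketches share one table and differ only in how a query combines the per-row counters, the difference can be sketched as three small functions. The function names and the exact mean-min formula below are assumptions for illustration (one common formulation subtracts an estimate of collision noise from each counter and takes the median); the formulas pyprobables actually uses are not shown in this commit.

```python
def query_min(counters):
    """Count-min: the smallest per-row counter."""
    return min(counters)


def query_mean(counters):
    """Count-mean: the integer mean of the per-row counters."""
    return sum(counters) // len(counters)


def query_mean_min(counters, total_added, width):
    """Count-mean-min (assumed formulation): subtract the expected noise
    from each counter, then take the median of the corrected values."""
    corrected = sorted(c - (total_added - c) // (width - 1) for c in counters)
    mid = len(corrected) // 2
    if len(corrected) % 2:
        return corrected[mid]
    return (corrected[mid - 1] + corrected[mid]) // 2


# Hypothetical counters read from three rows for a single key.
counters = [12, 10, 15]
estimate_min = query_min(counters)
estimate_mean = query_mean(counters)
estimate_mean_min = query_mean_min(counters, total_added=1000, width=100)
```

With these made-up inputs the three strategies give 10, 12, and 3 respectively. Unlike the minimum, a noise-corrected estimate can fall below the true count, so the "never fewer" guarantee applies only to the count-min query.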
+
+
+Heavy Hitters
+"""""""""""""""""""""""""""""""""""""""""""""""
+
+**Heavy Hitters** is a version of the Count-Min Sketch that tracks those
+elements that are seen most often. Beyond the normal initialization parameters
+one only needs to specify the number of heavy hitters to track.
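
The tracking idea can be sketched as follows, under stated assumptions: `ToyHeavyHitters`, `num_hitters`, and the `hitters` attribute are hypothetical names, and an exact `collections.Counter` stands in for the sketch's estimated counts so the example stays short. The eviction rule shown (replace the smallest tracked element when a newcomer's estimate exceeds it) is one plausible policy, not necessarily the one pyprobables uses.

```python
from collections import Counter


class ToyHeavyHitters:
    """Illustrative top-k tracker; hypothetical, not the pyprobables class."""

    def __init__(self, num_hitters):
        self.num_hitters = num_hitters
        self.counts = Counter()  # stand-in for count-min sketch estimates
        self.hitters = {}        # tracked element -> latest estimate

    def add(self, key):
        self.counts[key] += 1
        estimate = self.counts[key]
        if key in self.hitters or len(self.hitters) < self.num_hitters:
            # Already tracked, or there is still a free slot.
            self.hitters[key] = estimate
        else:
            # Evict the weakest tracked element if the newcomer beats it.
            smallest = min(self.hitters, key=self.hitters.get)
            if estimate > self.hitters[smallest]:
                del self.hitters[smallest]
                self.hitters[key] = estimate
        return estimate


hh = ToyHeavyHitters(num_hitters=2)
for item in ['a'] * 5 + ['b'] * 3 + ['c'] * 2 + ['d']:
    hh.add(item)
```

After this stream, `hh.hitters` holds only the two most frequent elements, `'a'` and `'b'`, regardless of how many distinct elements were seen.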
+
+
+Stream Threshold
+"""""""""""""""""""""""""""""""""""""""""""""""
+
+**Stream Threshold** is another version of the Count-Min Sketch similar to the
+Heavy Hitters. The main difference is that there is a threshold for
+including an element to be tracked instead of tracking a certain number of
+elements.
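
A minimal sketch of the threshold variant, again with hypothetical names (`ToyStreamThreshold`, `threshold`, `meets_threshold`) and an exact `collections.Counter` standing in for the sketch's estimates:

```python
from collections import Counter


class ToyStreamThreshold:
    """Illustrative threshold tracker; hypothetical, not the pyprobables class."""

    def __init__(self, threshold):
        self.threshold = threshold
        self.counts = Counter()    # stand-in for count-min sketch estimates
        self.meets_threshold = {}  # element -> estimate, once it qualifies

    def add(self, key):
        self.counts[key] += 1
        estimate = self.counts[key]
        # Track any element whose estimate reaches the threshold; the
        # number of tracked elements is not capped.
        if estimate >= self.threshold:
            self.meets_threshold[key] = estimate
        return estimate


st = ToyStreamThreshold(threshold=3)
for item in ['a'] * 5 + ['b'] * 3 + ['c'] * 2 + ['d']:
    st.add(item)
```

Here the tracked set grows with the data: every element seen at least three times is kept, whereas a heavy-hitters tracker keeps a fixed number of elements.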
+
+
+Indices and tables
+==================
+
+* :ref:`home`
+* :ref:`api`
+* :ref:`genindex`
+* :ref:`modindex`
+* :ref:`search`

probables/__init__.py

Lines changed: 5 additions & 3 deletions
@@ -1,17 +1,19 @@
 ''' pyprobables module '''
 from __future__ import (unicode_literals, absolute_import, print_function)
 from .blooms import (BloomFilter, BloomFilterOnDisk)
-from .countminsketch import (CountMinSketch, HeavyHitters, StreamThreshold)
+from .countminsketch import (CountMinSketch, HeavyHitters, StreamThreshold,
+                             CountMeanSketch, CountMeanMinSketch)
 from .exceptions import (InitializationError, NotSupportedError,
                          ProbablesBaseException)
 
 __author__ = 'Tyler Barrus'
 __maintainer__ = 'Tyler Barrus'
 __email__ = '[email protected]'
 __license__ = 'MIT'
-__version__ = '0.0.3'
+__version__ = '0.0.4'
 __credits__ = []
 __url__ = 'https://github.com/barrust/pyprobables'
 
 __all__ = ['BloomFilter', 'BloomFilterOnDisk', 'CountMinSketch',
-           'HeavyHitters', 'StreamThreshold']
+           'HeavyHitters', 'StreamThreshold', 'CountMeanSketch',
+           'CountMeanMinSketch']

probables/countminsketch/__init__.py

Lines changed: 4 additions & 2 deletions
@@ -1,6 +1,8 @@
 ''' count-min sketch submodule '''
 from __future__ import (unicode_literals, absolute_import, print_function)
-from .countminsketch import (CountMinSketch, HeavyHitters, StreamThreshold)
+from .countminsketch import (CountMinSketch, HeavyHitters, StreamThreshold,
+                             CountMeanSketch, CountMeanMinSketch)
 
 
-__all__ = ['CountMinSketch', 'HeavyHitters', 'StreamThreshold']
+__all__ = ['CountMinSketch', 'HeavyHitters', 'StreamThreshold',
+           'CountMeanSketch', 'CountMeanMinSketch']
