MinHash

A simple but effective pure python implementation of Minhash + Banded LSH. Built for for finding the set of all similar strings in a large corpus in O(N) time. The entire dataset does not ever need to be in memory: the algorithm collects pairs in a single streaming pass.

Designed for "fuzzy matching" strings to account for misspellings, it could also be used for finding similar documents from a web crawl by splitting strings along word boundaries rather than character boundaries.

Running Test:

>>> python MinHash.py

Usage:

from MinHash import get_similar_docs

docs = ['aaaaab', 'aaaaac', 'xyz']
similar_docs = get_similar_docs(docs)
print similar_docs

> set([('aaaaab', 'aaaaac')])

Note that you may need to adjust the default parameters to best match your application.

Name		Name	Last commit message	Last commit date
Latest commit History 9 Commits
.gitignore		.gitignore
LICENSE		LICENSE
MinHash.py		MinHash.py
README.md		README.md
setup.py		setup.py

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Repository files navigation

MinHash

References

About

Uh oh!

Releases

Packages

Uh oh!

Contributors 2

Uh oh!

Languages

License

anthonygarvan/MinHash

Folders and files

Latest commit

History

Repository files navigation

MinHash

References

About

Resources

License

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Contributors 2

Uh oh!

Languages

Packages