Skip to content

anthonygarvan/MinHash

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

9 Commits
 
 
 
 
 
 
 
 
 
 

Repository files navigation

MinHash

A simple but effective pure python implementation of Minhash + Banded LSH. Built for for finding the set of all similar strings in a large corpus in O(N) time. The entire dataset does not ever need to be in memory: the algorithm collects pairs in a single streaming pass.

Designed for "fuzzy matching" strings to account for misspellings, it could also be used for finding similar documents from a web crawl by splitting strings along word boundaries rather than character boundaries.

Running Test:

>>> python MinHash.py

Usage:

from MinHash import get_similar_docs

docs = ['aaaaab', 'aaaaac', 'xyz']
similar_docs = get_similar_docs(docs)
print similar_docs

> set([('aaaaab', 'aaaaac')])

Note that you may need to adjust the default parameters to best match your application.

References

About

No description, website, or topics provided.

Resources

License

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published

Languages