Releases: x-tabdeveloping/neofuzz
v0.3.0
New in version 0.3.0
Now you can reorder your search results using Levenshtein distance!
Sometimes n-gram or other vectorized processes don't order the results quite correctly.
In these cases you can retrieve a larger number of candidates from the indexed corpus, then refine their ranking with Levenshtein distance.
This gives you the speed of Neofuzz, with the accuracy of TheFuzz :D
from neofuzz import char_ngram_process
process = char_ngram_process()
process.index(corpus)
# Retrieve the 30 best candidates quickly, reorder them with Levenshtein
# distance, then keep the top 5.
top_5 = process.extract("your query", limit=30, refine_levenshtein=True)[:5]
v0.2.0
1. Added subword tokenization
If you want to use subword features, which are more informative than character n-grams, you can now do so.
I've introduced a new vectorizer component that can utilise pretrained tokenizers from language models for feature extraction.
Example code:
from neofuzz import Process
from neofuzz.tokenization import SubWordVectorizer
# We can use BERT's WordPiece tokenizer for feature extraction
vectorizer = SubWordVectorizer("bert-base-uncased")
process = Process(vectorizer, metric="cosine")
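Indexing and searching then work the same way as with the character n-gram process; the snippet below is a quick sketch, with corpus and the query as placeholders:
process.index(corpus)
results = process.extract("your query", limit=10)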
2. Added code for persisting processes
You might want to persist processes to disk and reuse them in production pipelines.
Neofuzz can now serialize indexed Process objects for you using joblib.
You can save indexed processes like so:
from neofuzz import char_ngram_process
process = char_ngram_process()
process.index(corpus)
# Serialize the indexed process to disk with joblib
process.to_disk("process.joblib")
And then load it in a production environment:
from neofuzz import Process
process = Process.from_disk("process.joblib")
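Since the serialized object is an indexed process, it should be ready to query right away; a minimal sketch with a placeholder query:
results = process.extract("your query", limit=10)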