v0.2.0
1. Added subword tokenization
If you want to use subword features, which can be more informative than character n-grams, you can now do so.
I've introduced a new vectorizer component that can utilise pretrained tokenizers from language models for feature extraction.
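As an illustration of the contrast (not neofuzz internals): a wordpiece tokenizer splits a word into meaningful units such as `token` and `##ization`, whereas character n-grams just slide a fixed window over the string. A minimal sketch of the n-gram side:

```python
# Illustrative only: a character trigram featurizer, the kind of
# feature that subword tokens are being compared against here.
# A wordpiece tokenizer would instead yield units like ["token", "##ization"].
def char_ngrams(text: str, n: int = 3) -> list[str]:
    """Return all overlapping character n-grams of `text`."""
    return [text[i : i + n] for i in range(len(text) - n + 1)]

print(char_ngrams("tokenization"))
# ['tok', 'oke', 'ken', 'eni', 'niz', 'iza', 'zat', 'ati', 'tio', 'ion']
```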
Example code:

```python
from neofuzz import Process
from neofuzz.tokenization import SubWordVectorizer

# We can use BERT's wordpiece tokenizer for feature extraction
vectorizer = SubWordVectorizer("bert-base-uncased")
process = Process(vectorizer, metric="cosine")
```

2. Added code for persisting processes
You might want to persist processes to disk and reuse them in production pipelines.
Neofuzz can now serialize indexed Process objects for you using joblib.
You can save indexed processes like so:

```python
from neofuzz import char_ngram_process

corpus = ["apple", "apples", "maple"]  # any list of strings to index

process = char_ngram_process()
process.index(corpus)
process.to_disk("process.joblib")
```

And then load them in a production environment:
```python
from neofuzz import Process

process = Process.from_disk("process.joblib")
```
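Since serialization goes through joblib, `to_disk`/`from_disk` follow the familiar dump-then-load round-trip. A minimal sketch of that pattern using the standard library's `pickle` (which joblib's `dump`/`load` mirror), with a hypothetical stand-in object rather than a real indexed `Process`:

```python
import pickle
import tempfile
from pathlib import Path

# FakeProcess is a hypothetical stand-in for an indexed Process;
# joblib.dump/joblib.load follow the same pattern shown here.
class FakeProcess:
    def __init__(self, corpus):
        self.corpus = corpus

path = Path(tempfile.mkdtemp()) / "process.joblib"

# "to_disk": serialize the object to a file
with path.open("wb") as f:
    pickle.dump(FakeProcess(["apple", "maple"]), f)

# "from_disk": restore it in another environment
with path.open("rb") as f:
    restored = pickle.load(f)

print(restored.corpus)  # ['apple', 'maple']
```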