v0.2.0

@x-tabdeveloping x-tabdeveloping released this 21 May 10:05
· 25 commits to main since this release

1. Added subword tokenization

If you want to use subword features, which are more informative than character n-grams, you can now do so.
I've introduced a new vectorizer component that can utilise pretrained tokenizers from language models for feature extraction.

Example code:

from neofuzz import Process
from neofuzz.tokenization import SubWordVectorizer

# We can use BERT's WordPiece tokenizer for feature extraction
vectorizer = SubWordVectorizer("bert-base-uncased")
process = Process(vectorizer, metric="cosine")
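To make the difference between the two feature types concrete, here is a toy, standard-library-only sketch. It is not how neofuzz or any pretrained tokenizer is implemented: `char_ngrams`, `wordpiece_tokenize` and `TOY_VOCAB` are hypothetical names, and the tiny vocabulary is made up for illustration. It only shows the general idea of greedy longest-match subword splitting next to plain character n-grams.

```python
def char_ngrams(text, n=3):
    """Return all overlapping character n-grams of `text`."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]


def wordpiece_tokenize(word, vocab, unk="[UNK]"):
    """Greedy longest-match-first split of a single word into subword pieces.

    Pieces that continue a word are prefixed with "##", following the
    WordPiece convention. If no piece matches, the whole word becomes `unk`.
    """
    pieces = []
    start = 0
    while start < len(word):
        end = len(word)
        match = None
        # Try the longest remaining substring first, then shrink it.
        while start < end:
            piece = word[start:end]
            if start > 0:
                piece = "##" + piece
            if piece in vocab:
                match = piece
                break
            end -= 1
        if match is None:
            return [unk]
        pieces.append(match)
        start = end
    return pieces


# Hypothetical toy vocabulary; real tokenizers ship with tens of thousands of entries.
TOY_VOCAB = {"token", "##ization", "##izer", "un", "##related"}

ngrams = char_ngrams("tokenizer")
pieces = wordpiece_tokenize("tokenizer", TOY_VOCAB)
print(ngrams)  # ['tok', 'oke', 'ken', 'eni', 'niz', 'ize', 'zer']
print(pieces)  # ['token', '##izer']
```

The n-grams overlap and carry no notion of word parts, while the subword pieces line up with meaningful units such as "token", which is the sense in which subword features can be more informative.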

2. Added code for persisting processes

You might want to persist processes to disk and reuse them in production pipelines.
Neofuzz can now serialize indexed Process objects for you using joblib.

You can save indexed processes like so:

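   from neofuzz import char_ngram_process

   # Any collection of strings to search in (placeholder data)
   corpus = ["fuzzy search", "subword tokenization", "character n-grams"]

   # Build the index once, then write it to disk
   process = char_ngram_process()
   process.index(corpus)
   process.to_disk("process.joblib")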

And then load them in a production environment:

   from neofuzz import Process
 
   process = Process.from_disk("process.joblib")
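The index-once, load-later pattern can be sketched without neofuzz installed. The example below is only an illustration under stated assumptions: `ToyIndex` is a hypothetical stand-in class, not part of neofuzz, and it uses the standard library's `pickle` in place of joblib, which neofuzz actually uses.

```python
import pickle
import tempfile
from pathlib import Path


class ToyIndex:
    """Hypothetical stand-in for an indexed process; not a neofuzz class."""

    def __init__(self):
        self.options = []

    def index(self, options):
        # A real index would build a search structure; here we only store the strings.
        self.options = list(options)

    def to_disk(self, path):
        # Persist only plain data so the file does not depend on this class definition.
        with open(path, "wb") as f:
            pickle.dump({"options": self.options}, f)

    @classmethod
    def from_disk(cls, path):
        with open(path, "rb") as f:
            state = pickle.load(f)
        restored = cls()
        restored.options = state["options"]
        return restored


with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "index.pkl"

    # "Training" side: build the index and save it.
    original = ToyIndex()
    original.index(["fuzzy search", "subword tokens"])
    original.to_disk(path)

    # "Production" side: load the saved index without re-indexing.
    restored = ToyIndex.from_disk(path)

print(restored.options)  # ['fuzzy search', 'subword tokens']
```

Both pickle and joblib files can execute arbitrary code when loaded, so only load files from sources you trust.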