A simple Amharic text tokenizer with real-time colorized visualization of tokens. Supports multiple tokenization modes including server-side SentencePiece, local BPE-like, character, whitespace, and random subword tokenization.
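As a rough illustration of the two simplest modes listed above, the sketch below shows what whitespace and character tokenization could look like in plain Python. The function names `whitespace_tokenize` and `char_tokenize` are hypothetical and are not taken from this project's code; dropping whitespace in character mode is also an assumption.

```python
def whitespace_tokenize(text):
    """Split text into tokens on runs of whitespace."""
    return text.split()


def char_tokenize(text):
    """Return one token per character, skipping whitespace (an assumption)."""
    return [ch for ch in text if not ch.isspace()]


sample = "ሰላም ዓለም"  # "hello world"
print(whitespace_tokenize(sample))
print(char_tokenize(sample))
```

Each Ethiopic syllable is a single Unicode code point, so character mode yields one token per syllable rather than per consonant or vowel.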
To replace the SentencePiece tokenizer model, train a new one with the following command:
spm_train --input=sentences_dataset.txt --model_prefix=amharic --vocab_size=8000 --character_coverage=0.9995 --model_type=bpe
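With `--model_prefix=amharic`, training writes `amharic.model` and `amharic.vocab`. A minimal sketch of loading that model from Python follows, assuming the `sentencepiece` package is installed and `amharic.model` is in the working directory. The helper name `tokenize_with_spm` is hypothetical, and how this project's server actually loads the model may differ.

```python
def tokenize_with_spm(text, model_path="amharic.model"):
    """Encode text into subword pieces using a trained SentencePiece model."""
    # Imported here so the module can be loaded without sentencepiece installed.
    import sentencepiece as spm

    processor = spm.SentencePieceProcessor(model_file=model_path)
    # out_type=str returns the subword pieces rather than their integer ids.
    return processor.encode(text, out_type=str)


if __name__ == "__main__":
    import os

    if os.path.exists("amharic.model"):
        print(tokenize_with_spm("ሰላም ዓለም"))
```

Passing `out_type=int` instead would return vocabulary ids, which is the form a downstream model usually consumes.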