HenokB/amtiktoken

Amharic Tokenizer

A simple Amharic text tokenizer with real-time, colorized visualization of tokens. It supports several tokenization modes: server-side SentencePiece, local BPE-like, character, whitespace, and random subword tokenization.
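
To make the simpler modes concrete, here is a minimal sketch of whitespace and character tokenization with per-token coloring. It is not taken from this repository: the function names, the ANSI palette, and the sample string are all assumptions for illustration, and the actual app may split and color tokens differently.

```python
from itertools import cycle

# Hypothetical palette: ANSI background color codes, cycled across tokens.
PALETTE = [41, 42, 43, 44, 45, 46]
RESET = "\033[0m"


def tokenize_whitespace(text: str) -> list[str]:
    """Whitespace mode: split on runs of whitespace."""
    return text.split()


def tokenize_chars(text: str) -> list[str]:
    """Character mode: one token per code point, skipping whitespace."""
    return [ch for ch in text if not ch.isspace()]


def colorize(tokens: list[str]) -> str:
    """Wrap each token in a background color so token boundaries are visible."""
    return "".join(
        f"\033[{code}m{tok}{RESET}" for tok, code in zip(tokens, cycle(PALETTE))
    )


if __name__ == "__main__":
    sample = "ሰላም ለዓለም"  # illustrative sample text
    print(colorize(tokenize_whitespace(sample)))
    print(colorize(tokenize_chars(sample)))
```

Each Ethiopic syllable is a single code point, so character mode yields one token per syllable.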

Screenshot (2025-09-30)

You can replace the tokenizer model with one you train yourself using SentencePiece:

spm_train --input=sentences_dataset.txt --model_prefix=amharic --vocab_size=8000 --character_coverage=0.9995 --model_type=bpe
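
The same training run can also be started from Python through the `sentencepiece` package, which takes the command-line flags as keyword arguments. The sketch below mirrors the settings above. The helper name `train_and_load` is hypothetical, the input file name is a placeholder for your own corpus of one sentence per line, and how the app loads the resulting model is not shown here.

```python
# Settings mirroring the spm_train command; the input path is a placeholder.
TRAIN_ARGS = {
    "input": "sentences_dataset.txt",
    "model_prefix": "amharic",
    "vocab_size": 8000,
    "character_coverage": 0.9995,
    "model_type": "bpe",
}


def train_and_load(args: dict = TRAIN_ARGS):
    """Train a SentencePiece model and return a processor loaded from it.

    Hypothetical helper, not part of this repository.
    """
    # Third-party dependency: pip install sentencepiece
    import sentencepiece as spm

    # Writes <model_prefix>.model and <model_prefix>.vocab to the working directory.
    spm.SentencePieceTrainer.train(**args)
    return spm.SentencePieceProcessor(model_file=f"{args['model_prefix']}.model")
```

Calling `train_and_load().encode(text, out_type=str)` returns the subword pieces for `text` as strings, which is a quick way to check a newly trained model before swapping it in.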

About

Amharic tiktoken-style visualization based on SentencePiece
