Amharic Tokenizer

A simple Amharic text tokenizer with real-time colorized visualization of tokens. Supports multiple tokenization modes including server-side SentencePiece, local BPE-like, character, whitespace, and random subword tokenization.
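
The three simplest modes listed above can be sketched in a few lines of Python. This is an illustrative sketch only, not the code in this repository: the function names, the `max_len` parameter, and the `seed` parameter are assumptions made for the example.

```python
import random


def whitespace_tokenize(text: str) -> list[str]:
    """Split on runs of whitespace; each word becomes one token."""
    return text.split()


def char_tokenize(text: str) -> list[str]:
    """Emit every non-whitespace character (one Ethiopic fidel each) as a token."""
    return [ch for ch in text if not ch.isspace()]


def random_subword_tokenize(text: str, max_len: int = 3, seed=None) -> list[str]:
    """Cut each word into pieces of random length between 1 and max_len.

    Hypothetical stand-in for a "random subword" mode: it only illustrates
    that tokens never cross word boundaries and always rebuild the word.
    """
    rng = random.Random(seed)
    tokens = []
    for word in text.split():
        i = 0
        while i < len(word):
            step = rng.randint(1, max_len)
            tokens.append(word[i:i + step])
            i += step
    return tokens


if __name__ == "__main__":
    sample = "ሰላም ለዓለም"
    print(whitespace_tokenize(sample))
    print(char_tokenize(sample))
    print(random_subword_tokenize(sample, seed=0))
```

Whatever the mode, joining the tokens back together should reproduce the input with whitespace removed, which makes a handy sanity check when comparing modes side by side.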

To replace the bundled tokenizer model, train a new one with SentencePiece using the following command:

spm_train --input=sentences_dataset.txt --model_prefix=amharic --vocab_size=8000 --character_coverage=0.9995 --model_type=bpe
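
Once training finishes, the resulting `amharic.model` can be loaded from Python with the `sentencepiece` package. The sketch below is one way to check a newly trained model; it is not taken from `server.py`. The helper name `tokenize_with_sentencepiece` is hypothetical, and the default path assumes the model sits in the current directory.

```python
def tokenize_with_sentencepiece(text: str, model_path: str = "amharic.model") -> list[str]:
    """Return the subword pieces that a trained SentencePiece model assigns to text.

    model_path is an assumption: point it at the file produced by spm_train.
    """
    # Imported here so the module can be loaded even when the
    # sentencepiece package (pip install sentencepiece) is missing.
    import sentencepiece as spm

    sp = spm.SentencePieceProcessor(model_file=model_path)
    # out_type=str returns the pieces themselves rather than integer ids.
    return sp.encode(text, out_type=str)


if __name__ == "__main__":
    for piece in tokenize_with_sentencepiece("ሰላም ለዓለም"):
        print(piece)
```

SentencePiece marks the start of each word with the `▁` character, so pieces that begin with it correspond to word boundaries in the input.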

HenokB/amtiktoken
