A tokenizer is a software component that breaks down text into smaller units called tokens. These tokens can be words, characters, or subwords.
A tokenizer typically works by applying a set of rules to the input text.
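As a rough illustration of these granularities (a toy sketch, not this tokenizer's actual behaviour), the same word can be split at the word, character, or subword level:

```python
# Toy illustration of token granularities; the subword split is hypothetical.
text = "tokenization"

word_tokens = text.split()                # word level:      ['tokenization']
char_tokens = list(text)                  # character level: ['t', 'o', 'k', ...]
subword_tokens = ["token", "ization"]     # subword level (illustrative BPE-style split)

print(word_tokens, char_tokens, subword_tokens)
```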
We built this Llama Hindi Tokenizer by learning a Hindi vocabulary from the AI4Bharat corpus and combining it with the Llama 2 32k tokenizer (a sketch of this process follows the list below).
- Base Architecture: Llama 2 Tokenizer
- Input: Hindi text
- Tokenization: BPE Tokenizer
- Vocabulary: Learned from the AI4Bharat corpus
- Output: Sequence of tokens
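The combination step itself is not spelled out here, but a common recipe (shown below as a hedged sketch, with the corpus path, file names, and vocabulary size as assumptions) is to train a SentencePiece BPE model on the Hindi corpus and append its pieces to the Llama 2 tokenizer's SentencePiece model:

```python
# Hedged sketch of one way to merge a Hindi BPE vocabulary into the Llama 2
# tokenizer; the corpus path, file names, and vocab size are assumptions.
import sentencepiece as spm
from sentencepiece import sentencepiece_model_pb2 as sp_pb2
from transformers import LlamaTokenizer

# 1. Train a BPE model on Hindi text.
spm.SentencePieceTrainer.train(
    input="hindi_corpus.txt",      # plain-text Hindi corpus (assumed path)
    model_prefix="hindi_bpe",
    vocab_size=16000,              # assumed size
    model_type="bpe",
    character_coverage=1.0,
)

# 2. Load the base Llama 2 SentencePiece model and the new Hindi model.
llama_tokenizer = LlamaTokenizer.from_pretrained("meta-llama/Llama-2-7b-hf")
llama_spm = sp_pb2.ModelProto()
llama_spm.ParseFromString(llama_tokenizer.sp_model.serialized_model_proto())

hindi_spm = sp_pb2.ModelProto()
with open("hindi_bpe.model", "rb") as f:
    hindi_spm.ParseFromString(f.read())

# 3. Append Hindi pieces that the Llama 2 vocabulary does not already contain.
existing = {p.piece for p in llama_spm.pieces}
for p in hindi_spm.pieces:
    if p.piece not in existing:
        new_piece = sp_pb2.ModelProto.SentencePiece()
        new_piece.piece = p.piece
        new_piece.score = 0.0
        llama_spm.pieces.append(new_piece)

# 4. Save the merged model; LlamaTokenizer.save_pretrained can then write out
#    tokenizer_config.json and special_tokens_map.json alongside it.
with open("merged_tokenizer.model", "wb") as f:
    f.write(llama_spm.SerializeToString())
```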
- Download `special_tokens_map.json`, `tokenizer.model`, and `tokenizer_config.json` to a folder.
- Create your notebook in the parent directory and use the code below to load the tokenizer:
```python
from transformers import LlamaForCausalLM, LlamaTokenizer

tokenizer = LlamaTokenizer.from_pretrained("Path_to_the_downloaded_tokeniser_folder")
model = LlamaForCausalLM.from_pretrained("meta-llama/Llama-2-7b-hf")
```
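The loaded tokenizer can then be applied to Hindi text as shown below; the exact subword pieces depend on the learned vocabulary, so the output here is only illustrative.

```python
# Illustrative usage; the exact subword pieces depend on the learned vocabulary.
text = "नमस्ते, आप कैसे हैं?"
print(tokenizer.tokenize(text))                         # subword pieces
print(tokenizer(text, return_tensors="pt").input_ids)   # token ids as a tensor
```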
- Going further, this tokenizer and the base Llama 2 model can be used to build a new Hindi+English LLM by curating a Hindi instruction dataset and finetuning the base model on it.
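One practical detail if this route is taken: if the merged vocabulary is larger than Llama 2's original 32k entries, the base model's embedding matrices must be resized to match the tokenizer before finetuning. A minimal sketch, assuming `model` and `tokenizer` were loaded as in the snippet above:

```python
# Sketch only: align the base model's embeddings with the extended tokenizer
# vocabulary before any finetuning. Assumes `model` and `tokenizer` were loaded
# as in the snippet above.
model.resize_token_embeddings(len(tokenizer))
```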