A tokenizer is a software component that breaks down text into smaller units called tokens. These tokens can be words, characters, or subwords.
A tokenizer typically works by applying a set of rules to the input text.
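As a rough illustration of these granularities (a toy sketch, not this tokenizer's actual behaviour), the same word can be split at the word, character, or subword level:

```python
# Toy illustration of token granularities; the subword split is hypothetical.
text = "tokenization"

word_tokens = text.split()                # word level:      ['tokenization']
char_tokens = list(text)                  # character level: ['t', 'o', 'k', ...]
subword_tokens = ["token", "ization"]     # subword level (illustrative BPE-style split)

print(word_tokens, char_tokens, subword_tokens)
```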
We built this Llama Hindi Tokenizer by learning a Hindi vocabulary from the AI4Bharat corpus and combining it with the Llama 2 32k tokenizer (a sketch of this process follows the list below).
- Base Architecture: Llama 2 Tokenizer
- Input: Hindi text
- Tokenization: BPE Tokenizer
- Vocabulary: Learned from the AI4Bharat corpus
- Output: Sequence of tokens
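The combination step itself is not spelled out here, but a common recipe (shown below as a hedged sketch, with the corpus path, file names, and vocabulary size as assumptions) is to train a SentencePiece BPE model on the Hindi corpus and append its pieces to the Llama 2 tokenizer's SentencePiece model:

```python
# Hedged sketch of one way to merge a Hindi BPE vocabulary into the Llama 2
# tokenizer; the corpus path, file names, and vocab size are assumptions.
import sentencepiece as spm
from sentencepiece import sentencepiece_model_pb2 as sp_pb2
from transformers import LlamaTokenizer

# 1. Train a BPE model on Hindi text.
spm.SentencePieceTrainer.train(
    input="hindi_corpus.txt",      # plain-text Hindi corpus (assumed path)
    model_prefix="hindi_bpe",
    vocab_size=16000,              # assumed size
    model_type="bpe",
    character_coverage=1.0,
)

# 2. Load the base Llama 2 SentencePiece model and the new Hindi model.
llama_tokenizer = LlamaTokenizer.from_pretrained("meta-llama/Llama-2-7b-hf")
llama_spm = sp_pb2.ModelProto()
llama_spm.ParseFromString(llama_tokenizer.sp_model.serialized_model_proto())

hindi_spm = sp_pb2.ModelProto()
with open("hindi_bpe.model", "rb") as f:
    hindi_spm.ParseFromString(f.read())

# 3. Append Hindi pieces that the Llama 2 vocabulary does not already contain.
existing = {p.piece for p in llama_spm.pieces}
for p in hindi_spm.pieces:
    if p.piece not in existing:
        new_piece = sp_pb2.ModelProto.SentencePiece()
        new_piece.piece = p.piece
        new_piece.score = 0.0
        llama_spm.pieces.append(new_piece)

# 4. Save the merged model; LlamaTokenizer.save_pretrained can then write out
#    tokenizer_config.json and special_tokens_map.json alongside it.
with open("merged_tokenizer.model", "wb") as f:
    f.write(llama_spm.SerializeToString())
```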
- Download `special_tokens_map.json`, `tokenizer.model`, and `tokenizer_config.json` to a folder.
- Create your notebook in the parent directory and use the code below to load the tokenizer:
```python
from transformers import LlamaForCausalLM, LlamaTokenizer

tokenizer = LlamaTokenizer.from_pretrained("Path_to_the_downloaded_tokeniser_folder")
model = LlamaForCausalLM.from_pretrained("meta-llama/Llama-2-7b-hf")
```
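The loaded tokenizer can then be applied to Hindi text as shown below; the exact subword pieces depend on the learned vocabulary, so the output here is only illustrative.

```python
# Illustrative usage; the exact subword pieces depend on the learned vocabulary.
text = "नमस्ते, आप कैसे हैं?"
print(tokenizer.tokenize(text))                         # subword pieces
print(tokenizer(text, return_tensors="pt").input_ids)   # token ids as a tensor
```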
- Going further, this tokenizer and the base Llama 2 model can be used to build a new Hindi+English LLM by curating a Hindi instruction dataset and finetuning the base model on it.
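One practical detail if this route is taken: if the merged vocabulary is larger than Llama 2's original 32k entries, the base model's embedding matrices must be resized to match the tokenizer before finetuning. A minimal sketch, assuming `model` and `tokenizer` were loaded as in the snippet above:

```python
# Sketch only: align the base model's embeddings with the extended tokenizer
# vocabulary before any finetuning. Assumes `model` and `tokenizer` were loaded
# as in the snippet above.
model.resize_token_embeddings(len(tokenizer))
```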