This is a tokenizer for DNA ⛓️ that aligns with the central dogma of molecular biology. The Genomic Tokenizer (GT) builds this biological process flow into the standard tokenizer interface of the Hugging Face Transformers library. GT can be used to pre-train foundational transformer models on DNA sequences. [Read the preprint]
If you use it in your research, please cite it or contact me.
pip install genomic-tokenizer
pip install git+https://github.com/dermatologist/genomic-tokenizer.git
from genomic_tokenizer import GenomicTokenizer
# A FASTA header, if present, is ignored.
fasta = """
AGGCGAGGCGCGGGCGGAGGCGGTGCGCGGGCGGAGGCGGGGCGCGGAGATGTGGCGGAGGTGGAGGCGG
AGGCGTAGCCGCCCCTGGGGACGTCATTGGTGGCGGAAGCAATCGCCGGCAACCAGCTGTAAGCGAGGTA
GGCTCACTCGGGCACGGAGGGTGCGGGTGAGAAAGGGAACGATTTGCTAGGAGTGTATGCGCCCGTGCTA
"""
model_max_length = 2048
tokenizer = GenomicTokenizer(model_max_length) # Use this in your model training pipeline
tokens = tokenizer(fasta)
print(tokens)
{'input_ids': [2, 7, 12, 17, 19, 16, 1, 7, 20, 6, 12, 21, 16, 12, 20, 12, 12, 8, 12, 1, 10, 20, 10, 20, 11, 7, 20, 21, 23, 8, 7, 20, 7, 6, 12, 21, 19, 10, 11, 16, 19, 7, 1, 22, 7, 1, 19, 21, 7, 16, 1, 21, 12, 23, 19, 12, 20, 6, 1],
'token_type_ids': [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]}
- Identify the first occurrence of the start codon ATG.
- Split the sequence into codons of length 3 starting from the start codon.
- Convert synonymous codons to the same token.
- Convert stop codons to the [SEP] token.
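The steps above can be sketched in plain Python. This is an illustrative sketch, not the package's implementation: the function name `sketch_tokenize`, the use of one-letter amino acid codes as token strings, the `[UNK]` fallback for codons containing ambiguous bases, and the choice to keep reading in frame after a stop codon are all assumptions made for this example. It returns token strings rather than the integer IDs that GenomicTokenizer produces.

```python
from itertools import product

# Standard genetic code, with codons enumerated in T, C, A, G order.
# "*" marks the three stop codons (TAA, TAG, TGA).
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def sketch_tokenize(sequence):
    """Hypothetical helper: map a DNA string to codon-level token strings."""
    # Drop FASTA header lines and join the remaining lines into one string.
    seq = "".join(
        line.strip()
        for line in sequence.splitlines()
        if not line.startswith(">")
    ).upper()

    # Step 1: find the first occurrence of the start codon.
    start = seq.find("ATG")
    if start == -1:
        return []

    tokens = []
    # Step 2: read non-overlapping codons of length 3 from the start codon.
    # A trailing fragment shorter than 3 bases is dropped.
    for i in range(start, len(seq) - 2, 3):
        amino_acid = CODON_TABLE.get(seq[i:i + 3])
        if amino_acid is None:
            # Assumption: codons with non-ACGT characters become [UNK].
            tokens.append("[UNK]")
        elif amino_acid == "*":
            # Step 4: stop codons become [SEP].
            tokens.append("[SEP]")
        else:
            # Step 3: synonymous codons share one token (their amino acid).
            tokens.append(amino_acid)
    return tokens


print(sketch_tokenize("CCATGGCTGCCTAAGG"))
```

In this sketch GCT and GCC both encode alanine, so they collapse to the same token, and the two bases before ATG are skipped because reading begins at the start codon.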
@misc{GT-Eapen2025,
title = {Genomic {{Tokenizer}}: {{Toward}} a Biology-Driven Tokenization in Transformer Models for {{DNA}} Sequences},
shorttitle = {Genomic {{Tokenizer}}},
author = {Eapen, Bell Raj},
year = {2025},
month = apr,
pages = {2025.04.02.646836},
publisher = {bioRxiv},
doi = {10.1101/2025.04.02.646836},
urldate = {2025-04-11},
}
If you find this project useful, give us a star. It helps others discover the project.