OmniBioTE is a transformer model designed to capture the complex relationships inherent in biological sequences. Drawing inspiration from the BERT architecture, it supports pretraining on large unlabeled datasets and fine-tuning for various downstream tasks. With multiple tokenization strategies and robust distributed training, OmniBioTE offers a comprehensive solution for multimodal biosequence analysis.
- Introduction
- Features
- Data Loading and Preprocessing
- Tokenizer Training
- Model Training
- Evaluation and Finetuning
- GUE Evaluation
- TAPE Evaluation
- ProteinGLUE Evaluation
- Additional Evaluation Modules
- GUE Evaluation (LucaOne Version)
- TAPE Evaluation (LucaOne Version)
- TAPE Evaluation (ESM Version)
- Contact Evaluation (OmniBioTA)
- Contact Evaluation (ESM Version)
- Contact Evaluation (LucaOne Version)
- PDB Contact Evaluation
- Motif Selectivity Evaluation
- Pronab Evaluation
- Pronab Evaluation (Unimodal)
- Pronab Evaluation (LucaOne)
- Single-Character Evaluation Modules
- Additional Modules and Utilities
- Conclusion
OmniBioTE is built to handle the unique characteristics of biological sequences. The model offers two tokenization strategies: a SentencePiece-based byte-pair-encoding model and a single-character tokenizer.
If you're just interested in loading and querying the model, there is a minimal example notebook, `src/example.ipynb`, to get you started.
- Multimodal Support: Process DNA, RNA, and protein sequences.
- Flexible Tokenization: Choose between SentencePiece tokenization and a single-character tokenizer.
- Robust Data Loading: A custom data loader employs multithreading and prefetching to efficiently stream sequences from compressed files.
- Distributed Training: Integrated support for Distributed Data Parallel (DDP) and FullyShardedDataParallel (FSDP) enables scaling across multiple GPUs.
- Dynamic Checkpointing: Automatically resumes training from the latest checkpoint and manages disk space by cleaning up older checkpoints.
- Comprehensive Evaluation Pipelines: Evaluate the model on tasks ranging from classification and regression to contact map prediction using specialized evaluation modules.
- Performance Metrics: Tools are provided to compute detailed tensor statistics and monitor model performance.
The data loading pipeline, implemented in `training/loader.py`, includes:
- Multi-Threaded File Preloading: Uses a background thread with a queue to preload gzip-compressed files.
- Randomized Chunking: Splits files into sub-blocks and randomly samples fixed-size chunks for training.
- Token Filtering: Applies a configurable offset to token IDs and filters out banned tokens (such as PAD, MASK, EOS).
The custom dataset class, `OmniDataset`, combines data from multiple directories based on specified fractions. It shuffles file lists and streams sequences seamlessly into training batches.
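The concrete implementation lives in `training/loader.py`; as a rough, self-contained sketch of the same pattern (names and signatures here are illustrative, not the actual `OmniDataset` API), a background thread can decompress files into a bounded queue while the consumer samples random fixed-size chunks:

```python
import gzip
import queue
import random
import threading
from pathlib import Path

def _preload_worker(paths, out_queue):
    """Background thread: decompress each file and push its bytes into the queue."""
    for path in paths:
        with gzip.open(path, "rb") as f:
            out_queue.put(f.read())
    out_queue.put(None)  # sentinel: no more files

def stream_chunks(data_dir, chunk_size=512, queue_depth=4):
    """Yield randomly positioned fixed-size chunks from shuffled gzip files,
    with decompression prefetched on a background thread."""
    paths = list(Path(data_dir).glob("*.gz"))
    random.shuffle(paths)
    prefetched = queue.Queue(maxsize=queue_depth)
    threading.Thread(target=_preload_worker, args=(paths, prefetched), daemon=True).start()
    while (blob := prefetched.get()) is not None:
        if len(blob) <= chunk_size:
            yield blob
            continue
        # Sample a handful of random fixed-size chunks per file.
        for _ in range(len(blob) // chunk_size):
            start = random.randint(0, len(blob) - chunk_size)
            yield blob[start:start + chunk_size]
```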
OmniBioTE supports two tokenization approaches:
- SentencePiece Tokenization:
  Train a SentencePiece model to handle long biosequences. Tokenizer training scripts (e.g., `tokenization/tokenize_genbank.py` and `tokenization/tokenize_uniref.py`) and notebooks (e.g., `train_tokenizer.ipynb`) guide the process of building an efficient vocabulary.
- Single-Character Tokenization:
  A built-in single-character tokenizer (see `train_encoder-single-char.py`) maps each character to a fixed token ID.

Both methods incorporate mechanisms for shifting token IDs by a configurable offset (in order to yield non-overlapping token IDs for two different BPE tokenizers) and banning specific tokens (such as reserved IDs for `<PAD>`, `<MASK>`, and `<EOS>`).
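As an illustration of that offset-and-ban logic (a minimal sketch, not the repository's actual wrapper class), a thin layer over a SentencePiece model could look like the following; the offset and banned IDs are placeholders for whatever values a given checkpoint expects:

```python
import sentencepiece as spm

class OffsetTokenizer:
    """Wrap a SentencePiece model, shift its IDs by a fixed offset, and drop banned IDs.

    Sketch only: the real training/eval code defines its own wrapper.
    """

    def __init__(self, model_path, offset=0, banned_ids=None):
        self.sp = spm.SentencePieceProcessor(model_file=model_path)
        self.offset = offset
        self.banned_ids = banned_ids or set()

    def encode(self, sequence):
        ids = self.sp.encode(sequence, out_type=int)
        return [i + self.offset for i in ids if i not in self.banned_ids]

# e.g. give protein tokens IDs starting at 2048 so they never collide with nucleotide tokens
# protein_tok = OffsetTokenizer("path/to/peptide.model", offset=2048, banned_ids={0, 1, 2})
```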
Two primary training scripts address different tokenization schemes:
- `train_encoder.py`:
  Trains OmniBioTE using BPE tokenizers. This script accepts paths to pre-trained SentencePiece models and adjusts vocabulary offsets and banned tokens accordingly.

  Example:

  ```bash
  torchrun --nnodes=1 --nproc_per_node=4 train_encoder.py \
      --n_head 8 --n_embd 1024 --n_layer 8 \
      --mini_batch_size 2 --batch_size 1024 \
      --lr 0.05 --save_name omnbiote-small \
      --DNA_tokenizer path/to/dna.model --peptide_tokenizer path/to/peptide.model \
      --tokenizer_suffix 8k --train_type mixed
  ```

- `train_encoder-single-char.py`:
  Trains OmniBioTE using the built-in single-character tokenizer. This script directly maps nucleotides and amino acids to token IDs using a fixed vocabulary size.

  Example:

  ```bash
  torchrun --nnodes=1 --nproc_per_node=4 train_encoder-single-char.py \
      --n_head 8 --n_embd 1024 --n_layer 8 \
      --mini_batch_size 2 --batch_size 1024 \
      --lr 0.05 --save_name omnbiote-small-single \
      --train_type mixed
  ```
Both training scripts include:
- Distributed Setup: Uses `torch.distributed` for multi-GPU training with NCCL/Gloo backends.
- Gradient Accumulation: Aggregates gradients when per-GPU batch sizes are small (see the sketch after this list).
- Dynamic Checkpointing: Saves checkpoints based on token counts (using `--save_freq`) and resumes training using a `_last_step.txt` file.
- Learning Rate Scheduling: Implements a OneCycleLR scheduler with configurable warmup periods and muP scaling for optimizer consistency.
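For orientation, a stripped-down version of the gradient-accumulation loop with a OneCycleLR schedule is sketched below; the function and argument names are illustrative, and the real scripts add DDP/FSDP setup, checkpointing, and muP scaling on top of this skeleton.

```python
import torch
import torch.nn.functional as F

def train(model, loader, total_steps, lr=0.05, accum_steps=8, warmup_frac=0.02, device="cuda"):
    """Minimal gradient-accumulation loop with a OneCycleLR schedule (illustrative only).

    Assumes `model(tokens)` returns per-token logits of shape (batch, seq, vocab).
    """
    optimizer = torch.optim.AdamW(model.parameters(), lr=lr, weight_decay=0.1)
    scheduler = torch.optim.lr_scheduler.OneCycleLR(
        optimizer, max_lr=lr, total_steps=total_steps, pct_start=warmup_frac
    )
    model.train()
    optimizer.zero_grad()
    updates = 0
    for step, (tokens, labels) in enumerate(loader):
        logits = model(tokens.to(device))
        loss = F.cross_entropy(logits.view(-1, logits.size(-1)), labels.to(device).view(-1))
        (loss / accum_steps).backward()   # accumulate scaled gradients
        if (step + 1) % accum_steps == 0:
            optimizer.step()              # one optimizer update per accum_steps micro-batches
            scheduler.step()              # advance the one-cycle schedule
            optimizer.zero_grad()
            updates += 1
            if updates >= total_steps:
                break
```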
Key training parameters include:
- Batch & Data Parameters:
--batch_size,--mini_batch_size,--ctx_len,--base_dir - Model Architecture:
--n_head,--n_embd,--n_layer,--dropout,--position_encoding - Optimization:
--lr,--beta1,--beta2,--epsilon,--weight_decay,--force_lr,--token_budget,--warmup_period - Logging & Checkpointing:
--test_freq,--save_freq,--save_name,--wandb_project_name,--disable_flash - Tokenization Options (SentencePiece only):
--DNA_tokenizer,--peptide_tokenizer,--tokenizer_suffix - Other Modes:
--train_type,--FSDP,--use_padding,--compile
OmniBioTE includes a comprehensive suite of evaluation pipelines covering a range of downstream tasks. These pipelines support classification, regression, contact map prediction, and more.
Located in `evals/gue.py`, this module performs full parameter finetuning on the various GUE tasks.

Run Example:

```bash
python evals/gue.py --sp_dir path/to/sp.model --model_dir path/to/model.pt --batch_size 32
```

The `evals/TAPE.py` module is tailored for protein structure and function tasks.

Run Example:

```bash
python evals/TAPE.py --sp_dir path/to/sp.model --model_dir path/to/omnbiote-small.pt --tokenizer_offset 0
```

The tokenizer offset should be set to 2048 for mixed models, since there are 2048 nucleotide tokens and 2048 protein tokens, ordered sequentially.

The ProteinGLUE evaluation is located in `evals/ProteinGLUE.py`.

Run Example:

```bash
python evals/ProteinGLUE.py --sp_dir path/to/sp.model --model_dir path/to/omnbiote-small.pt --tokenizer_offset 2048 --output_suffix experiment1
```

We provide evaluation scripts for several baselines:
- Location: `evals/gue_lucaone.py`
- Overview: An alternative GUE evaluation pipeline that uses a custom `Alphabet` class for tokenization along with the pre-trained LucaOne model from HuggingFace. It performs finetuning with gradient accumulation and OneCycleLR scheduling while computing MCC and weighted F1 scores (see the metric sketch below).
- Run Example:

  ```bash
  python evals/gue_lucaone.py --num_accum_steps 1 --batch_size 32 --lr 0.00015625 --embed_lr 0.00015625 --head_lr 0.01 --output_suffix my_experiment
  ```
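For reference, the MCC and weighted F1 numbers these GUE pipelines report correspond to the standard definitions; they can be reproduced from saved predictions with scikit-learn (the scripts themselves may compute them differently):

```python
from sklearn.metrics import matthews_corrcoef, f1_score

y_true = [0, 1, 1, 0, 2, 1]   # toy labels; real runs use the GUE test split
y_pred = [0, 1, 0, 0, 2, 1]

mcc = matthews_corrcoef(y_true, y_pred)
weighted_f1 = f1_score(y_true, y_pred, average="weighted")
print(f"MCC={mcc:.3f}  weighted F1={weighted_f1:.3f}")
```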
- Location: `evals/TAPE_lucaone.py`
- Overview: Adapts the TAPE benchmark for protein tasks using the LucaOne model and a custom `Alphabet` tokenizer. It supports tasks such as secondary structure, remote homology, fluorescence, and stability.
- Run Example:

  ```bash
  python evals/TAPE_lucaone.py --batch_size 32 --finetuning_lr 0.0002 --output_suffix my_tape_experiment
  ```
- Location: `evals/TAPE_esm.py`
- Overview: Provides an alternative TAPE evaluation pipeline based on Facebook’s ESM models. It integrates native ESM tokenization and supports various model sizes (XS, S, M, L, XL) with a similar finetuning and evaluation routine.
- Run Example:

  ```bash
  python evals/TAPE_esm.py --model_size M --batch_size 32 --finetuning_lr 0.0001 --output_suffix esm_eval
  ```
- Location: `contact_eval.py`
- Overview: Trains a protein contact predictor for the TAPE evals using an OmniBioTE model. It processes ProteinNet datasets, adds special markers during tokenization, builds a ResNet-based contact predictor head (sketched below), and evaluates performance (precision and AUPRC) on medium- and long-range contacts.
- Run Example:

  ```bash
  python contact_eval.py --tokenizer_fn path/to/sp.model --model_fn path/to/omnbiota.pt --banned_token 2044 --tokenizer_offset 2048 --wandb_prefix "MyRun" --data_dir /path/to/data
  ```
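The actual head is defined inside `contact_eval.py`; the sketch below only illustrates the general shape of a ResNet-based contact head (pairwise features built from per-residue embeddings followed by 2-D residual blocks), with all layer sizes and names chosen for illustration:

```python
import torch
import torch.nn as nn

class ResBlock2d(nn.Module):
    """Two 3x3 convolutions with a skip connection, operating on the L x L pair representation."""
    def __init__(self, channels):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(channels, channels, 3, padding=1), nn.ReLU(),
            nn.Conv2d(channels, channels, 3, padding=1),
        )

    def forward(self, x):
        return torch.relu(x + self.net(x))

class ContactHead(nn.Module):
    """Turn per-residue embeddings (B, L, D) into contact logits (B, L, L)."""
    def __init__(self, embed_dim, head_dim=128, num_blocks=8):
        super().__init__()
        self.proj = nn.Linear(embed_dim, head_dim)
        self.blocks = nn.Sequential(*[ResBlock2d(2 * head_dim) for _ in range(num_blocks)])
        self.out = nn.Conv2d(2 * head_dim, 1, 1)

    def forward(self, h):
        h = self.proj(h)                                       # (B, L, C)
        a = h.unsqueeze(2).expand(-1, -1, h.size(1), -1)       # (B, L, L, C)
        b = h.unsqueeze(1).expand(-1, h.size(1), -1, -1)       # (B, L, L, C)
        pair = torch.cat([a, b], dim=-1).permute(0, 3, 1, 2)   # (B, 2C, L, L)
        logits = self.out(self.blocks(pair)).squeeze(1)        # (B, L, L)
        return 0.5 * (logits + logits.transpose(1, 2))         # symmetrize the contact map
```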
- Location: `contact_eval_esm.py`
- Overview: Adapts the contact evaluation procedure for Facebook’s ESM2 models using HuggingFace’s `AutoTokenizer` and `EsmModel`. A ResNet-based contact predictor head is applied to ESM2 embeddings, supporting multiple model sizes with similar training and evaluation setups (a loading sketch follows below).
- Run Example:

  ```bash
  python contact_eval_esm.py --model_size M --data_dir /path/to/data --num_accumulation_steps 128 --num_epochs 128 --head_dim 128 --num_resnet_blocks 8 --logging
  ```
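Loading an ESM-2 backbone through HuggingFace follows the standard `transformers` pattern; the checkpoint below is one of the public ESM-2 sizes and is only an example of the kind of model the `--model_size` flag selects (the exact mapping used by the script is an assumption here):

```python
import torch
from transformers import AutoTokenizer, EsmModel

tokenizer = AutoTokenizer.from_pretrained("facebook/esm2_t12_35M_UR50D")
model = EsmModel.from_pretrained("facebook/esm2_t12_35M_UR50D")

# Per-residue embeddings for one protein sequence; a contact head is trained on top of these.
inputs = tokenizer("MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQ", return_tensors="pt")
with torch.no_grad():
    embeddings = model(**inputs).last_hidden_state   # (1, seq_len + special tokens, hidden_dim)
print(embeddings.shape)
```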
- Location: `contact_eval_lucaone.py`
- Overview: Implements contact map prediction using the pre-trained LucaOne model. A custom `Alphabet` class handles tokenization and maps residue-level contacts into token space.
- Run Example:

  ```bash
  python contact_eval_lucaone.py --data_dir /path/to/data --num_accumulation_steps 128 --num_epochs 128 --head_dim 128 --num_resnet_blocks 8 --logging
  ```
- Location: `evals/pdb_contact_eval.py`
- Overview: Provides a pipeline for protein contact map prediction using peptide–nucleotide distance data from a PDB-derived JSON file (see the thresholding sketch below).
- Run Example:

  ```bash
  python evals/pdb_contact_eval.py <model_fn> <name_suffix> <lr> <embed_lr> <head_lr> <tokenizer_offset> <distance_threshold>
  ```
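As a point of reference for the `<distance_threshold>` argument, turning a residue-residue distance matrix into binary contact labels is straightforward; the exact thresholding in `evals/pdb_contact_eval.py` (units, which atoms define the distance) may differ from this sketch:

```python
import numpy as np

def contacts_from_distances(dist, threshold=8.0):
    """Binary contact map: 1 where a residue pair is closer than `threshold` (e.g. angstroms)."""
    dist = np.asarray(dist, dtype=float)
    contacts = (dist < threshold).astype(np.int8)
    np.fill_diagonal(contacts, 0)   # a residue is not its own contact
    return contacts

# Toy 3-residue example.
print(contacts_from_distances([[0.0, 5.2, 11.4], [5.2, 0.0, 7.9], [11.4, 7.9, 0.0]]))
```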
- Location: `evals/motif_selectivity.py`
- Overview: Evaluates sequence motif selectivity by comparing wild-type motifs against mutated or randomly generated variants. This module supports dual tokenization for nucleotide and protein sequences, options for introducing mutations at specified rates (see the sketch below), and cross-validation with multiple replicates per motif. It logs ΔG (free energy) differences and provides summary statistics across mutation rates.
- Run Example:

  ```bash
  python evals/motif_selectivity.py --genbank_model_path path/to/genbank.model --uniref_model_path path/to/uniref.model --banned_sequences_file path/to/banned.txt --jaspar_data_file path/to/JASPAR2024_CORE_processed.json --model_base_dir path/to/models --folds 10 --replicates 8 --output_suffix my_motif_experiment --mutation_rates 0.05,0.1,0.25,0.5
  ```
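A minimal sketch of the mutation step described above: substitute bases of a wild-type nucleotide motif at a chosen per-base rate. The real module additionally handles JASPAR motif parsing, cross-validation folds, and ΔG logging; the motif and helper below are purely illustrative.

```python
import random

NUCLEOTIDES = "ACGT"

def mutate_motif(motif, rate, rng=random):
    """Return a copy of `motif` where each base is substituted with probability `rate`."""
    out = []
    for base in motif:
        if rng.random() < rate:
            out.append(rng.choice([n for n in NUCLEOTIDES if n != base]))  # force a real change
        else:
            out.append(base)
    return "".join(out)

wild_type = "TGACGTCA"   # toy motif for illustration
for rate in (0.05, 0.1, 0.25, 0.5):
    print(rate, mutate_motif(wild_type, rate))
```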
- Location: `evals/pronab.py`
- Overview: Finetunes and evaluates OmniBioTA on pronab/mutation data using custom tokenization for nucleotide and protein sequences. The module evaluates G0 predictions using Pearson correlation and mean absolute error (MAE), computed as in the sketch below.
- Run Example:

  ```bash
  python evals/pronab.py --model_fn path/to/model.pt --nucleotide_tokenizer path/to/nuc.model --protein_tokenizer path/to/prot.model --num_accumulation_steps 256 --num_epochs 32 --lr 1e-4 --embed_lr 1e-3 --head_lr 1e-2 --nuc_banned_token 2037 --prot_banned_token 2044 --prot_offset 2048 --save_dir path/to/save --num_splits 10
  ```
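The reported numbers are the usual Pearson correlation and mean absolute error between predicted and measured values, e.g.:

```python
import numpy as np
from scipy.stats import pearsonr

predicted = np.array([-1.2, 0.4, 2.1, -0.3])   # toy predictions
measured = np.array([-1.0, 0.1, 1.8, -0.5])    # toy ground truth

r, _ = pearsonr(predicted, measured)
mae = np.mean(np.abs(predicted - measured))
print(f"Pearson r={r:.3f}  MAE={mae:.3f}")
```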
- Location: `evals/pronab-unimodal.py`
- Overview: A unimodal baseline that uses a single-omic nucleic acid model and a single-omic protein model. It mirrors the training and evaluation process of the standard Pronab Evaluation while processing inputs as a single modality.
- Run Example:

  ```bash
  python evals/pronab-unimodal.py --nuc_model_fn path/to/nuc_model.pt --protein_model_fn path/to/prot_model.pt --nucleotide_tokenizer path/to/nuc.model --protein_tokenizer path/to/prot.model --num_accumulation_steps 256 --num_epochs 32 --lr 1e-4 --embed_lr 1e-3 --head_lr 1e-2 --num_splits 10 --save_dir path/to/save
  ```
- Location: `pronab_lucaone.py`
- Overview: Adapts the pronab evaluation framework for the LucaOne model from HuggingFace.
- Run Example:

  ```bash
  python pronab_lucaone.py --num_accumulation_steps 256 --num_epochs 32 --lr 1e-4 --embed_lr 1e-4 --head_lr 1e-2 --num_splits 10 --save_dir path/to/save
  ```
These modules provide evaluation pipelines analogous to their SentencePiece-based counterparts while using OmniBioTE’s built-in single-character tokenization.
- Location: `evals/single-char/gue.py`
- Overview: Single-character counterpart of the GUE finetuning pipeline in `evals/gue.py`.
- Run Example:

  ```bash
  python evals/single-char/gue.py --model_dir path/to/model.pt --num_accum_steps 1 --batch_size 32 --lr 0.00015625 --embed_lr 0.00015625 --head_lr 0.01 --output_suffix singlechar_expt
  ```
- Location: `evals/single-char/pronab.py`
- Overview: Single-character counterpart of the Pronab evaluation in `evals/pronab.py`.
- Run Example:

  ```bash
  python evals/single-char/pronab.py --model_dir path/to/model.pt --save_dir path/to/save --num_accum_steps 256 --num_epochs 32 --lr 1e-4 --embed_lr 1e-4 --head_lr 1e-2 --num_splits 10
  ```
- Location: `evals/single-char/proteinGLUE.py`
- Overview: Single-character counterpart of the ProteinGLUE evaluation in `evals/ProteinGLUE.py`.
- Run Example:

  ```bash
  python evals/single-char/proteinGLUE.py --model_dir path/to/model.pt --output_suffix singlechar_pg_eval
  ```
- Location: `evals/single-char/TAPE.py`
- Overview: Single-character counterpart of the TAPE evaluation in `evals/TAPE.py`.
- Run Example:

  ```bash
  python evals/single-char/TAPE.py --model_dir path/to/model.pt --batch_size 32 --finetuning_lr 1e-4 --output_suffix singlechar_tape_eval
  ```
- Metrics Utilities: The `training/metrics.py` file includes functions to compute tensor statistics (mean, std, L1, L2, min, max, norm) for debugging and monitoring training progress.
- Attention Masking Functions: Custom JIT-compiled functions in the evaluation modules create block-diagonal attention masks based on `<EOS>` positions and ensure that padded regions are correctly masked (see the sketch after this list).
- Tokenizer Wrappers: Lightweight wrappers adjust token IDs (via offsets) and filter out banned tokens to ensure consistency between training and evaluation.
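To make the block-diagonal masking idea concrete, the sketch below builds such a mask from `<EOS>` positions so that tokens from different concatenated sequences (and padded positions) cannot attend to each other; the in-repo JIT-compiled versions differ in details but follow the same logic.

```python
import torch

def block_diagonal_mask(tokens, eos_id, pad_id):
    """Boolean attention mask of shape (batch, seq, seq).

    True = attention allowed. Tokens may only attend within their own segment
    (segments are delimited by `eos_id`), and padding is masked out entirely.
    """
    eos = (tokens == eos_id).long()
    # Segment index increases by one after each <EOS>; subtracting the indicator
    # assigns each <EOS> token to the segment it closes.
    segments = torch.cumsum(eos, dim=1) - eos
    same_segment = segments.unsqueeze(1) == segments.unsqueeze(2)
    not_pad = tokens != pad_id
    return same_segment & not_pad.unsqueeze(1) & not_pad.unsqueeze(2)

toks = torch.tensor([[5, 6, 2, 7, 8, 2, 0, 0]])   # toy batch: 2 = <EOS>, 0 = <PAD>
print(block_diagonal_mask(toks, eos_id=2, pad_id=0).int())
```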
OmniBioTE is a versatile transformer model tailored to the challenges of biological sequence analysis. It features robust data loading, flexible tokenization, scalable training infrastructure, and comprehensive evaluation pipelines covering a wide range of bioinformatics tasks. From pretraining on extensive sequence datasets to fine-tuning on domain-specific tasks such as protein–nucleic acid binding, contact prediction, and motif selectivity, OmniBioTE provides a complete, integrated solution for modern biosequence modeling.
Enjoy exploring the capabilities of OmniBioTE as you train and evaluate your models.