Lucas Georges Gabriel Charpentier and David Samuel
University of Oslo
Language Technology Group
Paper
HuggingFace 100M model
HuggingFace 10M model
100M Dataset
10M Dataset
We present a simple way to merge masked language modeling with causal language modeling. This hybrid training objective results in a model that combines the strengths of both modeling paradigms within a single transformer stack: GPT-BERT can be transparently used like any standard causal or masked language model. We test the pretraining process that enables this flexible behavior on the BabyLM Challenge 2024. The results show that the hybrid pretraining outperforms masked-only or causal-only models. We openly release the models, training corpora and code.
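To make the idea concrete, below is a minimal, hypothetical sketch of the hybrid objective; it is not the actual training code from `pretraining`. It assumes a generic `model(ids, causal=...)` callable that returns logits over the vocabulary, plus placeholder `MASK_ID`/`PAD_ID` constants. Roughly as described in the paper, the masked variant shifts its targets one position to the right (masked next-token prediction), so both objectives share the same output head and the same transformer weights.

```python
# Hypothetical sketch of the hybrid objective (not the repository's training code).
import torch
import torch.nn.functional as F

MASK_ID = 4   # hypothetical [MASK] token id
PAD_ID = 0    # hypothetical padding id (also used as the "ignore" label below)

def causal_loss(model, ids):
    # Standard next-token prediction: position t predicts token t+1 under a causal mask.
    logits = model(ids[:, :-1], causal=True)           # (batch, seq-1, vocab)
    return F.cross_entropy(logits.flatten(0, 1), ids[:, 1:].flatten(), ignore_index=PAD_ID)

def masked_loss(model, ids, mask_prob=0.15):
    # Masked next-token prediction: mask a subset of the input, use bidirectional attention,
    # and shift the targets one position to the right so that position t predicts the masked
    # token at t+1 -- the same output convention as the causal objective above.
    is_masked = (torch.rand(ids.shape, device=ids.device) < mask_prob) & (ids != PAD_ID)
    corrupted = ids.masked_fill(is_masked, MASK_ID)
    logits = model(corrupted[:, :-1], causal=False)     # (batch, seq-1, vocab)
    targets = ids[:, 1:].masked_fill(~is_masked[:, 1:], PAD_ID)  # score only masked positions
    return F.cross_entropy(logits.flatten(0, 1), targets.flatten(), ignore_index=PAD_ID)

def hybrid_step(model, causal_batch, masked_batch):
    # One update sees both objectives; the ratio of causal to masked batches is a hyperparameter.
    return causal_loss(model, causal_batch) + masked_loss(model, masked_batch)
```

Because both losses are computed by the same transformer stack, the resulting checkpoint can later be queried either causally (left-to-right generation) or bidirectionally (masked infilling).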
This is the official repository for our BabyLM 2024 submission: GPT-BERT.
Completed files/folders:
- data
- model_checkpoints
- tokenizers
- configs
- tokenizer_creation
- pretraining
- corpus_tokenization
Incomplete files/folders:
- evaluation
- `./tokenizer_creation/`: Contains scripts for creating a tokenizer.
- `./corpus_tokenization/`: Contains scripts to tokenize a corpus.
- `./pretraining/`: Contains scripts to pre-train a model, as well as the model file itself, utils, optimizers, and the PyTorch datasets.
- `./evaluation/`: Contains folders for each benchmark evaluated in the paper. Each folder contains scripts for fine-tuning (when relevant) and inference, as well as a data folder containing the benchmark's data.
- `./data/`: Folder containing the raw, preprocessed, and tokenized data for pretraining.
- `./tokenizers/`: Folder containing the tokenizers created for, or needed by, pretraining.
- `./configs/`: Folder containing the configuration files for the models.
- `./model_checkpoints/`: Folder containing the pre-trained model checkpoints.
This is a general guide to pretraining the model; each subfolder contains a README detailing which files to run and what they do.
- (optional) If you do not have a tokenizer, or want to create a custom one, run the script(s) found in `tokenizer_creation`. The created tokenizers will be saved in `tokenizers` (unless otherwise specified).
- To tokenize the corpus, run the script in `corpus_tokenization`. The tokenized data will be saved in the `data` folder (unless otherwise specified). We tokenize before training for efficiency; if this is not wanted, the code in `pretraining` (specifically the `dataset.py` file) will need to be adapted. A minimal tokenization sketch is given after this list.
- Create a config file for your model in the same style as the ones found in the `configs` folder. Otherwise, choose one of the pre-created ones.
- To pre-train your model, run one of the `train_*.py` scripts found in the `pretraining` folder. (More details can be found in the folder itself.)
- (optional) If you want to evaluate your model on the evaluations used in the paper, the tasks and the code to run them can be found in `evaluation`. Note: so that each part can be used independently of the others, the model file is also included in each benchmark folder.
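As a concrete illustration of the tokenization step, here is a minimal sketch; it is not the actual `corpus_tokenization` script. It assumes a HuggingFace `tokenizers` tokenizer file at `tokenizers/tokenizer.json`, a raw corpus with one document per line at `data/corpus.txt`, and a pretraining dataset that can read a flat binary file of token ids; the paths and output format are placeholders, not the repository's actual conventions.

```python
# Hypothetical offline-tokenization sketch (assumed paths and output format).
import numpy as np
from tokenizers import Tokenizer

tokenizer = Tokenizer.from_file("tokenizers/tokenizer.json")  # assumed tokenizer path

ids = []
with open("data/corpus.txt", encoding="utf-8") as f:  # assumed raw corpus, one document per line
    for line in f:
        ids.extend(tokenizer.encode(line.strip()).ids)

# Store as a compact binary file that a dataloader could memory-map during training
# (uint16 assumes a vocabulary smaller than 2^16).
np.array(ids, dtype=np.uint16).tofile("data/corpus_tokenized.bin")
print(f"Wrote {len(ids):,} tokens to data/corpus_tokenized.bin")
```

Tokenizing once up front like this avoids re-running the tokenizer every epoch, which is the efficiency argument mentioned in the second step above.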
@misc{charpentier2024gptbertboth,
title={GPT or BERT: why not both?},
author={Lucas Georges Gabriel Charpentier and David Samuel},
year={2024},
eprint={2410.24159},
archivePrefix={arXiv},
primaryClass={cs.CL},
url={https://arxiv.org/abs/2410.24159},
}