Demonstration code to create a GPT-2-style, decoder-only, generative small LAnguage Model (sLAM) that can be built on a personal computer.
This is not for production. You can use this code to learn about generative language models, preprocessing, training, and model hyperparameters.
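For orientation, a GPT-2-style decoder-only model is a stack of causally masked self-attention blocks. Here is a minimal Keras sketch of one such block; the layer names and sizes are illustrative placeholders, not the exact code in this repository:

import tensorflow as tf
from tensorflow.keras import layers

def decoder_block(x, d_model=256, num_heads=4, d_ff=1024):
    # Causal self-attention: each token attends only to itself and earlier tokens
    attn = layers.MultiHeadAttention(num_heads=num_heads, key_dim=d_model // num_heads)
    h = attn(x, x, use_causal_mask=True)
    x = layers.LayerNormalization()(x + h)  # residual connection + layer norm
    # Position-wise feed-forward network
    h = layers.Dense(d_ff, activation="gelu")(x)
    h = layers.Dense(d_model)(h)
    return layers.LayerNormalization()(x + h)  # residual connection + layer norm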
git clone [email protected]:bioteam/sLAM.git
cd sLAM
pip3 install .
Complete the installation:
> python3
>>> import nltk
>>> nltk.download('punkt_tab')
nltk is used for sentence tokenization.
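For example, punkt-based sentence tokenization splits raw text into sentences before further processing:

from nltk.tokenize import sent_tokenize

text = "This is one sentence. Here is another."
print(sent_tokenize(text))
# ['This is one sentence.', 'Here is another.']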
> python3 sLAM/make-slam.py -h
usage: make-slam.py [-h] [-t TEXT_PERCENTAGE] [-m MIN_SENTENCE_LEN] [--context_size CONTEXT_SIZE] [-n NAME]
[--temperature TEMPERATURE] [--epochs EPOCHS] [--d_model D_MODEL] [-d DOWNLOAD] [--num_rows NUM_ROWS]
[--use_mlflow] -p PROMPT [-v]
options:
-h, --help show this help message and exit
-t TEXT_PERCENTAGE, --text_percentage TEXT_PERCENTAGE
Percentage of download used to make dataset
-m MIN_SENTENCE_LEN, --min_sentence_len MIN_SENTENCE_LEN
Minimum sentence length used to make the dataset
-n NAME, --name NAME Name used to save files, default is timestamp of start time
--temperature TEMPERATURE
Temperature used for generation
--epochs EPOCHS Number of epochs
--d_model D_MODEL Model (embedding) dimension
--context_size CONTEXT_SIZE
Context size
-d DOWNLOAD, --download DOWNLOAD
Dataset to download. Default is cc_news.
--num_rows NUM_ROWS Number of rows to download from cc_news
--use_mlflow Use MLflow for model tracking
-p PROMPT, --prompt PROMPT
Prompt
-v, --verbose Verbose
The code uses cc_news (the default) or wikitext-2-v1 from Hugging Face as training text.
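Downloading these from Hugging Face is typically done with the datasets library. A rough sketch, assuming the "text" column of cc_news (the exact cleaning steps in make-slam.py may differ):

from datasets import load_dataset

# Grab the first 500 rows of cc_news, matching --num_rows 500 in the example below
ds = load_dataset("cc_news", split="train[:500]")
texts = [row["text"] for row in ds]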
Download and clean training data from cc_news, tokenize it into large chunks, create a model, train the model using context-window-sized slices for 3 epochs, be verbose, and try the given prompt:
python3 sLAM/make-slam.py --num_rows 500 -v --epochs 3 -p "This is a test"
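The context-window slicing works roughly like this; a minimal sketch assuming next-token prediction (the function and parameter names are illustrative):

import numpy as np

def make_windows(token_ids, context_size=128):
    # Each input window is paired with the same window shifted right by one token,
    # so the model learns to predict the next token at every position
    X, y = [], []
    for i in range(len(token_ids) - context_size):
        X.append(token_ids[i : i + context_size])
        y.append(token_ids[i + 1 : i + context_size + 1])
    return np.array(X), np.array(y)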
This creates a Keras model (trained on ~1M input tokens), a saved (serialized) tokenizer with the same name, and a histogram of sentence lengths, for example:
-rw-r--r-- 332M Apr 1 05:09 04-01-2025-05-09-04.keras
-rw-r--r-- 58K Apr 1 05:09 04-01-2025-05-09-04.pickle
-rw-r--r-- 19K Mar 31 16:04 sentence_length_distribution.png
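These are the two artifacts that generate.py later loads by name; roughly (assuming the tokenizer is a pickled Python object):

import pickle
import tensorflow as tf

name = "04-01-2025-05-09-04"
model = tf.keras.models.load_model(name + ".keras")
with open(name + ".pickle", "rb") as f:
    tokenizer = pickle.load(f)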
One epoch takes about an hour on a Mac M1 laptop (32 GB RAM) with the command above. However, much more training text than that is needed before the model generates syntactically and semantically correct English.
To generate text, supply the name shared by the model and the serialized tokenizer, plus a prompt:
python3 sLAM/generate.py -n 04-01-2025-05-09-04 -p "This is a test"
This is a test if your favorite software is the news service for the bottom of the
increasing equipment market is actually plans for their concerns and the narrative
of the same time i think it was the course of the technology is that the 5th us and
i think what we are the most youre doing it we do to do that you want what to avoid
the first amendment and other candidates are not just as the most.
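The --temperature option controls how the next token is sampled from the model's output distribution. A sketch of the standard technique (not necessarily the exact code in generate.py):

import numpy as np

def sample_next_token(logits, temperature=1.0):
    # Dividing logits by temperature <1 sharpens the distribution (more predictable);
    # temperature >1 flattens it (more random)
    logits = np.asarray(logits, dtype=np.float64) / temperature
    probs = np.exp(logits - logits.max())  # numerically stable softmax
    probs /= probs.sum()
    return np.random.choice(len(probs), p=probs)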
One of the challenges in writing and running Deep Learning code is how many components there are, and how quickly new versions replace old versions. To get all your component versions aligned, start with your computer, which may have a GPU. For example, if it's an NVIDIA GPU, what is the recommended version of CUDA? From that CUDA version, find the recommended version of TensorFlow or PyTorch. Then, for that package version, find the required version of Python. An example set of versions working with an older NVIDIA GPU:
RTX 5000 + CUDA 11.8 + TensorFlow 2.17 + Python 3.8
Then the Python dependencies will follow from the Python version.
Getting these versions aligned is critical: mismatched versions can produce errors that never mention versions at all and are difficult to debug, such as out-of-memory or data shape errors.
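Before debugging a mysterious error, it helps to print what you actually have. These TensorFlow calls report the interpreter, framework, and linked CUDA/cuDNN versions:

import sys
import tensorflow as tf

print("Python:", sys.version)
print("TensorFlow:", tf.__version__)
print("GPUs:", tf.config.list_physical_devices("GPU"))
# On GPU builds, the build info includes cuda_version and cudnn_version
print("Build info:", tf.sysconfig.get_build_info())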
Containers may be available that package all the right versions, e.g. CUDA and Python with some framework. In this example we're computing at the Texas Advanced Computing Center (TACC) and pulling a TensorFlow GPU container:
srun -N 1 -n 10 -p rtx-dev -t 60:00 --pty bash
module load tacc-apptainer
apptainer pull docker://tensorflow/tensorflow:2.17.0-gpu
Or just:
docker pull tensorflow/tensorflow:2.17.0-gpu
Then you can run your script with singularity:
singularity exec --nv tensorflow_2.17.0-gpu.sif python3 scripts/mnist_convnet.py