This repository is for EleutherAI's Semantic Memorization project, which defines a taxonomy for memorized sequences based on the factors that influence memorization. For details on how the likelihood of a sequence being memorized depends on its taxonomic category, please see our paper *Recite, Reconstruct, Recollect: Memorization in LMs as a Multifaceted Phenomenon*.
Memorization in language models is typically treated as a homogeneous phenomenon, neglecting the specifics of the memorized data. We instead model memorization as the effect of a set of complex factors that describe each sample and relate it to the model and corpus. To build intuition around these factors, we break memorization down into a taxonomy: recitation of highly duplicated sequences, reconstruction of inherently predictable sequences, and recollection of sequences that are neither. We demonstrate the usefulness of our taxonomy by using it to construct a predictive model for memorization. By analyzing dependencies and inspecting the weights of the predictive model, we find that different factors influence the likelihood of memorization differently depending on the taxonomic category.
Our taxonomy, illustrated above, defines three types of LM memorization based on colloquial descriptions of human memorization. Humans recite direct quotes that they commit to memory through repeated exposure, so LMs recite highly duplicated sequences. Humans reconstruct a passage by remembering a general pattern and filling in the gaps, so LMs reconstruct inherently predictable boilerplate templates. Humans sporadically recollect an episodic memory or fragment after a single exposure, so LMs recollect other sequences seen rarely during training.
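The taxonomy above can be read as a simple decision rule. The sketch below is purely illustrative: the duplicate-count threshold and the feature names are assumptions for exposition, not the paper's fitted values.

```python
# Illustrative sketch only: a toy decision rule mirroring the taxonomy above.
# The threshold (10) and the `is_templated` feature are assumptions, not the
# paper's actual criteria.
def taxonomic_category(duplicate_count: int, is_templated: bool) -> str:
    """Assign a memorized sequence to one of the three taxonomic categories."""
    if duplicate_count > 10:   # highly duplicated sequences are recited
        return "recitation"
    if is_templated:           # predictable boilerplate is reconstructed
        return "reconstruction"
    return "recollection"      # rarely seen, otherwise unremarkable sequences
```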
To train a natural language vs. code classifier, we used Hugging Face's training pipeline on randomly sampled, equally weighted subsets of BookCorpus and github-code. The following hyperparameters were used during training:
- learning_rate: 1e-07
- train_batch_size: 256
- eval_batch_size: 1024
- seed: 42
- optimizer: Adam with betas=(0.9,0.999) and epsilon=1e-08
- lr_scheduler_type: linear
- training_steps: 1000
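As a sketch, the hyperparameters above map onto Hugging Face `TrainingArguments` keyword arguments roughly as follows (keyword names follow the `transformers` API; the model and dataset setup is omitted, and the Adam betas/epsilon listed are the library defaults):

```python
# Sketch: the listed hyperparameters expressed as TrainingArguments kwargs.
# These would be passed as TrainingArguments(output_dir=..., **training_kwargs).
training_kwargs = dict(
    learning_rate=1e-7,
    per_device_train_batch_size=256,
    per_device_eval_batch_size=1024,
    seed=42,
    lr_scheduler_type="linear",
    max_steps=1000,
    adam_beta1=0.9,     # Adam betas/epsilon as listed above
    adam_beta2=0.999,   # (these are also the transformers defaults)
    adam_epsilon=1e-8,
)
```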
Following this, we used this script to compute the probability of a sequence being memorized.
To replicate the duplication results, run the following scripts in sequence:
- Script saving sequence hashes, which saves a hash of every 32-gram sequence in the Pile
- Script saving zero-offset hashes, which saves hashes of only the required offset (32 in our case)
- Script saving approximate duplicates, based on hashes. This produces a single NumPy file storing the hashes and sequence IDs of all sequences whose hashes match at least one zero-offset sequence's hash
- Script calculating exact duplicates, which compares each sequence against all sequences sharing its hash to get an exact duplicate count
- This yields a list of true counts, which you can combine using this script
- An already processed list of sequence IDs with their duplicate counts is available for the standard and deduped datasets.
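The hashing step above can be sketched as follows. This is a minimal illustration, assuming a tokenized document and a generic hash function; the actual scripts handle offsets and on-disk storage. Matching hashes only flag *approximate* duplicates, which is why the later exact-comparison pass is needed:

```python
# Sketch: hash every 32-gram window of a tokenized document so that
# duplicate windows collide on the same hash value.
import hashlib

def ngram_hashes(tokens, n=32):
    """Yield (offset, hash) for every length-n window of a token sequence."""
    for i in range(len(tokens) - n + 1):
        window = " ".join(map(str, tokens[i : i + n]))
        yield i, hashlib.md5(window.encode()).hexdigest()

tokens = list(range(40))            # stand-in for a tokenized Pile document
hashes = dict(ngram_hashes(tokens)) # 40 - 32 + 1 = 9 windows
```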
To replicate the semantic and textual match filters, run the following scripts in sequence:
- Create sentence embeddings for the various datasets with this script
- Compute semantic filter counts with this script
- Compute textual match counts with this script. For textual matches, we also need to create query-only sentences for each partition, since this filter compares the Levenshtein distance between queries. This can be achieved with this script.
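The Levenshtein distance used for textual matches is the standard edit distance; a minimal dynamic-programming version is sketched below (the actual scripts may use an optimized library implementation):

```python
# Sketch: classic dynamic-programming Levenshtein edit distance between
# two strings, row by row to keep memory at O(len(b)).
def levenshtein(a: str, b: str) -> int:
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                 # deletion
                            curr[j - 1] + 1,             # insertion
                            prev[j - 1] + (ca != cb)))   # substitution
        prev = curr
    return prev[-1]
```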
To replicate the token frequency results, run this script. The full list of token frequencies can be found on Hugging Face for the standard and deduped datasets.
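Counting token frequencies amounts to tallying token IDs across the corpus; a minimal sketch (the real run streams the actual Pile token IDs rather than the toy lists here):

```python
# Sketch: corpus-wide token frequency counts with collections.Counter.
from collections import Counter

def token_frequencies(documents):
    """Count occurrences of each token ID across all documents."""
    counts = Counter()
    for doc in documents:  # each doc is a list of token IDs
        counts.update(doc)
    return counts

freqs = token_frequencies([[1, 2, 2, 3], [2, 3, 3]])
```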
To combine all the existing filters, run the combine metrics script. You will need to set up an appropriate JDK and install all requirements to run the script. Filter results can be found on this Hugging Face dataset.
Note: The filters for templating (incrementing and repeating) as well as Huffman coding length are calculated while the filters are combined.
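Huffman coding length measures how compressible a sequence is under its own token distribution. The sketch below builds a Huffman tree over a sequence's token counts and sums depth × count; it is an illustrative implementation, not necessarily the one used in the combine step:

```python
# Sketch: total Huffman-encoded bit length of a token sequence, using a
# Huffman code built from the sequence's own token frequencies.
import heapq
from collections import Counter

def huffman_coding_length(tokens) -> int:
    counts = Counter(tokens)
    if len(counts) == 1:          # degenerate case: single symbol, 1 bit each
        return len(tokens)
    # Heap items: (total count, tie-breaker id, {symbol: code depth}).
    heap = [(c, i, {s: 0}) for i, (s, c) in enumerate(counts.items())]
    heapq.heapify(heap)
    next_id = len(heap)
    while len(heap) > 1:
        c1, _, d1 = heapq.heappop(heap)      # merge the two rarest subtrees;
        c2, _, d2 = heapq.heappop(heap)      # every symbol in them gains depth 1
        merged = {s: d + 1 for s, d in {**d1, **d2}.items()}
        heapq.heappush(heap, (c1 + c2, next_id, merged))
        next_id += 1
    depths = heap[0][2]
    return sum(depths[s] * c for s, c in counts.items())
```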
To train the taxonomic model and launch the greedy taxonomic search, run this script.
- To replicate the results on taxonomic model performance and the plots of model weights, refer to this notebook.
- For results on correlation coefficients, refer to this notebook.
- For the plot of optimal thresholds for the code classifier, refer to this notebook.
@article{prashanth2024recite,
title={Recite, Reconstruct, Recollect: Memorization in LMs as a Multifaceted Phenomenon},
author={Prashanth, USVSN Sai and Deng, Alvin and O'Brien, Kyle and SV, Jyothir and Khan, Mohammad Aflah and Borkar, Jaydeep and Choquette-Choo, Christopher A and Fuehne, Jacob Ray and Biderman, Stella and Ke, Tracy and Lee, Katherine and Saphra, Naomi},
journal={arXiv preprint arXiv:2406.17746},
year={2024}
}