This repo contains arXiv source data and the associated code for preprocessing, labeling, and partitioning it.
The source data are under data/source as gzipped JSONL files.
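To inspect the source data outside the pipeline, a gzipped JSONL file can be read one line at a time. The following is a minimal sketch, not code from this repo: the helper name iter_jsonl_gz is hypothetical, and it assumes each non-empty line holds one UTF-8 encoded JSON value.

```python
import glob
import gzip
import json
from typing import Any, Iterator


def iter_jsonl_gz(pattern: str) -> Iterator[Any]:
    """Yield one parsed JSON value per non-empty line of every file matching pattern.

    Hypothetical helper for illustration; assumes UTF-8 text, one JSON value per line.
    """
    # Sort the matches so files are read in a stable order.
    for path in sorted(glob.glob(pattern)):
        # "rt" opens the gzip stream in text mode, so lines arrive as str.
        with gzip.open(path, "rt", encoding="utf-8") as handle:
            for line in handle:
                line = line.strip()
                if line:  # skip blank lines
                    yield json.loads(line)
```

For example, next(iter_jsonl_gz('data/source/*.jsonl.gz')) would return the first record of the first matching file. The structure of each record depends on the data and is not assumed here.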
After setting up a Python environment, run:

    python runner.py 'data/source/arxiv-data-20200125-split*.jsonl.gz'

The result will be a preprocessed corpus under data/processed and various partitions and samples for training under data/train.