The aim of this project is to train a model for effective citation recommendation using papers from AI conferences.
We will be using web scraped PDFs from the following conferences:
- aaai
- acl
- aistats
- colt
- cvpr
- eccv
- emnlp
- iccv
- iclr
- icml
- ijcai
- jmlr
- naacl
- neurips
- uai
The data is assumed to follow this structure:
/mnt/data/Draft2Paper/data
|- grobid/                   # scripts to set up and run GROBID
|- processed_pdfs/           # PDFs processed by GROBID into TEI XML format
|- output/                   # output of scripts from src
|  |- llm_prompts/           # stores prompts that will be executed by the LLM
|  |- llm_output/            # stores output of the LLM
|  |- hetero_data/           # files generated by generate-hetero-data.py
|  |- sampler/               # files generated by sampler.py
|  |- train/                 # files generated by train.py and stage2_train.py
|  |- papers.parquet         # file generated by id-papers.py
|  |- scibert_embeddings.pt  # file generated by generate-scibert-embeddings.py
|- src/                      # location of scripts
- run-grobid.sh
- Runs two full-image GROBID Docker containers, following https://grobid.readthedocs.io/en/latest/Run-Grobid/
- One Docker container per RTX 4090 GPU on our system
- grobid-script.py
- Goes through the directory structure in /mnt/data/data/papers and processes all of the PDFs there
- Splits PDF processing across the two docker containers that are running GROBID
- Stores the GROBID output at processed_pdfs/{venue}/{year}/{file}.grobid.tei.xml, matching the structure of /mnt/data/data/papers (a sketch of this script follows below)
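For illustration, a minimal sketch of how grobid-script.py could dispatch PDFs across the two containers via GROBID's REST API. The host ports (8070/8071), the round-robin dispatch, and the exact paths are assumptions:

    # Sketch only: send each PDF to one of two GROBID containers and mirror the
    # {venue}/{year}/{file} layout of the input directory.
    from pathlib import Path
    import requests

    GROBID_URLS = [
        "http://localhost:8070/api/processFulltextDocument",  # container pinned to GPU 0
        "http://localhost:8071/api/processFulltextDocument",  # container pinned to GPU 1
    ]
    PAPERS_DIR = Path("/mnt/data/data/papers")
    OUT_DIR = Path("/mnt/data/Draft2Paper/data/processed_pdfs")

    for i, pdf in enumerate(sorted(PAPERS_DIR.rglob("*.pdf"))):
        url = GROBID_URLS[i % len(GROBID_URLS)]        # alternate between the two containers
        with open(pdf, "rb") as f:
            resp = requests.post(url, files={"input": f}, timeout=300)
        resp.raise_for_status()
        rel = pdf.relative_to(PAPERS_DIR)              # venue/year/file.pdf
        out_path = OUT_DIR / rel.parent / (pdf.stem + ".grobid.tei.xml")
        out_path.parent.mkdir(parents=True, exist_ok=True)
        out_path.write_text(resp.text, encoding="utf-8")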
1. Running ./run-grobid.sh
When running sudo ./run-grobid.sh, check whether the Docker containers are running with sudo docker ps. If no containers are displayed, try running one of the docker commands from the script on its own by copying and pasting it into the terminal. You may see an error message like:
docker: Error response from daemon: could not select device driver "" with capabilities: [[gpu]]
If this is the case, a potential fix is to follow the Ubuntu installation instructions at https://docs.nvidia.com/datacenter/cloud-native/container-toolkit/latest/install-guide.html
and then run the following to reconfigure Docker and restart it:
sudo nvidia-ctk runtime configure --runtime=docker
sudo systemctl restart docker
2. GROBID Output
GROBID processing is not perfect. It can produce citations that do not actually exist, miss real citations, and fail to properly associate in-text citations with their bibliography entries.
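For reference, a minimal sketch of how bibliography entries and in-text citation links can be read from a GROBID TEI file. The input path is illustrative; the tags below are the standard ones GROBID emits, but the fields present vary per paper:

    # Sketch only: read bibliography entries and in-text citation markers from
    # a GROBID TEI XML file.
    from lxml import etree

    TEI = {"tei": "http://www.tei-c.org/ns/1.0"}
    tree = etree.parse("processed_pdfs/acl/2020/example.grobid.tei.xml")

    # Bibliography entries appear as <biblStruct xml:id="b0">, <biblStruct xml:id="b1">, ...
    for bibl in tree.findall(".//tei:listBibl/tei:biblStruct", TEI):
        ref_id = bibl.get("{http://www.w3.org/XML/1998/namespace}id")   # e.g. "b0"
        title = bibl.findtext(".//tei:title", default="", namespaces=TEI)
        print(ref_id, title)

    # In-text citations point at those ids via <ref type="bibr" target="#b0">;
    # missing or wrong target attributes are one source of mis-associated citations.
    markers = tree.findall(".//tei:ref[@type='bibr']", TEI)
    print("in-text citation markers:", len(markers))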
- generate-citation-llm-prompts.py
- Generates up to two LLM prompts for each valid citation under every processed paper
- Outputs prompts under llm_prompts/{venue}/{year}/{file}/ref_b{number}stage{1 or 2}.txt
- run-llm-on-prompts.py
- Runs prompts generated by previous script
- LLM outputs are stored under llm_output/{venue}/{year}/{file}.csv
- stage3.py
- Since stage 3 was not produced by the previous scripts, this is run afterwards to replace all .csv files with the stage 3 result
- The stage 3 result is based on citation frequency: if the frequency is >= 3, the output is the frequency; otherwise the output is 0 (see the sketch below)
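As a one-line illustration of that rule (the function name is just for illustration):

    # Stage 3 rule: keep the citation frequency only when it is at least 3.
    def stage3_label(frequency: int) -> int:
        return frequency if frequency >= 3 else 0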
- config.py
- Not intended to be run by itself; it defines which files to filter for generate-hetero-data.py and later training
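The real filter values live in config.py; purely as a hypothetical illustration of the kind of settings it might define:

    # Hypothetical illustration only -- config.py selects which files are kept
    # for generate-hetero-data.py and later training. Names and values are assumed.
    VENUES = ["acl", "emnlp", "naacl", "neurips"]   # assumed venue filter
    YEARS = range(2015, 2024)                       # assumed year filter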
- id-papers.py
- Generates a DataFrame stored in output/papers.parquet
- Each paper is given a unique ID and contains references to papers that exist in our dataset
- A reference is considered a match based on:
- whether a paper in our dataset has a title that is at least a 90% match with the reference's title
- whether that paper was published within +/- 1 year of the reference's date (sketched below)
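A minimal sketch of that matching rule, using difflib for the 90% title similarity (the actual script may use a different similarity measure):

    # Sketch of the reference-matching rule: >= 90% title similarity and a
    # publication year within +/- 1 of the reference's year.
    from difflib import SequenceMatcher

    def is_match(ref_title: str, ref_year: int, paper_title: str, paper_year: int) -> bool:
        similarity = SequenceMatcher(None, ref_title.lower(), paper_title.lower()).ratio()
        return similarity >= 0.90 and abs(paper_year - ref_year) <= 1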
- generate-hetero-data.py
- Generates all files needed for training based on config.py
- Stores files in output/hetero_data/
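Assuming the heterogeneous graph is built with PyTorch Geometric (an assumption based on the hetero_data naming), the generated files could resemble something like:

    # Sketch only, assuming PyTorch Geometric: a heterogeneous graph with paper
    # nodes and cites edges. Filenames, shapes, and feature source are assumed.
    import torch
    from torch_geometric.data import HeteroData

    data = HeteroData()
    data["paper"].x = torch.randn(4, 768)                      # stand-in for SciBERT features
    data["paper", "cites", "paper"].edge_index = torch.tensor(
        [[0, 0, 2],                                            # citing paper ids
         [1, 3, 1]],                                           # cited paper ids
        dtype=torch.long,
    )
    torch.save(data, "output/hetero_data/hetero_graph.pt")     # hypothetical filename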
- generate-scibert-embeddings.py
- Generates SciBERT embeddings for sampler.py and stores them under output/scibert_embeddings.pt
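A minimal sketch of generating and saving SciBERT embeddings with Hugging Face transformers (the pooling choice and input texts are assumptions):

    # Sketch only: embed paper texts with SciBERT and save the tensor for sampler.py.
    import torch
    from transformers import AutoModel, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained("allenai/scibert_scivocab_uncased")
    model = AutoModel.from_pretrained("allenai/scibert_scivocab_uncased").eval()

    texts = ["Attention Is All You Need", "Deep Residual Learning for Image Recognition"]
    with torch.no_grad():
        batch = tokenizer(texts, padding=True, truncation=True, max_length=512, return_tensors="pt")
        embeddings = model(**batch).last_hidden_state[:, 0, :]   # [CLS] vector per paper

    torch.save(embeddings, "output/scibert_embeddings.pt")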
- sampler.py
- Generates positive and negative samples to train on
- Splits the data into train/test/validation sets and stores them in output/sampler/
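A minimal sketch of the sampling and splitting step (the split ratios, negative-sampling strategy, and names are assumptions):

    # Sketch only: label cited pairs as positives, draw random non-cited pairs
    # as negatives, then split 80/10/10 into train/validation/test.
    import random

    def build_splits(citation_pairs, num_papers, neg_per_pos=1, seed=0):
        rng = random.Random(seed)
        cited = set(citation_pairs)
        samples = [(src, dst, 1) for src, dst in citation_pairs]            # positives
        while len(samples) < len(citation_pairs) * (1 + neg_per_pos):       # negatives
            src, dst = rng.randrange(num_papers), rng.randrange(num_papers)
            if src != dst and (src, dst) not in cited:
                samples.append((src, dst, 0))
        rng.shuffle(samples)
        n = len(samples)
        return samples[: int(0.8 * n)], samples[int(0.8 * n): int(0.9 * n)], samples[int(0.9 * n):]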
- train.py
- Runs stage 1 of training and stores results under output/train/
- stage2_train.py
- Runs stage 2 of training and stores results under output/train/
- inference.py
- Runs analysis on the trained model
We initially tried using OpenAlex to retrieve metadata and citation information, but OpenAlex has two big limitations:
- It does not offer an already-processed full-text dataset like Semantic Scholar's S2ORC
- It was missing a significant number of papers from the AI conferences we wanted to cover
Before fully committing to web scraping, we also tried Semantic Scholar, but:
- It was missing a significant number of full text papers compared to our web scraping solution