The aim of this project is to train a model for effective citation recommendation using papers from AI conferences.
We will be using web scraped PDFs from the following conferences:
- aaai
- acl
- aistats
- colt
- cvpr
- eccv
- emnlp
- iccv
- iclr
- icml
- ijcai
- jmlr
- naacl
- neurips
- uai
The data is assumed to follow this structure:
/mnt/data/Draft2Paper/data
|- grobid/                   # scripts to set up and run GROBID
|- processed_pdfs/           # PDFs processed by GROBID into TEI XML format
|- output/                   # output of scripts from src
|  |- llm_prompts/           # stores prompts that will be executed by the LLM
|  |- llm_output/            # stores output of the LLM
|  |- hetero_data/           # files generated by generate-hetero-data.py
|  |- sampler/               # files generated by sampler.py
|  |- train/                 # files generated by train.py and stage2_train.py
|  |- papers.parquet         # file generated by id-papers.py
|  |- scibert_embeddings.pt  # file generated by generate-scibert-embeddings.py
|- src/                      # location of scripts
- run-grobid.sh
- Runs two full-image GROBID Docker containers, following https://grobid.readthedocs.io/en/latest/Run-Grobid/
- One Docker container per RTX 4090 GPU on our system
- grobid-script.py
- Goes through the directory structure in /mnt/data/data/papers and processes all of the PDFs there
- Splits PDF processing across the two docker containers that are running GROBID
- Stores the GROBID output at processed_pdfs/{venue}/{year}/{file}.grobid.tei.xml, matching the structure of /mnt/data/data/papers (a sketch of this script follows below)
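For illustration, a minimal sketch of how grobid-script.py could dispatch PDFs across the two containers via GROBID's REST API. The host ports (8070/8071), the round-robin dispatch, and the exact paths are assumptions:

    # Sketch only: send each PDF to one of two GROBID containers and mirror the
    # {venue}/{year}/{file} layout of the input directory.
    from pathlib import Path
    import requests

    GROBID_URLS = [
        "http://localhost:8070/api/processFulltextDocument",  # container pinned to GPU 0
        "http://localhost:8071/api/processFulltextDocument",  # container pinned to GPU 1
    ]
    PAPERS_DIR = Path("/mnt/data/data/papers")
    OUT_DIR = Path("/mnt/data/Draft2Paper/data/processed_pdfs")

    for i, pdf in enumerate(sorted(PAPERS_DIR.rglob("*.pdf"))):
        url = GROBID_URLS[i % len(GROBID_URLS)]        # alternate between the two containers
        with open(pdf, "rb") as f:
            resp = requests.post(url, files={"input": f}, timeout=300)
        resp.raise_for_status()
        rel = pdf.relative_to(PAPERS_DIR)              # venue/year/file.pdf
        out_path = OUT_DIR / rel.parent / (pdf.stem + ".grobid.tei.xml")
        out_path.parent.mkdir(parents=True, exist_ok=True)
        out_path.write_text(resp.text, encoding="utf-8")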
1. Running ./run-grobid.sh
When running sudo ./run-grobid.sh, check whether the Docker containers are running with sudo docker ps. If no containers are displayed, try running one of the docker commands from the script on its own by copying and pasting it into the terminal. You may see an error message like:
docker: Error response from daemon: could not select device driver "" with capabilities: [[gpu]]
If this is the case, a potential fix is to follow the Ubuntu installation instructions at https://docs.nvidia.com/datacenter/cloud-native/container-toolkit/latest/install-guide.html
and then run the following to reconfigure Docker and restart it:
sudo nvidia-ctk runtime configure --runtime=docker
sudo systemctl restart docker
2. GROBID Output
GROBID processing is not perfect. It can produce citations that do not actually exist, miss real citations, and fail to properly associate in-text citations with their bibliography entries.
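For reference, a minimal sketch of how bibliography entries and in-text citation links can be read from a GROBID TEI file. The input path is illustrative; the tags below are the standard ones GROBID emits, but the fields present vary per paper:

    # Sketch only: read bibliography entries and in-text citation markers from
    # a GROBID TEI XML file.
    from lxml import etree

    TEI = {"tei": "http://www.tei-c.org/ns/1.0"}
    tree = etree.parse("processed_pdfs/acl/2020/example.grobid.tei.xml")

    # Bibliography entries appear as <biblStruct xml:id="b0">, <biblStruct xml:id="b1">, ...
    for bibl in tree.findall(".//tei:listBibl/tei:biblStruct", TEI):
        ref_id = bibl.get("{http://www.w3.org/XML/1998/namespace}id")   # e.g. "b0"
        title = bibl.findtext(".//tei:title", default="", namespaces=TEI)
        print(ref_id, title)

    # In-text citations point at those ids via <ref type="bibr" target="#b0">;
    # missing or wrong target attributes are one source of mis-associated citations.
    markers = tree.findall(".//tei:ref[@type='bibr']", TEI)
    print("in-text citation markers:", len(markers))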
- generate-citation-llm-prompts.py
- Generates up to two LLM prompts for each valid citation under every processed paper
- Outputs prompts under llm_prompts/{venue}/{year}/{file}/ref_b{number}stage{1 or 2}.txt
- run-llm-on-prompts.py
- Runs prompts generated by previous script
- LLM outputs are stored under llm_output/{venue}/{year}/{file}.csv
- stage3.py
- Since stage 3 was not produced by the previous scripts, this is run afterwards to replace all .csv files with the stage 3 result
- The stage 3 result is based on citation frequency: if the frequency is >= 3, the output is the frequency; otherwise the output is 0 (see the sketch below)
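As a one-line illustration of that rule (the function name is just for illustration):

    # Stage 3 rule: keep the citation frequency only when it is at least 3.
    def stage3_label(frequency: int) -> int:
        return frequency if frequency >= 3 else 0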
- config.py
- Not intended to be run by itself; it defines which files to filter for generate-hetero-data.py and later training
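The real filter values live in config.py; purely as a hypothetical illustration of the kind of settings it might define:

    # Hypothetical illustration only -- config.py selects which files are kept
    # for generate-hetero-data.py and later training. Names and values are assumed.
    VENUES = ["acl", "emnlp", "naacl", "neurips"]   # assumed venue filter
    YEARS = range(2015, 2024)                       # assumed year filter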
- id-papers.py
- Generates a DataFrame stored in output/papers.parquet
- Each paper is given a unique ID and contains references to papers that exist in our dataset
- A reference is considered a match based on:
- whether a paper in our dataset has a title that is at least a 90% match with the reference's title
- whether that paper was published within +/- 1 year of the reference's date (sketched below)
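A minimal sketch of that matching rule, using difflib for the 90% title similarity (the actual script may use a different similarity measure):

    # Sketch of the reference-matching rule: >= 90% title similarity and a
    # publication year within +/- 1 of the reference's year.
    from difflib import SequenceMatcher

    def is_match(ref_title: str, ref_year: int, paper_title: str, paper_year: int) -> bool:
        similarity = SequenceMatcher(None, ref_title.lower(), paper_title.lower()).ratio()
        return similarity >= 0.90 and abs(paper_year - ref_year) <= 1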
- generate-hetero-data.py
- Generates all files needed for training based on config.py
- Stores files in output/hetero_data/
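Assuming the heterogeneous graph is built with PyTorch Geometric (an assumption based on the hetero_data naming), the generated files could resemble something like:

    # Sketch only, assuming PyTorch Geometric: a heterogeneous graph with paper
    # nodes and cites edges. Filenames, shapes, and feature source are assumed.
    import torch
    from torch_geometric.data import HeteroData

    data = HeteroData()
    data["paper"].x = torch.randn(4, 768)                      # stand-in for SciBERT features
    data["paper", "cites", "paper"].edge_index = torch.tensor(
        [[0, 0, 2],                                            # citing paper ids
         [1, 3, 1]],                                           # cited paper ids
        dtype=torch.long,
    )
    torch.save(data, "output/hetero_data/hetero_graph.pt")     # hypothetical filename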
- generate-scibert-embeddings.py
- Generates SciBERT embeddings for sampler.py and stores them under output/scibert_embeddings.pt
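A minimal sketch of generating and saving SciBERT embeddings with Hugging Face transformers (the pooling choice and input texts are assumptions):

    # Sketch only: embed paper texts with SciBERT and save the tensor for sampler.py.
    import torch
    from transformers import AutoModel, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained("allenai/scibert_scivocab_uncased")
    model = AutoModel.from_pretrained("allenai/scibert_scivocab_uncased").eval()

    texts = ["Attention Is All You Need", "Deep Residual Learning for Image Recognition"]
    with torch.no_grad():
        batch = tokenizer(texts, padding=True, truncation=True, max_length=512, return_tensors="pt")
        embeddings = model(**batch).last_hidden_state[:, 0, :]   # [CLS] vector per paper

    torch.save(embeddings, "output/scibert_embeddings.pt")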
- sampler.py
- Generates positive and negative samples to train on
- Splits the data into train/test/validation sets and stores them in output/sampler/
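A minimal sketch of the sampling and splitting step (the split ratios, negative-sampling strategy, and names are assumptions):

    # Sketch only: label cited pairs as positives, draw random non-cited pairs
    # as negatives, then split 80/10/10 into train/validation/test.
    import random

    def build_splits(citation_pairs, num_papers, neg_per_pos=1, seed=0):
        rng = random.Random(seed)
        cited = set(citation_pairs)
        samples = [(src, dst, 1) for src, dst in citation_pairs]            # positives
        while len(samples) < len(citation_pairs) * (1 + neg_per_pos):       # negatives
            src, dst = rng.randrange(num_papers), rng.randrange(num_papers)
            if src != dst and (src, dst) not in cited:
                samples.append((src, dst, 0))
        rng.shuffle(samples)
        n = len(samples)
        return samples[: int(0.8 * n)], samples[int(0.8 * n): int(0.9 * n)], samples[int(0.9 * n):]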
- train.py
- Runs stage 1 of training and stores results under output/train/
- stage2_train.py
- Runs stage 2 of training and stores results under output/train/
- inference.py
- Runs analysis on the trained model
We initially tried using OpenAlex to retrieve metadata and citation information, but OpenAlex has two big limitations:
- It does not offer an already-processed full-text dataset like Semantic Scholar's S2ORC
- It was missing a significant number of papers from the AI conferences we wanted to cover
Before fully committing to web scraping, we also tried Semantic Scholar, but:
- It was missing a significant number of full text papers compared to our web scraping solution