Draft2Paper

Goal

The aim of this project is to train a model that provides effective citation recommendations for papers from major AI conferences.

What data will we use to train the model?

We will be using web-scraped PDFs from the following conferences:

  • aaai
  • acl
  • aistats
  • colt
  • cvpr
  • eccv
  • emnlp
  • iccv
  • iclr
  • icml
  • ijcai
  • jmlr
  • naacl
  • neurips
  • uai

Where will our data be stored? (Tentative)

The data is expected to follow this directory structure:

/mnt/data/Draft2Paper/data
├─ grobid/                  # scripts to set up and run GROBID
├─ processed_pdfs/          # PDFs processed by GROBID into TEI XML format
├─ output/                  # output of scripts from src
│  ├─ llm_prompts/          # stores prompts that will be executed by the LLM
│  ├─ llm_output/           # stores output of the LLM
│  ├─ hetero_data/          # files generated by generate-hetero-data.py
│  ├─ sampler/              # files generated by sampler.py
│  ├─ train/                # files generated by train.py and stage2_train.py
│  ├─ papers.parquet        # file generated by id-papers.py
│  └─ scibert_embeddings.pt # file generated by generate-scibert-embeddings.py
└─ src/                     # location of scripts
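
As a rough illustration, a script in src could resolve these locations along the lines of the sketch below. The constant and helper names are made up for this sketch; only the paths come from the tree above.

# A minimal sketch of how a script in src/ might resolve the directories above.
# DATA_ROOT and the helper name are assumptions for illustration, not taken
# from config.py itself.
from pathlib import Path

DATA_ROOT = Path("/mnt/data/Draft2Paper/data")

PROCESSED_PDFS_DIR = DATA_ROOT / "processed_pdfs"      # GROBID TEI XML output
OUTPUT_DIR         = DATA_ROOT / "output"
LLM_PROMPTS_DIR    = OUTPUT_DIR / "llm_prompts"        # prompts to run through the LLM
LLM_OUTPUT_DIR     = OUTPUT_DIR / "llm_output"         # LLM responses
SCIBERT_EMBEDDINGS = OUTPUT_DIR / "scibert_embeddings.pt"
PAPERS_PARQUET     = OUTPUT_DIR / "papers.parquet"

def tei_path(venue: str, year: str, filename: str) -> Path:
    """Return the expected location of a processed TEI XML file."""
    return PROCESSED_PDFS_DIR / venue / year / f"{filename}.grobid.tei.xml"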

What scripts do we have?

GROBID Scripts (Run in order)

  1. run-grobid.sh
  2. grobid-script.py
  • Goes through the directory structure in /mnt/data/data/papers and processes all the PDFs there (see the sketch after this list)
  • Splits PDF processing across the two Docker containers that are running GROBID
  • Stores the GROBID TEI XML output in processed_pdfs/{venue}/{year}/{file}.grobid.tei.xml, matching the structure in /mnt/data/data/papers
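
Conceptually, the core of grobid-script.py looks like the sketch below: each PDF is posted to GROBID's processFulltextDocument endpoint and the work alternates between the two containers. The ports, glob pattern, and timeout are assumptions for this sketch, not values taken from the script.

# A minimal sketch of splitting GROBID processing across two containers.
# The container ports (8070/8071) are assumptions.
from pathlib import Path
import requests

GROBID_URLS = ["http://localhost:8070", "http://localhost:8071"]  # assumed ports
PAPERS_DIR = Path("/mnt/data/data/papers")
OUT_DIR = Path("/mnt/data/Draft2Paper/data/processed_pdfs")

def process_pdf(pdf_path: Path, grobid_url: str) -> str:
    """Send one PDF to GROBID's fulltext endpoint and return the TEI XML."""
    with pdf_path.open("rb") as fh:
        resp = requests.post(
            f"{grobid_url}/api/processFulltextDocument",
            files={"input": fh},
            timeout=300,
        )
    resp.raise_for_status()
    return resp.text

# Walk papers/{venue}/{year}/*.pdf and mirror that layout in processed_pdfs/.
for i, pdf in enumerate(sorted(PAPERS_DIR.glob("*/*/*.pdf"))):
    tei = process_pdf(pdf, GROBID_URLS[i % len(GROBID_URLS)])  # alternate containers
    out_path = OUT_DIR / pdf.parent.parent.name / pdf.parent.name / f"{pdf.stem}.grobid.tei.xml"
    out_path.parent.mkdir(parents=True, exist_ok=True)
    out_path.write_text(tei, encoding="utf-8")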

Known GROBID Issues:

1. Running ./run-grobid.sh

After running sudo ./run-grobid.sh, check whether the Docker containers are running with sudo docker ps. If no containers are displayed, try running one of the docker commands from the script directly by copying and pasting it into the terminal. You will likely see an error message such as:

docker: Error response from daemon: could not select device driver "" with capabilities: [[gpu]]

If this is the case, a potential fix is to follow the NVIDIA Container Toolkit installation instructions for Ubuntu: https://docs.nvidia.com/datacenter/cloud-native/container-toolkit/latest/install-guide.html

Then run the following to reconfigure Docker:

  1. sudo nvidia-ctk runtime configure --runtime=docker
  2. sudo systemctl restart docker

2. GROBID Output

GROBID processing is not perfect. It can produce citations that do not exist, miss citations, and fail to properly associate in-text citations with their reference entries.

src Scripts

Label Generation Scripts (Run in order)

  1. generate-citation-llm-prompts.py
  • Generates up to two LLM prompts for each valid citation in every processed paper
  • Outputs prompts under llm_prompts/{venue}/{year}/{file}/ref_b{number}stage{1 or 2}.txt
  2. run-llm-on-prompts.py
  • Runs the prompts generated by the previous script
  • LLM outputs are stored under llm_output/{venue}/{year}/{file}.csv
  3. stage3.py
  • Since the stage 3 labels are not produced by the previous scripts, this is run afterwards to update every .csv file with the stage 3 result (see the sketch after this list)
  • The stage 3 result is based on citation frequency: if the frequency is >= 3, the output is the frequency; otherwise, the output is 0
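
The stage 3 rule can be illustrated with the minimal sketch below; the CSV path, the column name, and the use of pandas are assumptions, not details taken from stage3.py.

# A minimal sketch of the stage 3 labelling rule described above.
# The file path and the "frequency" column name are hypothetical.
import pandas as pd

def stage3_label(citation_frequency: int) -> int:
    """Keep the frequency as the label when it is at least 3, otherwise 0."""
    return citation_frequency if citation_frequency >= 3 else 0

df = pd.read_csv("llm_output/aaai/2023/example_paper.csv")  # hypothetical file
df["stage3"] = df["frequency"].apply(stage3_label)          # assumed column name
df.to_csv("llm_output/aaai/2023/example_paper.csv", index=False)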

Pretraining Scripts (Run in order)

  1. config.py
  • Not intended to be run by itself; sets which files to filter for generate-hetero-data.py and for later training
  2. id-papers.py
  • Generates a DataFrame stored in output/papers.parquet
  • Each paper is given a unique ID and contains references to papers that exist in our dataset
  • A reference is considered a match based on (see the sketch after this list):
    • whether a paper in our dataset has a title that is at least a 90% match to the reference's title
    • whether that paper was published within +/- 1 year of the reference's date
  3. generate-hetero-data.py
  • Generates all files needed for training based on config.py
  • Stores files in output/hetero_data/
  4. generate-scibert-embeddings.py
  • Generates SciBERT embeddings for sampler.py and stores them under output/scibert_embeddings.pt
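
The reference-matching rule from id-papers.py can be illustrated with the short sketch below. The use of difflib for the 90% title similarity is an assumption; the actual script may use a different string matcher.

# A minimal sketch of the matching rule: titles at least 90% similar and
# publication years within one year of each other.
from difflib import SequenceMatcher

def titles_match(ref_title: str, paper_title: str, threshold: float = 0.90) -> bool:
    """True when the normalised titles are at least `threshold` similar."""
    a, b = ref_title.casefold().strip(), paper_title.casefold().strip()
    return SequenceMatcher(None, a, b).ratio() >= threshold

def is_reference_match(ref_title: str, ref_year: int,
                       paper_title: str, paper_year: int) -> bool:
    """Apply both criteria: title similarity and +/- 1 year on publication date."""
    return titles_match(ref_title, paper_title) and abs(ref_year - paper_year) <= 1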

Training Scripts (Run in order)

  1. sampler.py
  • Generates positive and negative samples to train on (see the sketch after this list)
  • Splits the data into train/test/validation sets and stores them in output/sampler/
  2. train.py
  • Runs stage 1 of training and stores results under output/train/
  3. stage2_train.py
  • Runs stage 2 of training and stores results under output/train/
  4. inference.py
  • Runs analysis of the trained model
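
As a rough illustration of the sampling idea only: positive examples are real (citing paper, cited paper) pairs, negatives are randomly drawn pairs that do not appear in the citation graph, and the result is split into train/validation/test. Every name and the 80/10/10 split below are assumptions, not values taken from sampler.py.

# A minimal sketch of citation-pair sampling with an assumed 80/10/10 split.
import random

def build_samples(citations: dict[int, set[int]], all_ids: list[int],
                  negatives_per_positive: int = 1, seed: int = 0):
    """Return (train, val, test) lists of (src_id, dst_id, label) triples."""
    rng = random.Random(seed)
    samples = []
    for src, cited in citations.items():
        for dst in cited:
            samples.append((src, dst, 1))               # positive: real citation
            for _ in range(negatives_per_positive):
                neg = rng.choice(all_ids)
                while neg == src or neg in cited:        # resample until it is a non-citation
                    neg = rng.choice(all_ids)
                samples.append((src, neg, 0))            # negative: paper not cited by src
    rng.shuffle(samples)
    n = len(samples)
    return samples[:int(0.8 * n)], samples[int(0.8 * n):int(0.9 * n)], samples[int(0.9 * n):]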

Have you tried other solutions?

We initially tried using OpenAlex to retrieve metadata and citation information, but OpenAlex has two big limitations:

  1. It does not have an already-processed dataset of full-text papers like S2ORC from Semantic Scholar
  2. It was missing a significant number of papers from the AI conferences we wanted to cover

Before fully committing to web scraping, we also tried Semantic Scholar, but:

  1. It was missing a significant number of full-text papers compared to our web-scraping solution

About

The AI Project I built when working as a research assistant! Does it work??? erm not really, but I probably learned something πŸ˜…
