
LLM-NERRE - Structured Data Extraction

For the publication "Structured information extraction from scientific text with large language models" in Nature Communications by John Dagdelen*, Alexander Dunn*, Sanghoon Lee, Nicholas Walker, Andrew S. Rosen, Gerbrand Ceder, Kristin A. Persson, and Anubhav Jain.

* = Equal contribution

This repository contains code for extracting structured relational data as JSON documents from complex scientific text, with particular application to materials science. For the Llama-2 fine-tuned models and code, see the supplementary nerre-llama repo.
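
To make the target format concrete, here is a minimal sketch of the kind of JSON record such a model can emit. The sentence and field names below are illustrative assumptions only, not the exact schemas used in the paper (those are documented in the subdirectories).

```python
import json

# Hypothetical illustration of structured extraction output.
# The sentence and field names are assumptions for illustration;
# see the subdirectories for the schemas actually used in the paper.
sentence = "LiFePO4 is a widely studied cathode material for Li-ion batteries."

extracted = [
    {
        "name": "lithium iron phosphate",
        "formula": "LiFePO4",
        "applications": ["Li-ion battery cathode"],
    }
]

print(json.dumps(extracted, indent=2))
```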

Contents

General/MOF JSON models (general_and_mofs subdirectory):

  • Code for reproducing the General-JSON and MOF-JSON models:
    • Code for fine-tuning GPT-3 models using the data shown in the paper to obtain similar results.
    • Code for preparing cross-validation splits for all models (see the sketch after this list)
    • Code for scoring results for all models
    • Initial inputs (annotations), intermediate files (if applicable), and final outputs for results shown in the paper for all models
    • For Llama-2 fine-tuning code and weights, see the supplementary nerre-llama repo
  • Includes the annotation UI (including optional in-the-loop annotation if you have your own fine-tuned LLM) for annotating new datasets!
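
To sketch what split preparation involves, the following minimal example (not the repository's actual script; file names are assumptions) shards annotated prompt/completion pairs, the JSONL format expected by legacy GPT-3 fine-tuning, into five cross-validation folds:

```python
import json
from pathlib import Path

from sklearn.model_selection import KFold

# Hypothetical sketch of preparing 5-fold cross-validation splits.
# Input: annotations.jsonl (assumed name), one record per line of the
# form {"prompt": ..., "completion": ...}, as used for legacy GPT-3
# fine-tuning.
records = [json.loads(line) for line in Path("annotations.jsonl").read_text().splitlines()]

kf = KFold(n_splits=5, shuffle=True, random_state=0)
for fold, (train_idx, test_idx) in enumerate(kf.split(records)):
    for name, idxs in (("train", train_idx), ("test", test_idx)):
        # Write each fold's train/test records to its own JSONL file.
        with open(f"fold_{fold}_{name}.jsonl", "w") as f:
            for i in idxs:
                f.write(json.dumps(records[i]) + "\n")
```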

Doping models (doping subdirectory):

  • Code for reproducing the Doping models:
    • Code for fine-tuning GPT-3 models using the data shown in the paper to obtain similar results, for all three schemas
    • Code for preparing train/test splits for all models/schemas
    • Code for scoring results for all models/schemas (see the scoring sketch after this list)
    • Initial inputs (annotations), intermediate files (incl. decoded entries), and final outputs for results shown in the paper for all models/schemas
    • For Llama-2 fine-tuning code and weights for all schemas, see the supplementary nerre-llama repo
  • Includes an annotation CLI for annotating new doping examples from your own data.
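
To give a flavor of what the scoring code computes, here is a minimal exact-match precision/recall/F1 sketch over extracted (host, dopant) pairs. The pairs are assumed inputs and this is not the repository's actual scoring logic, which may differ in details such as entity normalization:

```python
# Hypothetical exact-match scoring sketch; the repository's scripts
# may differ in details such as entity normalization.
def precision_recall_f1(predicted, gold):
    tp = len(predicted & gold)  # true positives: exact matches
    precision = tp / len(predicted) if predicted else 0.0
    recall = tp / len(gold) if gold else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

# Example with assumed (host, dopant) pairs:
gold = {("TiO2", "N"), ("TiO2", "F")}
pred = {("TiO2", "N"), ("ZnO", "Al")}
print(precision_recall_f1(pred, gold))  # (0.5, 0.5, 0.5)
```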

Software requirements

  • Python 3.7.3 or later (Unix, macOS, and Linux)
  • The software has been run with Python 3.7.3 on macOS Ventura 13.4.

Required Python packages are listed, with versions pinned, in the requirements.txt file in each subdirectory.

Installing and running the evaluation code

Specific instructions for each task are given in the readme.md files in the subdirectories. The scripts are run either from the command line (python <script_name.py> [options]) or through Jupyter notebooks. Running the scripts does not require installing this repository as a package, but it does require the packages listed in the requirements.txt files (e.g., via pip install -r requirements.txt).

Demos and expected outputs are given in the readme.md files in each subdirectory. Expected runtimes range from several seconds to several minutes.
