🧬 Single-Cell Metadata Extractor with GPT

This project automates the extraction of structured metadata from scientific papers in PDF format. Using an OpenAI GPT model, it populates a standardised Excel workbook with metadata for single-cell RNA-seq studies. The text of each PDF is extracted and used to fill in the fields of every worksheet, with one worksheet per metadata category.

📁 Project Structure

.
├── pdfs/                     # Folder containing input PDF files
├── completed_manifests/     # Output folder for generated Excel files
├── done/                    # Archive for processed PDF files
├── sc_rnaseq_mixs_v0.1_base_unprotected.xlsx  # Base Excel template
├── extract_metadata_to_manifest.py  # Main script
└── README.md                # This file

⚙️ Requirements

    Python 3.8+

    Dependencies:

        openai

        pandas

        openpyxl

        PyMuPDF (install via pip install pymupdf)

Install everything with:

pip install -r requirements.txt

🔑 Environment Variables

Set your OpenAI API key in your environment:

export GPT_KEY=your-openai-api-key
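
A minimal sketch of how a script could read this variable and fail early with a clear message when it is missing. The helper name `load_api_key` is hypothetical and is not taken from the project's script; only the variable name `GPT_KEY` comes from this README.

```python
import os


def load_api_key(var_name: str = "GPT_KEY") -> str:
    """Return the API key stored in the given environment variable.

    `load_api_key` is an illustrative helper, not part of the project.
    """
    key = os.environ.get(var_name, "").strip()
    if not key:
        # Fail before any PDF is processed rather than midway through a batch.
        raise RuntimeError(
            f"{var_name} is not set; run `export {var_name}=...` first."
        )
    return key
```

With the 1.x `openai` package, the returned value could then be passed to the client, for example `OpenAI(api_key=load_api_key())`.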

🚀 Usage

    Prepare PDFs: Place your scientific paper PDFs in the pdfs/ directory.

    Ensure base Excel file is present: The sc_rnaseq_mixs_v0.1_base_unprotected.xlsx template should be in the root directory.

    Run the script:

python extract_metadata_to_manifest.py

The script will:

    Extract text from each PDF.

    Use OpenAI GPT to extract metadata for each worksheet (study, person, sample, etc.).

    Write results into a new Excel file in completed_manifests/.

    Move the original PDF to done/ when finished.
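
The four steps above can be sketched as a single loop. This is an illustration under stated assumptions, not the project's actual code: `process_folder` and its three callable parameters (`extract_text`, `extract_rows`, `write_manifest`) are hypothetical names, and naming each manifest after its PDF is also an assumption. The PDF reader, the GPT call and the Excel writer are passed in as functions so the control flow stays visible.

```python
import shutil
from pathlib import Path
from typing import Callable, Dict, List

Rows = Dict[str, List[dict]]  # worksheet name -> list of row dicts


def process_folder(
    pdf_dir: Path,
    out_dir: Path,
    done_dir: Path,
    extract_text: Callable[[Path], str],
    extract_rows: Callable[[str], Rows],
    write_manifest: Callable[[Rows, Path], None],
) -> List[Path]:
    """Run the four steps for every PDF and return the manifests written."""
    out_dir.mkdir(parents=True, exist_ok=True)
    done_dir.mkdir(parents=True, exist_ok=True)
    written = []
    for pdf in sorted(pdf_dir.glob("*.pdf")):
        text = extract_text(pdf)                 # 1. PDF -> plain text
        rows = extract_rows(text)                # 2. text -> rows per worksheet
        manifest = out_dir / f"{pdf.stem}.xlsx"  # assumed naming scheme
        write_manifest(rows, manifest)           # 3. rows -> Excel file
        # 4. archive the PDF only after its manifest has been written
        shutil.move(str(pdf), str(done_dir / pdf.name))
        written.append(manifest)
    return written
```

Moving the PDF last means a paper that fails partway through stays in pdfs/ and is picked up again on the next run.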

🧠 Metadata Context & GPT Prompting

GPT is prompted with detailed domain-specific context for single-cell genomics. Each worksheet is filled by asking GPT to extract required fields from the full text of the paper. The script ensures:

    One row per item (no arrays).

    Optional fields may be blank; required ones are prioritised.

    Unique IDs are created and preserved across sheets.
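
As a rough illustration of how these three rules could be stated in a per-worksheet prompt, the sketch below assembles the instruction text. The function `build_sheet_prompt`, the exact wording and the field names used in the example (`study_id`, `sample_id`) are assumptions made for this example; the real prompt in the script may differ.

```python
from typing import Dict, List


def build_sheet_prompt(
    sheet: str,
    required: List[str],
    optional: List[str],
    known_ids: Dict[str, List[str]],
) -> str:
    """Build instruction text for one worksheet (hypothetical wording)."""
    lines = [
        f"Extract metadata for the '{sheet}' worksheet from the paper below.",
        "Return a JSON list of objects, one object per item (one row each).",
        "Every value must be a single string or number, never an array.",
        "Required fields: " + ", ".join(required),
        "Optional fields (leave blank if not reported): " + ", ".join(optional),
    ]
    if known_ids:
        # IDs produced for earlier sheets are repeated so that later sheets
        # refer to the same records instead of inventing new identifiers.
        lines.append("Reuse these IDs when referring to existing records:")
        lines += [f"  {name}: {', '.join(ids)}" for name, ids in known_ids.items()]
    return "\n".join(lines)


prompt = build_sheet_prompt(
    "sample",
    required=["sample_id", "study_id"],
    optional=["collection_date"],
    known_ids={"study_id": ["study_001"]},
)
```

Feeding the IDs from already completed sheets into each later prompt is one simple way to keep identifiers consistent across the workbook.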

🛠️ Notes & Tips

    If the script fails to parse the GPT output as JSON, it will still produce a row of placeholder data.

    Sheet column widths are auto-adjusted for readability.

    Sheets are marked visible in the output workbook.

    Only sheets present in the template will be processed.

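    GPT model used: gpt-4o (also referred to here as "o3"; since gpt-4o and o3 are distinct OpenAI models, check the model name set in the script).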
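
The JSON fallback mentioned in the first note can be sketched as follows. This is a minimal illustration, not the script's actual logic: `parse_rows` is a hypothetical helper, and the placeholder value `"unknown"` is an assumption, since the real placeholder text is not stated here.

```python
import json
from typing import List


def parse_rows(raw: str, columns: List[str], placeholder: str = "unknown") -> List[dict]:
    """Parse a model reply into row dicts, or return one placeholder row."""
    fallback = [{col: placeholder for col in columns}]
    try:
        data = json.loads(raw)
    except json.JSONDecodeError:
        return fallback  # reply was not valid JSON
    if isinstance(data, dict):
        data = [data]  # a single object counts as one row
    if not isinstance(data, list) or not data:
        return fallback
    if not all(isinstance(row, dict) for row in data):
        return fallback  # valid JSON, but not a list of objects
    # Keep only template columns; missing optional fields stay blank.
    return [{col: row.get(col, "") for col in columns} for row in data]
```

Restricting each row to the template's columns also means that any extra keys in the model's reply never reach the workbook.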
