This repository contains the code for the embeddings, plots, and results of our paper
"Reading Beyond the Center. Modeling Book Encounters in the Danish Periphery (1800-1850)", which will be presented at CHR2025.
Some useful directions:
- `/src/` contains scripts to create embeddings
- `/figures/` contains the figures generated by the notebooks
- `/notebooks/` contains the notebooks used for the analysis
The dataset used in this paper is available on Hugging Face; it is an earlier version and a subset of this dataset.
The trained embeddings are also available on Hugging Face.
Please cite our [paper](link coming soon) if you use the code, dataset or embeddings:
├── LICENSE <- Open-source license if one is chosen.
│
├── README.md <- The top-level README for developers using this project.
│
├── src/
│   │
│   ├── process_articles.py <- Code to get embeddings from newspaper article chunks.
│   ├── mean_pooling.py <- Code to get average embeddings from newspaper articles.
│   └── merge_text_embs.py <- Merge texts and embeddings.
│
│
├── data/ <- Data used for the analysis in notebooks.
│
├── prompt_optimization/ <- Data related to the prompt optimization task with GPT.
│
│
├── notebooks/ <- Jupyter notebooks.
│   │
│   ├── classify_articles.ipynb <- Notebook to classify article types.
│   ├── explore_and_find_book_ads.ipynb <- Notebook to get descriptive statistics and create a subset of book advertisements.
│   ├── create_gold_book_announcements.ipynb <- Notebook to create gold standard book announcements.
│   ├── classify_book_announcements.ipynb <- Notebook to classify book announcements.
│   ├── api_gpt.ipynb <- Notebook to annotate book titles with GPT.
│   └── analyse_titles.ipynb <- Notebook to analyse book titles and do statistical tests.
│
└── figures/ <- Generated graphics and figures used in the paper.
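
The tree above describes `mean_pooling.py` as producing average embeddings for newspaper articles from chunk-level embeddings. As a rough illustration of that idea only (not the script's actual code), the sketch below averages chunk vectors per article. The function name `mean_pool`, the `(article_id, vector)` input layout, and the toy values are assumptions made for this example; it uses only the standard library, whereas the real script may rely on NumPy or a similar package.

```python
from typing import Dict, Iterable, List, Tuple


def mean_pool(chunks: Iterable[Tuple[str, List[float]]]) -> Dict[str, List[float]]:
    """Average chunk embeddings per article.

    `chunks` yields (article_id, embedding) pairs; this layout is a
    hypothetical one chosen for the example, not the repository's format.
    """
    sums: Dict[str, List[float]] = {}
    counts: Dict[str, int] = {}
    for article_id, vector in chunks:
        if article_id not in sums:
            # First chunk of this article: start the running sum.
            sums[article_id] = list(vector)
            counts[article_id] = 1
            continue
        if len(vector) != len(sums[article_id]):
            raise ValueError(f"dimension mismatch for article {article_id!r}")
        # Add this chunk's vector to the running sum, element by element.
        sums[article_id] = [a + b for a, b in zip(sums[article_id], vector)]
        counts[article_id] += 1
    # Divide each summed vector by the number of chunks it came from.
    return {
        article_id: [value / counts[article_id] for value in total]
        for article_id, total in sums.items()
    }


# Toy input: two chunks for one article, one chunk for another.
chunk_embeddings = [
    ("article_1", [1.0, 2.0]),
    ("article_1", [3.0, 4.0]),
    ("article_2", [0.5, 0.5]),
]
article_embeddings = mean_pool(chunk_embeddings)
print(article_embeddings)
```

Each article ends up with one vector of the same dimensionality as its chunks, which is the shape a later step such as merging texts with embeddings would presumably expect.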
