Continuous sentiment scores for literary and multilingual contexts

This is the GitHub repository for reproducing the results of the paper "Continuous sentiment scores for literary and multilingual contexts".

📋 Table of Contents

- Overview
- Installation
- Usage
  - Reproducing Paper Results
- Sanity-Check of the Sentiment Vector
  - Distribution Analysis
  - Word-Level Analysis

🔍 Overview

This repository extracts sentiment from contextual embeddings by using linear projection in embedding space. The repository contains the sentiment datasets and the functions used to embed text, define concept vectors, and project new data.

Concretely, the project develops a technique for extracting information from contextual sentence embeddings (model="paraphrase-multilingual-mpnet-base-v2") by projecting the embeddings onto a concept vector.
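As a rough illustration of the idea, the sketch below builds a concept vector and projects new embeddings onto it. It is not this repository's code: the function names, the choice of a unit-length difference of class means as the concept vector, and the tiny 3-dimensional toy embeddings are all assumptions made for illustration. In the actual pipeline the embeddings would come from paraphrase-multilingual-mpnet-base-v2.

```python
import math


def mean_vector(vectors):
    """Element-wise mean of a list of equal-length vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def concept_vector(positive, negative):
    """Assumed definition: unit-length difference between the class means."""
    pos_mean = mean_vector(positive)
    neg_mean = mean_vector(negative)
    diff = [p - q for p, q in zip(pos_mean, neg_mean)]
    norm = math.sqrt(sum(d * d for d in diff))
    return [d / norm for d in diff]


def project(embedding, concept):
    """Scalar projection of an embedding onto a unit-length concept vector."""
    return sum(e * c for e, c in zip(embedding, concept))


# Toy 3-dimensional "embeddings" (hypothetical values, not model output).
positive = [[0.9, 0.1, 0.0], [0.7, 0.3, 0.0]]
negative = [[-0.8, 0.2, 0.0], [-0.6, 0.2, 0.0]]

sentiment = concept_vector(positive, negative)
score_pos = project([0.5, 0.2, 0.1], sentiment)
score_neg = project([-0.5, 0.2, 0.1], sentiment)
print(score_pos, score_neg)
```

Because the concept vector is normalised, each score is a single continuous number: the signed length of the embedding's component along the sentiment direction.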

The pipeline of the Semantic Projection algorithm is roughly visualised in the image below, and can be executed and validated by running the main.py script.

If you are instead looking for a quick implementation of Semantic Projection, please check out the SemanticProjection package: https://github.com/lauritswl/SemanticProjection

🛠️ Installation

Clone the repository and navigate to it

git clone https://github.com/centre-for-humanities-computing/embedding-projection.git
cd embedding-projection

Create and activate a virtual environment with uv

uv venv
source .venv/bin/activate

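Install dependencies from pyproject.toml

uv pip install -e .

Alternatively, without uv, create the environment with venv and install with pip

python -m venv .venv
source .venv/bin/activate
pip install -e .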

Requirements: Dependencies needed to run main.py can be found in the pyproject.toml.

🚀 Usage

📊 Reproducing Paper Results

Run main.py to reproduce the plots and results used in the paper:

# Make sure your virtual environment is activated
source .venv/bin/activate

# Run the main script
python main.py

# The script will:
# 1. Load and preprocess the datasets
# 2. Create embeddings using MPNET
# 3. Generate concept vectors
# 4. Project test data onto concept vectors
# 5. Create plots in the plots/ directory

🧪 Sanity-Check of the Sentiment Vector

The projection scores correlate rather strongly with the average human annotator rating. This is seen in the scatterplot below, which plots predictions against annotator ratings for the EmoBank dataset (which is held out of the training data):

📈 Distribution Analysis

The projection of binary-classified IMDB reviews onto our Sentiment Vector shows clear separation between positive and negative sentiments:
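One simple way to put a number on such a separation is to place a threshold halfway between the two class means and count how many reviews fall on the correct side. This is a minimal sketch under assumptions: the midpoint threshold is an illustrative choice, not necessarily how the paper evaluates separation, and the scores and labels are hypothetical, not IMDB data.

```python
from statistics import mean

# Hypothetical projection scores paired with binary labels (1 = positive).
scores = [0.8, 0.5, 0.3, -0.1, -0.4, -0.7, 0.1, -0.2]
labels = [1, 1, 1, 1, 0, 0, 0, 0]

pos = [s for s, y in zip(scores, labels) if y == 1]
neg = [s for s, y in zip(scores, labels) if y == 0]

# Place a threshold halfway between the two class means.
threshold = (mean(pos) + mean(neg)) / 2

# Fraction of reviews whose score falls on the correct side of the threshold.
predicted = [1 if s > threshold else 0 for s in scores]
accuracy = mean(1 if p == y else 0 for p, y in zip(predicted, labels))
print(f"threshold={threshold:.4f}, accuracy={accuracy:.2f}")
```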

🔤 Word-Level Analysis

To validate our approach, we projected individual words from the corpus onto the Sentiment Vector. This method is inspired by S3 (Semantic Signal Separation). The script for doing this is not included in the repo:

⬆️ Highest Projection Score

pleasure    anytime     admired     admire      fabulous
classical   beloved     romantic    anthologies  lovely

⬇️ Lowest Projection Score

worse       terrible    sucked      horrible    worst
bad         rotten      unacceptable stupidity   awful

⚠️ Note: the vector might be correlated with the Romantic literary period (H.C. Andersen), as suggested by words such as "anthologies", "classical" and "romantic". This might be a byproduct of fairytales having a high density of positive semantics and thus being overrepresented in the training set.
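Since the word-level script is not part of the repo, the following is only a sketch of how such a ranking could be produced, not the authors' implementation. The function name rank_words, the toy 2-dimensional vectors and the stand-in concept vector are all hypothetical. In practice each word would be embedded with the same sentence-embedding model as the rest of the pipeline.

```python
def rank_words(word_embeddings, concept, k=2):
    """Return the k highest- and k lowest-scoring words by dot product."""
    scores = {
        word: sum(e * c for e, c in zip(vec, concept))
        for word, vec in word_embeddings.items()
    }
    ordered = sorted(scores, key=scores.get, reverse=True)
    # The lowest list is reversed so that the most negative word comes first.
    return ordered[:k], ordered[-k:][::-1]


# Toy 2-dimensional vectors (hypothetical, not real model embeddings).
word_embeddings = {
    "lovely": [0.9, 0.1],
    "fabulous": [0.8, 0.3],
    "table": [0.0, 0.5],
    "awful": [-0.9, 0.2],
    "bad": [-0.6, 0.1],
}
concept = [1.0, 0.0]  # stand-in for the Sentiment Vector

highest, lowest = rank_words(word_embeddings, concept)
print(highest, lowest)
```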

