This is a GitHub repository for reproducing the results of "Continuous sentiment scores for literary and multilingual contexts".
This repository extracts sentiment from contextual embeddings by using linear projection in embedding space. The repository contains the sentiment datasets and the functions used to embed text, define concept vectors, and project new data.
Concretely, the project develops a technique for extracting information from contextual sentence embeddings (model="paraphrase-multilingual-mpnet-base-v2") by projecting the embeddings onto a concept vector.
The pipeline of the Semantic Projection algorithm is roughly visualised in the image below, and can be executed and validated by running the main.py script.
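The projection step can be illustrated with a minimal sketch. It assumes the concept vector is the normalised difference between the mean embedding of positive examples and the mean embedding of negative examples; this README does not state how main.py defines it, so treat that choice, the helper names `concept_vector` and `project`, and the toy 3-dimensional arrays as assumptions for illustration. In practice the arrays would hold sentence embeddings, for example from `SentenceTransformer("paraphrase-multilingual-mpnet-base-v2").encode(texts)` in the sentence-transformers library.

```python
import numpy as np


def concept_vector(pos_emb: np.ndarray, neg_emb: np.ndarray) -> np.ndarray:
    """Unit vector pointing from the mean negative to the mean positive embedding.

    Assumption: difference-of-means is one common way to define a concept
    vector; the repository may define it differently.
    """
    direction = pos_emb.mean(axis=0) - neg_emb.mean(axis=0)
    return direction / np.linalg.norm(direction)


def project(emb: np.ndarray, concept: np.ndarray) -> np.ndarray:
    """Scalar projection of each row of `emb` onto the unit concept vector."""
    return emb @ concept


# Toy 3-dimensional "embeddings" standing in for real sentence embeddings.
pos = np.array([[1.0, 0.2, 0.0], [0.8, 0.0, 0.1]])
neg = np.array([[-1.0, 0.1, 0.0], [-0.9, 0.0, 0.2]])
new = np.array([[0.5, 0.3, 0.0], [-0.4, 0.3, 0.1]])

sentiment_vector = concept_vector(pos, neg)
scores = project(new, sentiment_vector)
print(scores)  # one continuous score per row of `new`
```

Each score is a continuous value: rows that lie towards the positive examples get a higher score than rows that lie towards the negative examples.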

If you are instead looking for a quick implementation of Semantic Projection, please check out the SemanticProjection package: https://github.com/lauritswl/SemanticProjection
- Clone the repository and navigate to it
git clone https://github.com/centre-for-humanities-computing/embedding-projection.git
cd embedding-projection
- Create and activate virtual environment with uv
uv venv
source .venv/bin/activate
- Install dependencies from pyproject.toml
python -m venv venv
source venv/bin/activate
pip install -e .
Requirements: Dependencies needed to run main.py can be found in the pyproject.toml.
Run main.py to reproduce the plots and results used in the paper:
# Make sure your virtual environment is activated
source .venv/bin/activate
# Run the main script
python main.py
# The script will:
# 1. Load and preprocess the datasets
# 2. Create embeddings using MPNET
# 3. Generate concept vectors
# 4. Project test data onto concept vectors
# 5. Create plots in the plots/ directory
The projection scores correlate rather strongly with the average human annotator. This is visible in the scatterplot of predictions against annotator ratings for the EmoBank dataset (which is held out of the training data).
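The agreement between projection scores and annotator ratings can be quantified with a correlation coefficient. The sketch below is one way to do that with NumPy; the helper names `pearson_r` and `spearman_rho` are hypothetical, and the numbers are made up for illustration and are not results from the paper.

```python
import numpy as np


def pearson_r(x, y) -> float:
    """Pearson correlation between two equally long 1-D sequences."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    return float(np.corrcoef(x, y)[0, 1])


def spearman_rho(x, y) -> float:
    """Spearman rank correlation; this simple version assumes no tied values."""
    rank_x = np.argsort(np.argsort(x))
    rank_y = np.argsort(np.argsort(y))
    return pearson_r(rank_x, rank_y)


# Made-up values for illustration only; not taken from EmoBank or the paper.
projection_scores = np.array([-0.42, -0.10, 0.05, 0.31, 0.58])
annotator_means = np.array([1.8, 2.6, 3.1, 3.4, 4.2])

print("Pearson r:", pearson_r(projection_scores, annotator_means))
print("Spearman rho:", spearman_rho(projection_scores, annotator_means))
```

Pearson measures linear agreement, while Spearman only checks whether the two rankings agree, which can be the more forgiving measure when the projection scale and the annotation scale differ.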
The projection of binary-classified IMDB reviews onto our Sentiment Vector shows clear separation between positive and negative sentiments.
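One way to put a number on such a separation is the probability that a randomly chosen positive review scores higher than a randomly chosen negative one (the ROC AUC). The sketch below is not taken from the repository; `separation_auc` is a hypothetical helper and the scores are invented for illustration.

```python
import numpy as np


def separation_auc(pos_scores, neg_scores) -> float:
    """Fraction of (positive, negative) pairs where the positive scores higher.

    Ties count as half a win. 1.0 means perfect separation, 0.5 means none.
    """
    pos = np.asarray(pos_scores, dtype=float)[:, None]
    neg = np.asarray(neg_scores, dtype=float)[None, :]
    wins = (pos > neg).sum() + 0.5 * (pos == neg).sum()
    return float(wins / (pos.size * neg.size))


# Invented projection scores; real values would come from the projected reviews.
pos_reviews = np.array([0.35, 0.52, 0.10, 0.44])
neg_reviews = np.array([-0.40, -0.12, 0.15, -0.33])

auc = separation_auc(pos_reviews, neg_reviews)
print("AUC:", auc)
```

The pairwise comparison is quadratic in the number of reviews, so for a full dataset a rank-based implementation would likely be preferable.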
To validate our approach, we projected individual words from the corpus onto the Sentiment Vector, a method inspired by S3 (Semantic Signal Separation). The script for doing this is not included in the repo. Words from this projection:
pleasure anytime admired admire fabulous
classical beloved romantic anthologies lovely
worse terrible sucked horrible worst
bad rotten unacceptable stupidity awful
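Since the word-projection script is not part of the repo, the following is only a sketch of how such a ranking could be produced: embed each word, project it onto the Sentiment Vector, and list the words at both ends of the scale. The function `rank_words`, the 2-dimensional toy embeddings, and the toy vector are all assumptions for illustration.

```python
import numpy as np


def rank_words(words, word_emb, concept, k=2):
    """Return the k highest- and k lowest-scoring words along `concept`."""
    unit = concept / np.linalg.norm(concept)
    scores = word_emb @ unit
    order = np.argsort(scores)  # ascending: most negative first
    lowest = [words[i] for i in order[:k]]
    highest = [words[i] for i in order[::-1][:k]]
    return highest, lowest


# Toy 2-dimensional embeddings; real ones would come from the embedding model.
words = ["lovely", "awful", "table", "fabulous", "worst"]
word_emb = np.array(
    [[0.9, 0.1], [-0.8, 0.2], [0.0, 0.5], [0.7, -0.1], [-0.95, 0.0]]
)
sentiment_vector = np.array([1.0, 0.0])

top, bottom = rank_words(words, word_emb, sentiment_vector, k=2)
print("highest:", top)
print("lowest:", bottom)
```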


