Enhancing Sentiment Analysis: Model Comparison, Domain Adaptation, and Lexicon Evolution in German Data
The code in this repository is part of the Master's thesis handed in by Robert Georg Geislinger.
All used code, datasets, and models are listed below, together with instructions on how to run the code and where to change the variables.
Datasets:
- GermEval
- OMP
- Schmidt
- Wikipedia

Models:
- GerVADER
- Guhr
- Lxyuan
- Gemma 7B
- Gemma 7B Instruct
- Llama 2 13B Chat
- Llama 3 8B
- Llama 3 8B Instruct
- Mistral 7B Instruct v0.2
- Mixtral 8x7B Instruct v0.1
- multilingual-e5-large-instruct
- Tested Models: GerVADER, Guhr, Lxyuan, Gemma 7B Instruct, Llama 2 13B Chat, Llama 3 8B Instruct, Mistral 7B Instruct, Mixtral 8x7B Instruct
- Tested Datasets: GermEval, OMP, Schmidt, Wikipedia

Note: not every model runs on every dataset on an NVIDIA A100 80GB.
- Clone GerVADER
- Convert the dataset into the GerVADER format with GerVaderConv.py
- Copy the converted file into the GerVADER directory
- Run GerVADER:
python GERvaderModule.py
- Select 1
- Choose the dataset (GerVADER only supports lowercase filenames!)
- Evaluate the results
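The conversion step can be pictured with the following sketch. The exact column layout GerVADER expects is an assumption here; GerVaderConv.py in this repository is the authoritative converter.

```python
import csv

def convert_to_gervader(rows, out_path):
    """Write (label, text) pairs as a tab-separated file.

    NOTE: the id/label/text layout is an assumption for illustration;
    check GerVaderConv.py for the real format.
    """
    with open(out_path, "w", encoding="utf-8", newline="") as f:
        writer = csv.writer(f, delimiter="\t")
        for i, (label, text) in enumerate(rows):
            # one document per line; keep tabs/newlines out of the text field
            writer.writerow([i, label, text.replace("\t", " ").replace("\n", " ")])

convert_to_gervader([("positive", "Sehr gut!")], "germeval.tsv")  # lowercase filename!
```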
rq1_ml.py
- Select the model and the dataset and run Lxyuan or Guhr with preprocessing
Results are stored in results_rq1
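The preprocessing applied before Guhr/Lxyuan might look like this minimal sketch of typical social-media cleaning; the exact steps in rq1_ml.py may differ.

```python
import re

def preprocess(text: str) -> str:
    """Light text cleaning before classification (illustrative only)."""
    text = re.sub(r"https?://\S+", "", text)  # drop URLs
    text = re.sub(r"@\w+", "", text)          # drop user mentions
    text = re.sub(r"\s+", " ", text)          # collapse whitespace
    return text.strip()

print(preprocess("@user Das ist super! https://example.com"))  # Das ist super!
```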
rq1_llms.sh
- Runs the experiments with different models on different datasets

rq1_llms.py
- The actual LLM code
Results are stored in results_rq1
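Classifying with an instruct LLM boils down to building a prompt and mapping the free-form reply back onto a label. The prompt wording below is hypothetical; rq1_llms.py contains the version actually used.

```python
LABELS = ("positive", "negative", "neutral")

def build_prompt(text: str) -> str:
    # Hypothetical prompt; the wording in rq1_llms.py may differ.
    return (
        "Classify the sentiment of the following German text as "
        "positive, negative, or neutral.\n"
        f"Text: {text}\nSentiment:"
    )

def parse_label(reply: str) -> str:
    """Map a free-form model reply onto one of the three labels."""
    reply = reply.lower()
    for label in LABELS:
        if label in reply:
            return label
    return "neutral"  # fall back when the model answers off-format

print(parse_label("Positive. The text praises the product."))  # positive
```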
- Tested Models: Gemma 7B Instruct, Llama 3 8B Instruct, Mistral 7B Instruct
- Tested Datasets: GermEval, OMP, Schmidt
rq21.sh
- Runs the experiments

rq21.py
- The actual code. Select here which context to use: GermEval, OMP, Schmidt, or all.
Results are stored in results_rq21
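Selecting a context amounts to assembling labelled in-context examples from the chosen dataset(s) into a prompt prefix. The examples below are invented placeholders; rq21.py draws real examples from GermEval, OMP, or Schmidt.

```python
# Invented placeholder examples for illustration only.
CONTEXTS = {
    "germeval": [("Tolles Produkt!", "positive"), ("Leider defekt.", "negative")],
    "omp": [("Guter Artikel.", "positive"), ("Schwacher Beitrag.", "negative")],
}

def build_context(names):
    """Concatenate labelled examples from the selected datasets into a prompt prefix."""
    lines = []
    for name in names:
        for text, label in CONTEXTS[name]:
            lines.append(f"Text: {text}\nSentiment: {label}")
    return "\n\n".join(lines)

prompt_prefix = build_context(["germeval", "omp"])  # or a single dataset
```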
- Tested Models: Gemma 7B Instruct, Llama 3 8B Instruct, Mistral 7B Instruct
- Tested Datasets: GermEval, OMP, Schmidt
rq22.sh
- Runs the experiments, each configuration three times. Set the training sizes and the model here.

rq22.py
- The actual training code. Set the dataset here.
- Tested Models: multilingual-e5-large-instruct
- Tested Datasets: GermEval, OMP, Schmidt
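The experiment loop has this shape: every training size is run three times so that mean and variance can be reported. The classifier call below is a stand-in returning a dummy score; in rq22.py it fine-tunes multilingual-e5-large-instruct, and the training sizes are assumptions.

```python
import random

def run_config(train_size: int, seed: int) -> float:
    """Stand-in for one training run; returns a dummy score.

    The real code would fine-tune the model on `train_size` examples
    and evaluate on a held-out test set.
    """
    random.seed(seed)
    return random.random()  # placeholder for the real evaluation metric

# Each configuration is repeated three times to capture run-to-run variance.
results = {
    size: [run_config(size, seed) for seed in range(3)]
    for size in (100, 500, 1000)  # assumed training sizes
}
```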
To change the GerVADER dictionary, small changes are necessary.
Replace the following lines in vaderSentimentGER.py at line 310 to load a custom lexicon:
old:
```python
with open('outputmap.txt', 'w') as f:
    print(lex_dict, file=f)
```
new:
```python
# requires "import json" at the top of vaderSentimentGER.py
with open('outputmap.txt', 'r') as f:
    new_str_dict = f.read()
lex_dict = json.loads(new_str_dict)
```
Replace the outputmap.txt file with the generated lexicon.
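Since the modified loader parses outputmap.txt with json.loads, the replacement file must be valid JSON. A generated lexicon can be written like this (the scores here are invented; the real lexicon comes from the RQ3 scripts):

```python
import json

# Hypothetical scores for illustration only.
lexicon = {"gut": 1.9, "schlecht": -2.1, "super": 2.7}

# The modified vaderSentimentGER.py reads this file with json.loads,
# so it has to be valid JSON (not a Python repr).
with open("outputmap.txt", "w", encoding="utf-8") as f:
    json.dump(lexicon, f, ensure_ascii=False)
```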
- Tested Models: Gemma, Llama 3, Mistral
- Tested Datasets: GermEval, OMP, Schmidt
rq31.py
- The actual code. Set the dataset here or change the context.
- The dictionary is saved as dicts/rq31_$dataset$_$model$.txt
rq3_postprocessing.py
- Postprocessing of the lexicon. The resulting lexicon can be used directly in GerVADER by replacing the existing outputmap.txt.
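Typical postprocessing steps are merging duplicate entries and keeping scores inside VADER's valence range. This is a sketch only; the concrete steps in rq3_postprocessing.py may differ.

```python
def postprocess(entries):
    """Merge duplicate words by averaging and clip scores to VADER's [-4, 4]."""
    merged = {}
    for word, score in entries:
        merged.setdefault(word.lower(), []).append(score)
    return {
        word: max(-4.0, min(4.0, sum(scores) / len(scores)))
        for word, scores in merged.items()
    }

print(postprocess([("Gut", 2.0), ("gut", 3.0), ("mies", -5.0)]))
# {'gut': 2.5, 'mies': -4.0}
```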
- Tested Models: Gemma, Llama 3
- Tested Datasets: GermEval, OMP, Schmidt
Create and activate a separate virtual environment (e.g., Python 3.10).
Newer scikit-learn versions are not compatible with the cTFIDF implementation.
python -m venv envcTFIDF
source envcTFIDF/bin/activate
- Clone the git repository:
git clone https://github.com/MaartenGr/cTFIDF.git
- Install the requirements:
pip install -r cTFIDF/requirements.txt
- Run the script rq32_ctfidf.py. Set the variable seed_size to the size of the seed lexicon (0 = full lexicon).
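The class-based TF-IDF score behind this step can be sketched in a few lines of numpy. This is an illustrative reimplementation of Grootendorst's c-TF-IDF formula; the cloned repository provides the reference implementation used here.

```python
import numpy as np

def ctfidf(counts):
    """Class-based TF-IDF (c-TF-IDF) over a (n_classes, n_terms) count matrix.

    W[c, t] = tf[c, t] * log(1 + A / f[t]), where tf is the term share
    within a class, f[t] the total frequency of term t over all classes,
    and A the average number of words per class.
    """
    counts = np.asarray(counts, dtype=float)
    tf = counts / counts.sum(axis=1, keepdims=True)  # term share within a class
    f = counts.sum(axis=0)                           # term frequency over all classes
    A = counts.sum() / counts.shape[0]               # average words per class
    return tf * np.log(1.0 + A / f)
```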
- Create a folder weaviate/
- Start Weaviate in a Docker container:
docker compose -f docker-compose.yml up -d
- Create the embeddings for each model and each dataset
rq32_create_embeddings.py
- Select the model and the dataset

rq32_extend_lexicon.py
- Select the seed lexicon, the model, and the dataset. Extends the seed lexicon.

rq3_postprocessing.py
- Postprocessing of the lexicon.
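The extension step can be pictured as a nearest-neighbour search over embeddings: each candidate word inherits the score of its most similar seed word. This sketch uses plain numpy; rq32_extend_lexicon.py does the same kind of search via Weaviate.

```python
import numpy as np

def extend_lexicon(seed, candidates):
    """Assign each candidate word the score of its most similar seed word.

    seed:       dict word -> (embedding, sentiment score)
    candidates: dict word -> embedding
    Sketch only - the repository uses Weaviate for the similarity search.
    """
    words = list(seed)
    vecs = np.array([seed[w][0] for w in words], dtype=float)
    vecs /= np.linalg.norm(vecs, axis=1, keepdims=True)  # unit-length seed vectors
    extended = {}
    for word, emb in candidates.items():
        emb = np.asarray(emb, dtype=float)
        sims = vecs @ (emb / np.linalg.norm(emb))  # cosine similarities to all seeds
        extended[word] = seed[words[int(np.argmax(sims))]][1]
    return extended
```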
All plots can be generated with the LaTeX backend for font rendering.
plot_confucionMatrices.py
- Plots and results for the confusion matrices (RQ1, RQ2.1, RQ2.2)

plot_rq22.py
- Line plots with scattering (RQ2.2)

plot_rq23.py
- Line plot with scattering (RQ2.3)

plot_rq31_bins.py
- Histograms (RQ3.1)
Generated plots are stored in plots/
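The LaTeX backend is switched on via matplotlib's rcParams, roughly as below; note that text.usetex requires a local LaTeX installation, and the exact settings in the plot scripts may differ.

```python
import matplotlib

# Hand text rendering to LaTeX so plot fonts match the thesis.
matplotlib.rcParams.update({
    "text.usetex": True,     # requires a working local LaTeX installation
    "font.family": "serif",
})
```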