- install requirements
- run the main command like so:
```
python -m src.sentiment_benchmarking \
    --dataset-name chcaa/fiction4sentiment \
    --model-names cardiffnlp/twitter-xlm-roberta-base-sentiment-multilingual \
    --model-names cardiffnlp/xlm-roberta-base-sentiment-multilingual \
    --model-names MiMe-MeMo/MeMo-BERT-SA \
    --model-names alexandrainst/da-sentiment-base \
    --model-names vesteinn/danish_sentiment \
    --translate
```

Option to add, to run only a limited number of rows for testing, e.g.: `--n-rows 10`

Option to add, for Google-translating the texts: `--translate` (NB: will be slow going)

I.e.: the script, `model_names` (can be multiple), `n-rows` (optional) and `translate` (optional)
- results (CSVs with the scored data & the Spearman results) are saved in the `results` folder
- takes (transformers-compatible) finetuned models and uses them to SA-score the sentences of a HF dataset (default: the Fiction4 dataset); the dataset format should be `["text", "label"]`, where `"label"` is the human gold standard
- cleans up text (whitespace removal mainly)
- takes the categorical scoring (positive, [neutral,] negative) and turns it continuous, using the model's assigned confidence score (see the `conv_scores` function in `utils.py`), & saves the results
- computes the Spearman correlation of the chosen models (plus the precomputed* vader and roberta_xlm_base baselines) with the human gold standard ("label"), both on the raw gold standard and [forthcoming] on a detrended version (see `utils.py`); a rough sketch of these two steps is shown below
*Note that the precomputed vader & roberta_xlm_base were applied to Google-translated sentences. For details, see our paper here -- these are the baselines to beat.
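The conversion and correlation steps might look roughly like the following. This is a minimal sketch, not the repository's code: the real mapping lives in `conv_scores` in `utils.py`, and the `train` split name, the lowercase label strings, the signed-confidence mapping and the neutral-maps-to-zero choice are all assumptions here.

```python
# Illustrative sketch only: the repo's actual conversion lives in conv_scores in utils.py.
# Assumed here (not verified): a "train" split, lowercase pipeline labels containing
# "pos"/"neg", neutral mapped to 0.0, and signed confidence as the continuous score.
from datasets import load_dataset
from scipy.stats import spearmanr
from transformers import pipeline

ds = load_dataset("chcaa/fiction4sentiment", split="train")   # columns: "text", "label"
texts = [" ".join(t.split()) for t in ds["text"]]             # whitespace cleanup
gold = ds["label"]                                            # human gold standard

clf = pipeline("sentiment-analysis",
               model="cardiffnlp/twitter-xlm-roberta-base-sentiment-multilingual")

def to_continuous(pred):
    """Turn a categorical prediction + confidence into a signed continuous score."""
    label, conf = pred["label"].lower(), pred["score"]
    if "pos" in label:
        return conf
    if "neg" in label:
        return -conf
    return 0.0  # neutral

scores = [to_continuous(p) for p in clf(texts, truncation=True)]

rho, pval = spearmanr(scores, gold)
print(f"Spearman rho vs. human gold standard: {rho:.3f} (p={pval:.3g})")
```

Spearman only compares rank orderings, so the arbitrary scale of the signed-confidence scores does not affect the correlation with the human ratings.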
- Compare performance on the continuous-scale converted scores to binary classification performance (using, e.g., the MeMo dataset: https://huggingface.co/datasets/MiMe-MeMo/MeMo-Dataset-SA)
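For that comparison, one hedged option is to binarize the continuous scores at a threshold and score them against the MeMo labels. The sketch below assumes a `train` split with `text` and 0/1 `label` columns and a 0.0 threshold; check the dataset card for the actual schema and adjust the label handling to the model's output strings.

```python
# Sketch of the binary comparison -- not verified against the dataset card.
# Assumed: MeMo-Dataset-SA has a "train" split with "text" and 0/1 "label" columns,
# the model's label strings contain "pos"/"neg", and 0.0 is a sensible threshold.
from datasets import load_dataset
from sklearn.metrics import accuracy_score, f1_score
from transformers import pipeline

memo = load_dataset("MiMe-MeMo/MeMo-Dataset-SA", split="train")
texts = [" ".join(t.split()) for t in memo["text"]]
gold = memo["label"]

clf = pipeline("sentiment-analysis", model="MiMe-MeMo/MeMo-BERT-SA")

def signed_score(pred):
    """+confidence for positive, -confidence for negative, 0.0 for neutral."""
    label, conf = pred["label"].lower(), pred["score"]
    return conf if "pos" in label else (-conf if "neg" in label else 0.0)

continuous = [signed_score(p) for p in clf(texts, truncation=True)]
binary = [1 if s > 0 else 0 for s in continuous]

print("accuracy:", accuracy_score(gold, binary))
print("macro F1:", f1_score(gold, binary, average="macro"))
```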