This repo accompanies our paper on distinguishing feuilleton fiction in Danish newspapers.
📝 In notes you will find the annotation scheme for the fiction/nonfiction categorization.
In scripts you'll find the code, including:
- get_features.py to get MFWs, TF-IDF, and stylistic/syntactic/affective features, the functions for which are defined in scripts/feature_utils.py
- classify.py, which employs a random forest model across our 4 different feature sets (MFW100, TF-IDF, selected features, and embeddings)
- descriptives.py, which visualizes and tests differences between the classes of fiction/nonfiction
- clustering_task.py, which tests embeddings for clustering feuilleton series (note that these need to be precomputed and are not available here because of their size)
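For readers unfamiliar with MFW features, here is a minimal sketch of the general idea: pick the most frequent words across the corpus and represent each text by the relative frequency of those words. This is an illustration only and not the code in get_features.py; the function names (`tokenize`, `top_mfw`, `mfw_features`), the simple letter-based tokenizer, and the two toy sentences are all assumptions made for the example.

```python
from collections import Counter
import re


def tokenize(text):
    """Lowercase the text and keep runs of letters (a deliberate simplification)."""
    return re.findall(r"[^\W\d_]+", text.lower())


def top_mfw(docs, n=100):
    """Return the n most frequent words across all documents (hypothetical helper)."""
    counts = Counter()
    for doc in docs:
        counts.update(tokenize(doc))
    return [word for word, _ in counts.most_common(n)]


def mfw_features(doc, vocab):
    """Relative frequency of each vocabulary word in one document."""
    tokens = tokenize(doc)
    counts = Counter(tokens)
    total = len(tokens) or 1  # avoid division by zero for empty texts
    return [counts[w] / total for w in vocab]


# Toy corpus, invented for illustration; the real data are newspaper texts.
docs = [
    "Hun saa paa ham og sagde intet",
    "Prisen paa korn steg i gaar paa børsen",
]
vocab = top_mfw(docs, n=3)  # MFW100 would use n=100 on the full corpus
features = [mfw_features(d, vocab) for d in docs]
```

Each row of `features` can then be passed to a classifier such as a random forest. The tokenization, vocabulary size, and normalization actually used in this repo may differ from this sketch.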
Note that the script for creating the various embeddings is at this anonymized repo.
The script to benchmark SA models on the Fiction4 corpus is in this anonymized repo.