Emil Trenckner Jessen & Johan Kresten Horsmans
Link to synopsis »
This project assesses the generalizability of fake news detection algorithms, focusing on data quantity and quality. We do this, first, by training and cross-testing BERT models on two very different datasets: one large dataset of relatively poor quality and another that contains significantly fewer entries but is of higher quality. Second, we investigate the non-static nature of the task by analyzing dynamic word embeddings over time.
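For a quick sense of what the cross-testing step looks like, here is a minimal sketch using HuggingFace transformers: a model fine-tuned on one dataset is evaluated on the other. The model path, column names, and label encoding are assumptions for illustration only; the actual implementation is in Analysis.ipynb (see the usage steps below).

# Cross-testing sketch: a model fine-tuned on dataset 1 is evaluated on
# dataset 2 (and vice versa). Paths and column names are illustrative only.
import pandas as pd
from transformers import pipeline

clf = pipeline("text-classification", model="bert_dataset_1")    # fine-tuned on dataset 1
test = pd.read_csv("data/generated_data/cleaned_dataset_2.csv")  # cross-tested on dataset 2

preds = clf(list(test["text"]), truncation=True)
pred_labels = [int(p["label"].split("_")[-1]) for p in preds]    # "LABEL_0"/"LABEL_1" -> 0/1
accuracy = sum(p == t for p, t in zip(pred_labels, test["label"])) / len(test)
print(f"Cross-test accuracy: {accuracy:.3f}")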
To use or reproduce our results, follow the steps below.
NOTE: There may be slight variations depending on the terminal and operating system you use. The following example is designed to work using the JupyterNotebook Latex application on UCloud, so the terminal code should work in a Unix-based bash. If you use a different IDE or operating system, there may be slight variations and hiccups. The steps also require that pip is installed, and you may want to create a new virtual environment for this project.
- Clone repository
- Download data and packages
- Run classification analysis
- Run word embedding analysis
Clone the repository and install the required packages by running the following lines in a Unix-based bash:
git clone https://github.com/emiltj/NLP_exam_2021.git
cd NLP_exam_2021
pip install -r requirements.txt
To replicate our results, we have included a bash script that automatically creates folders for the data (both raw and preprocessed) and retrieves it. Run the code below:
cd NLP_exam_2021
bash download_data.sh
Run the notebook Analysis.ipynb*
* If you are using the preprocessed data, you can simply skip all chunks in the notebook labelled "preprocessing" and proceed from the point where the preprocessed data is loaded. Furthermore, due to file-size constraints, the fine-tuned BERT models are not included in this repository and therefore need to be trained from scratch. If you wish to obtain the fine-tuned models, please reach out.
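If you want a rough idea of what training from scratch involves before running the notebook, here is a minimal fine-tuning sketch using HuggingFace transformers. The file paths, column names, and hyperparameters are assumptions for illustration; Analysis.ipynb is the authoritative implementation.

# Minimal BERT fine-tuning sketch (illustrative; paths, column names, and
# hyperparameters are assumptions - the notebook is authoritative)
import pandas as pd
from datasets import Dataset
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

df = pd.read_csv("data/generated_data/cleaned_dataset_1.csv")
ds = Dataset.from_pandas(df[["text", "label"]]).train_test_split(test_size=0.2)

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
ds = ds.map(lambda b: tokenizer(b["text"], truncation=True, padding="max_length",
                                max_length=512), batched=True)

model = AutoModelForSequenceClassification.from_pretrained("bert-base-uncased", num_labels=2)
args = TrainingArguments(output_dir="bert_dataset_1", num_train_epochs=2,
                         per_device_train_batch_size=16)

trainer = Trainer(model=model, args=args, tokenizer=tokenizer,
                  train_dataset=ds["train"], eval_dataset=ds["test"])
trainer.train()
trainer.save_model("bert_dataset_1")  # the saved model can then be cross-tested on the other dataset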
Perform the word embedding analysis by running the following in your Unix-based bash:
cd word_embeddings/
git clone https://github.com/JohanHorsmans/fastText.git
cd fastText
make
pip install .
cd ..
# Train models by doing the following:
python we.py model --action create
# Find the words with the highest cosine distance between the first and last period (note that this creates a new file, "raw.txt", which needs to be inspected):
python we.py model --action getCD --fromYear 0010 --toYear 0050
# Get the nearest neighbours for a given word in a given period:
python we.py model --action getNN --word [WORD] --period 0010
The different .txt files correspond to the periods listed in the synopsis (where 0010 is the first period and 0050 is the last).
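If you prefer to inspect the trained models directly rather than through we.py, the sketch below shows how the period models could be loaded with the fastText Python bindings. The model paths are assumptions based on the repository structure, and comparing vectors across two models is only meaningful if their vector spaces are comparable.

# Illustrative sketch only; we.py handles this via its CLI. Model paths are
# assumptions, and cross-period comparisons assume comparable vector spaces
# (otherwise an alignment step would be needed).
import fasttext
import numpy as np

m_first = fasttext.load_model("output/models/0010.bin")  # first period
m_last = fasttext.load_model("output/models/0050.bin")   # last period

def cosine_distance(word):
    """Cosine distance between a word's vectors in the first and last period."""
    a, b = m_first.get_word_vector(word), m_last.get_word_vector(word)
    return 1 - np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

print(cosine_distance("election"))               # how much has this word drifted?
print(m_last.get_nearest_neighbors("election"))  # its neighbours in the last period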
This repository has the following structure:
│ Analysis.ipynb
│ Analysis.pdf
│ download_preprocessed.sh
│ LICENSE
│ README.md
│ requirements.txt
│
├───data
│ ├───dataset_1
│ │ fake.csv*
│ │ real.csv*
│ │
│ ├───dataset_2
│ │ *.csv*
│ │
│ │
│ └───generated_data
│ .gitkeep
│ cleaned_dataset_1.csv*
│ cleaned_dataset_2.csv*
│ fake_periods.csv*
│
├───README_images
│ logo_au.png
│ nlp2.png
│
└───word_embeddings
│ .DS_Store
│ LICENSE
│ setup.cfg
│ we.py
│ raw.txt*
│
├───lib
│ │ .DS_Store
│ │ file.py
│ │ metadata.py
│ │ model.py
│ │ text.py
│ │ vector.py
│ │ website.py
│ │ __init__.py
│ │
│ └───websites
│ openbook.py
│ __init__.py
│
└───output
├───models
│
└───texts
0010.txt*
0020.txt*
0030.txt*
0040.txt*
0050.txt*
* These files are not included in this repository, but are acquired through the steps described in the Usage section.
For more exhaustive information on the data, see NLP_Exam_Synopsis.pdf
and the original papers.
The raw data for the analysis has originally been retrieved from the links below:
The analysis utilizes the Fake News dataset from the University of Victoria's research laboratory Information Security and Object Technology (ISOT). Access can be gained through an affiliated university or through Kaggle, which also hosts the dataset. The data was originally acquired and first used by Ahmed et al. (2017a, 2017b).
The analysis also utilizes the Fake News dataset by Horne et al. (2017). Access is open to anyone.
Distributed under the MIT License. See LICENSE
for more information.
Feel free to contact the authors, Emil Jessen or Johan Horsmans, with any questions regarding the scripts.
You may do so through our emails (Emil, Johan).
Furthermore, we would like to extend our gratitude towards the following:
- Barzokas et al. (2019/2021), the original authors of the repository containing the overall framework that we implemented and modified for our word embedding analysis.
- Horne et al. (2017) for providing data.
- Ahmed et al. (2017, 2018) for providing data.