- `filter.py`

  This script goes through all the `JSON` files in the `dataset` folder and stores a tweet only if it matches the following criteria:
  - `extended_tweet` is NOT null
  - `lang` is `en` (English)
  - The tweet contains word(s) defined in the `keyWords` list

  It does not store every detail of a tweet, only the features we require for our purpose:
  - Twitter user description
  - Tweet

  All this information is stored in `csv` format (saved as `all_data.csv`).
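The three criteria above can be sketched roughly as follows. This is a minimal illustration, not the actual contents of `filter.py`: the keyword values, the helper names (`keep_tweet`, `to_row`, `filter_folder`), the one-tweet-per-line file layout, and the case-insensitive substring match are all assumptions.

```python
import csv
import json
from pathlib import Path

# Hypothetical keyword list; the real `keyWords` in filter.py may differ.
KEY_WORDS = ["migration", "refugee"]


def keep_tweet(tweet, key_words=KEY_WORDS):
    """Return True if the tweet passes the three criteria listed above."""
    extended = tweet.get("extended_tweet")
    if not extended:  # missing, null, or empty extended_tweet
        return False
    if tweet.get("lang") != "en":
        return False
    # Assumption: a keyword matches as a case-insensitive substring.
    text = extended.get("full_text", "").lower()
    return any(word.lower() in text for word in key_words)


def to_row(tweet):
    """Keep only the two features we need: user description and tweet text."""
    user = tweet.get("user") or {}
    return [user.get("description") or "", tweet["extended_tweet"]["full_text"]]


def filter_folder(folder="dataset", out_path="all_data.csv"):
    """Assumes each .json file holds one tweet object per line."""
    with open(out_path, "w", newline="", encoding="utf-8") as out:
        writer = csv.writer(out)
        for path in sorted(Path(folder).glob("*.json")):
            with open(path, encoding="utf-8") as fh:
                for line in fh:
                    line = line.strip()
                    if not line:
                        continue
                    tweet = json.loads(line)
                    if keep_tweet(tweet):
                        writer.writerow(to_row(tweet))
```

Keeping the predicate (`keep_tweet`) separate from the file handling makes it easy to check the criteria on a single tweet before running over the whole folder.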
- `label.py`

  Since we need to manually annotate all the selected tweets, this script provides a simple command-line interface to help with that. It presents the user with a tweet (from `all_data.csv`, line by line), and the user inputs `1` or `0`, where:
  - `1`: Tweet is migration relevant
  - `0`: Tweet is NOT migration relevant

  Once the user hits Enter, the label is stored in `train_label.csv`.
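A labelling loop of this kind might look like the sketch below. It is only an illustration under assumptions: the function name `label_tweets`, the `ask` parameter, the choice to treat the last column as the tweet text, and the output layout (original row plus label) are hypothetical and may not match `label.py`.

```python
import csv


def label_tweets(in_path="all_data.csv", out_path="train_label.csv", ask=input):
    """Show each tweet, read a 1/0 label, and append the labelled row."""
    with open(in_path, newline="", encoding="utf-8") as src, \
         open(out_path, "a", newline="", encoding="utf-8") as dst:
        writer = csv.writer(dst)
        for row in csv.reader(src):
            if not row:
                continue
            print("\n" + row[-1])  # assumption: tweet text is the last column
            answer = ask("Migration relevant? [1/0]: ").strip()
            while answer not in ("0", "1"):
                answer = ask("Please enter 1 or 0: ").strip()
            # Assumption: the output row is the input row plus the label.
            writer.writerow(row + [answer])
            dst.flush()  # keep labels on disk if the session is interrupted
```

Passing the prompt function as `ask` keeps the loop usable interactively (the default is `input`) while allowing it to be driven by scripted answers.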
- `annotation.ipynb`

  This notebook trains and evaluates classifiers on the labelled data. Pipeline (for now):
  - Import data and remove rows with null values in any column
  - Balance the dataset using SMOTE
  - Prepare `TF-IDF` and `Doc2Vec` feature extraction techniques
  - Provide appropriate data and labels to both techniques, and train classifiers using the retrieved feature vectors
  - Perform classification on a separate validation set
  - Print and plot the results
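To give a feel for what the TF-IDF step in the pipeline above produces, here is a small standard-library sketch. It is not the notebook's code, which may well rely on a library implementation; the function name `tfidf_vectors`, the whitespace tokenizer, and the unsmoothed formula `tf * log(N / df)` are assumptions chosen for brevity.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """Return (vocabulary, vectors) using tf = count / doc length, idf = log(N / df)."""
    tokenized = [doc.lower().split() for doc in docs]
    vocab = sorted({tok for toks in tokenized for tok in toks})
    n_docs = len(tokenized)
    # Document frequency: in how many documents each token appears.
    df = Counter(tok for toks in tokenized for tok in set(toks))
    idf = {tok: math.log(n_docs / df[tok]) for tok in vocab}
    vectors = []
    for toks in tokenized:
        counts = Counter(toks)
        total = len(toks) or 1  # avoid division by zero for empty documents
        vectors.append([counts[tok] / total * idf[tok] for tok in vocab])
    return vocab, vectors


vocab, vectors = tfidf_vectors(["migration crisis news", "football news"])
```

In this toy example a word that appears in every document (`news`) gets weight zero, while words specific to one tweet get positive weight, which is the property a classifier trained on these vectors can exploit.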
forked from harshildarji/DataScienceLab
ShrikanthSingh/DataScienceLab
About
Data Science Lab - SS - 2019