This is a very simple attempt at classifying article titles into one of two groups: "clickbait" (a la Buzzfeed and Clickhole) or "news" (a la The New York Times). I was curious if this could be done accurately; I can't think of a good definition for "clickbait" but I know it when I see it.
If you have poetry installed, you shouldn't have to do a thing. You can install all necessary dependencies and run the demos with poetry run:
# train the classifier and show the top features
poetry run python -m clickbait_classifier.classifier
# enter an interactive classifier loop
poetry run python -m clickbait_classifier.interactive
If you don't use poetry, you can create a virtualenv, install the dependencies with pip, and then run the code:
python -m venv venv
source venv/bin/activate
pip install -r requirements.txt
python -m clickbait_classifier.classifier
python -m clickbait_classifier.interactive
If you have nix, you can use nix-shell, nix develop, direnv, or lorri to get all the necessary dependencies, including Poetry.
If you use flakes, you can run the demos without installing anything:
# train the classifier and show the top features
nix run github:peterldowns/clickbait-classifier#classifier
# enter an interactive classifier loop
nix run github:peterldowns/clickbait-classifier#interactive
The code is pretty messy, but the general idea is that there is some article data in the data/ directory, and classifier.py uses this for training. You can download more data from Buzzfeed and Clickhole using the tools in scripts/:
python ./scripts/scrape_buzzfeed.py > ./clickbait_classifier/data/buzzfeed2.json
python ./scripts/scrape_clickhole.py > ./clickbait_classifier/data/clickhole2.json
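To make the training idea concrete, here is a minimal, standard-library-only sketch of one common way to classify titles: multinomial naive Bayes over word n-grams. This is an assumption for illustration, not necessarily what classifier.py does. The names ngrams, TinyNaiveBayes, TITLES, and LABELS are hypothetical, and the four training titles are just the examples from the transcript further down, far too few for real use.

```python
import math
from collections import Counter


def ngrams(title, max_n=3):
    """Return lowercased word n-grams (n = 1..max_n) for a title."""
    words = title.lower().split()
    return [
        " ".join(words[i:i + n])
        for n in range(1, max_n + 1)
        for i in range(len(words) - n + 1)
    ]


class TinyNaiveBayes:
    """Hypothetical stand-in classifier: multinomial NB with add-one smoothing."""

    def fit(self, titles, labels):
        self.classes = sorted(set(labels))
        self.priors = Counter(labels)
        self.counts = {c: Counter() for c in self.classes}
        for title, label in zip(titles, labels):
            self.counts[label].update(ngrams(title))
        self.totals = {c: sum(self.counts[c].values()) for c in self.classes}
        self.vocab = set().union(*self.counts.values())
        return self

    def log_prob(self, feature, label):
        """Smoothed log P(feature | label)."""
        num = self.counts[label][feature] + 1
        den = self.totals[label] + len(self.vocab)
        return math.log(num / den)

    def predict_proba(self, title):
        """Return {label: probability}, ignoring n-grams never seen in training."""
        n_docs = sum(self.priors.values())
        features = [f for f in ngrams(title) if f in self.vocab]
        scores = {
            c: math.log(self.priors[c] / n_docs)
            + sum(self.log_prob(f, c) for f in features)
            for c in self.classes
        }
        # Normalize in log space to avoid underflow on long titles.
        top = max(scores.values())
        exp = {c: math.exp(s - top) for c, s in scores.items()}
        norm = sum(exp.values())
        return {c: v / norm for c, v in exp.items()}


# Toy training set (illustrative only): titles taken from the example transcript.
TITLES = [
    "43 Reasons 2014 Was The Best Year Ever To Be A Nerd",
    "Protesters And Police Clash In Missouri For A Second Night",
    "29 Christmas Vines That Will Make You Laugh Every Time",
    "New Subprime Boom Ties Risky Loans to Car Titles",
]
LABELS = ["clickbait", "news", "clickbait", "news"]

model = TinyNaiveBayes().fit(TITLES, LABELS)
print(model.predict_proba("29 Reasons That Will Make You Laugh"))
```

With real data you would load the JSON files from the data/ directory instead of the hard-coded list, and you would hold out part of the data to measure precision and recall.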
If you feel like testing a few article titles, you can get a simple testing loop like so:
python ./clickbait_classifier/interactive.py
This will load the classifier, train it, and then present you with a simple loop where you can paste in article titles and see the results. You can quit with Ctrl-C. For example:
clickbait-classifier/ $ ./interactive.py
Loading classifier (may take time to train.)
Classification report:
             precision    recall  f1-score   support
  clickbait       0.91      0.62      0.74       172
       news       0.90      0.98      0.94       621
avg / total       0.91      0.91      0.90       793
-9.0500 10 things -5.3044 new
-9.0500 11 things -5.7492 bush
-9.0500 13 times -5.8460 overview
-9.0500 15 times -5.9519 iraq
-9.0500 19 puppies -5.9645 war
-9.0500 2014 -5.9828 president
-9.0500 2015 -5.9852 clinton
-9.0500 21 -6.1021 special
-9.0500 23 life -6.1206 nation
-9.0500 24 -6.1464 report
-9.0500 25 -6.1778 campaign
-9.0500 27 -6.2223 china
-9.0500 33 -6.2880 york
-9.0500 35 -6.2880 new york
-9.0500 90s -6.2994 plan
-9.0500 90s kid -6.3191 special report
-9.0500 90s kids -6.3523 says
-9.0500 90s kids rejoice -6.4277 big
-9.0500 90s sitcom -6.4423 challenged
-9.0500 absolute -6.4465 house
Done.
Article title: 43 Reasons 2014 Was The Best Year Ever To Be A Nerd
(95.13% clickbait, 4.87% news) -> clickbait
Article title: Protesters And Police Clash In Missouri For A Second Night
(19.32% clickbait, 80.68% news) -> news
Article title: 29 Christmas Vines That Will Make You Laugh Every Time
(88.25% clickbait, 11.75% news) -> clickbait
Article title: New Subprime Boom Ties Risky Loans to Car Titles
(10.98% clickbait, 89.02% news) -> news
Article title: ^C
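The prompt-and-classify loop in the transcript above can be sketched roughly as follows. This is a hedged illustration, not the contents of interactive.py: format_result and interactive_loop are hypothetical names, and predict_proba is assumed to be any callable that maps a title to a dict of label probabilities.

```python
def format_result(probs):
    """Render probabilities like '(95.13% clickbait, 4.87% news) -> clickbait'."""
    parts = ", ".join(
        f"{p * 100:.2f}% {label}" for label, p in sorted(probs.items())
    )
    best = max(probs, key=probs.get)  # label with the highest probability
    return f"({parts}) -> {best}"


def interactive_loop(predict_proba, read=input, write=print):
    """Prompt for titles until Ctrl-C (KeyboardInterrupt) or end of input."""
    while True:
        try:
            title = read("Article title: ")
        except (KeyboardInterrupt, EOFError):
            write("")  # finish the prompt line cleanly before exiting
            break
        if title.strip():  # skip blank lines instead of classifying them
            write(format_result(predict_proba(title)))
```

To try it, call interactive_loop with your trained model's probability function. The read and write parameters exist only so the loop can be driven by something other than the terminal, for example in a test.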