This repository provides a character-level encoder/trainer for social media posts. See the Tweet2Vec paper for details.
The paper implements two models: the character-level tweet2vec and a word-level baseline. Each lives in its own directory, together with instructions on how to run it. General information about prerequisites and data format can be found below.
- Python 2.7
- Theano and all dependencies (latest)
- Lasagne (latest)
- Numpy
- Possibly others; if you get an import error, install the missing package with pip install
Unfortunately we are not allowed to release the data used in experiments from the paper, due to licensing restrictions. Hence, we describe the data format and preprocessing here -
- Preprocessing - We replace HTML tags, usernames, and URLs in the tweet text with special tokens. Hashtags are also removed from the body of a tweet, and re-tweets are discarded. Example code is provided in misc/preprocess.py.
- Encoding File Format - If you have a set of posts that you want to embed into a vector space, use the _encoder.sh scripts provided. The input file must contain one tweet per line (make sure you preprocess these first). An example is provided in misc/encoder_example.txt.
- Training File Format - To train the models from scratch, use the _trainer.sh scripts provided. The input file must contain one (hashtag, tweet) pair per line, separated by a tab. There should be only one tag per line; for tweets with multiple tags, split them into separate lines. See misc/trainer_example.txt for an example.
- Test/Validation File Format - After training the model, you can test it on a held-out set using the _tester.sh scripts provided. The file has the same format as the training file, except that a line can have multiple tags per tweet, separated by commas. An example is in misc/tester_example.txt.
Make sure to add THEANO_FLAGS=device=cpu,floatX=float32 before any command if you are running on a CPU.
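If you call the Python code directly instead of prefixing a shell command, the same flags can be set from within Python. The snippet below is a minimal sketch that assumes nothing has imported Theano yet in the process; Theano reads the THEANO_FLAGS environment variable when it is first imported, so the variable has to be set before that point.

```python
import os

# Set the flags only if the caller has not already provided THEANO_FLAGS.
# This must run before the first `import theano` in the process.
os.environ.setdefault("THEANO_FLAGS", "device=cpu,floatX=float32")

# import theano  # Theano would now pick up the CPU / float32 configuration
```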
Bhuwan Dhingra, Dylan Fitzpatrick, Zhong Zhou, Michael Muehl. Special thanks to Yun Fu for the preprocessing JAR-file.
If you end up using this code, please cite the following paper -
Dhingra, Bhuwan, Zhong Zhou, Dylan Fitzpatrick, Michael Muehl, and William W. Cohen. "Tweet2Vec: Character-Based Distributed Representations for Social Media." ACL (2016).
@InProceedings{dhingra-EtAl:2016:P16-2,
author = {Dhingra, Bhuwan and Zhou, Zhong and Fitzpatrick, Dylan and Muehl, Michael and Cohen, William},
title = {Tweet2Vec: Character-Based Distributed Representations for Social Media},
booktitle = {Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers)},
month = {August},
year = {2016},
address = {Berlin, Germany},
publisher = {Association for Computational Linguistics},
pages = {269--274},
url = {http://anthology.aclweb.org/P16-2044}
}
Report bugs and missing info to bdhingraATandrewDOTcmuDOTedu (replace AT, DOT appropriately).