This is the final project for a short, 10-week course in text mining. The coding, visualizations, and overall report were improved after completion of the course, then submitted as part of a portfolio requirement for graduation. In general, the project attempts to address several items, including the larger question -- Can Market Sentiment Predict the Stock Market?
To address this overall question, different techniques were applied.
- exploratory analysis: topic modeling determines which stocks to study
- sentiment analysis: the text corpus is normalized into sentiment scores
- Granger analysis: find significant relationships between sentiment scores and stock indices
- timeseries analysis: compare LSTM and ARIMA models on the sentiment and stock series
While the main focus of the study was the comparison between timeseries models, classification analysis was also performed. Specifically, signal analysis was used as the basis for classification:
- signal analysis: determine index points exceeding defined thresholds
- classification analysis: TF-IDF text corpus (X) trained against signal results (y)
In general, points exceeding the upper threshold were binned as 1, while points below the lower threshold were binned as -1. This approach provided the target vector (y) when using the TF-IDF corpus (X) during classification:
Note: the above animation was borrowed, and the associated code was adjusted to meet the requirements of this study.
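The binning scheme can be sketched as follows; the thresholds and index points are illustrative (not the study's actual `classify_threshold`), and leaving in-between points at 0 is an assumption:

```python
import numpy as np

def bin_signals(index_points, lower, upper):
    """Bin index points into -1/0/+1 target labels for classification."""
    points = np.asarray(index_points)
    y = np.zeros(len(points), dtype=int)
    y[points > upper] = 1    # exceeds the upper threshold
    y[points < lower] = -1   # falls below the lower threshold
    return y

# illustrative index points and thresholds
labels = bin_signals([5.0, 1.2, -4.5, 0.3], lower=-2.0, upper=2.0)
# labels: [1, 0, -1, 0]
```

The resulting labels would serve as the target vector (y) paired row-for-row with the TF-IDF matrix (X).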
While the exact details of the project can be reviewed in the associated write-up.docx, the remaining segments in this document will remain succinct.
This project requires the following packages:
$ sudo pip install nltk \
matplotlib \
twython \
quandl \
sklearn \
scikit-plot \
statsmodels \
seaborn \
wordcloud \
keras \
numpy \
h5py
Two different datasets were acquired via the Twython and Quandl APIs:
- financial analyst tweets
- stock market index/volume measures
Due to limitations of the Twitter API, only roughly 3,200 tweets could be collected for a given user timeline, while the Quandl data has a much larger limit. This imposed a limitation when joining the data: only a subset of the Quandl dataset was utilized during the analysis.
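Such a subsetting join can be sketched with pandas (Quandl returns pandas frames); both frames below are hypothetical stand-ins for the collected data, and the column names are assumptions:

```python
import pandas as pd

# hypothetical stand-in for the (smaller) tweet dataset
tweets = pd.DataFrame({
    'created_at': pd.to_datetime(['2018-03-01', '2018-03-02', '2018-03-05']),
    'text': ['tweet a', 'tweet b', 'tweet c'],
})

# hypothetical stand-in for the (larger) Quandl stock dataset
stocks = pd.DataFrame({
    'date': pd.date_range('2018-01-01', periods=120, freq='D'),
    'close': range(120),
})

# keep only the stock rows that overlap the tweet date window
window = stocks['date'].between(tweets['created_at'].min(),
                                tweets['created_at'].max())
subset = stocks[window]
```

The stock series is trimmed to the tweet window rather than the reverse, since the tweet collection is the binding constraint.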
The original aspiration was to complete the codebase using the app factory pattern. Due to time constraints, the codebase was not expanded into an application. However, the provided config-TEMPLATE.py must, at minimum, be copied to config.py in the same directory. If additional Twitter user timelines or Quandl stock indices are to be studied, the contents of the copied config.py need to match registered API tokens for each of the service providers. However, to run the codebase with the choices made in this study, no API keys need to be pasted into the configuration. Instead, the additional configurations need to be properly commented out, since only one analysis can be performed at a given time. Moreover, the timeseries sentiment models (consisting of both ARIMA and LSTM) have an added constraint: only one stock code can be enabled at a given time:
screen_name = [
'jimcramer',
'ReformedBroker',
'TheStalwart',
'LizAnnSonders',
'SJosephBurns'
]
codes = [
('BATS', 'BATS_AAPL'),
## ('BATS', 'BATS_AMZN'),
## ('BATS', 'BATS_GOOGL'),
## ('BATS', 'BATS_MMT'),
## ('BATS', 'BATS_NFLX'),
## ('CHRIS', 'CBOE_VX1'),
## ('NASDAQOMX', 'COMP-NASDAQ'),
## ('FINRA', 'FNYX_MMM'),
## ('FINRA', 'FNSQ_SPY'),
## ('FINRA', 'FNYX_QQQ'),
## ('EIA', 'PET_RWTC_D'),
## ('WFC', 'PR_CON_15YFIXED_IR'),
## ('WFC', 'PR_CON_30YFIXED_APR')
]
This is largely due to a rapidly growing memory requirement from keeping multiple trained ARIMA models in memory. Should this codebase be extended into an application, the latter issue could resolve itself. Nevertheless, additional controls can be adjusted in the same config.py, including the number of epochs, the number of LSTM cells and neurons (i.e. lstm_units), the signal analysis threshold (i.e. classify_threshold), and the TF-IDF feature reduction for classification (i.e. classify_chi2). After dependencies are installed and the necessary changes have been made, the script can be executed in a stepwise fashion:
$ pwd
/path/to/web-projects/ist-736
$ python app.py