This program analyzes sequences using uni-directional and bi-directional Recurrent Neural Networks (RNNs) with Long Short-Term Memory (LSTM), built on the Python library Keras (Documents and GitHub). It is based on the Keras examples lstm_text_generation.py and imdb_bidirectional_lstm.py.
This is part of my master's thesis project and is still in development.
- NumPy: The fundamental package for scientific computing with Python.
- SciPy: A Python-based ecosystem of open-source software for mathematics, science, and engineering.
- Theano: A Python library that allows you to define, optimize, and evaluate mathematical expressions involving multi-dimensional arrays efficiently.
- TensorFlow: An open-source software library for numerical computation using data flow graphs.
- Keras>=1.0: A minimalist, highly modular neural networks library, written in Python and capable of running on top of either TensorFlow or Theano. To update Keras:
    pip install git+git://github.com/fchollet/keras.git --upgrade --no-deps
- GPU support (optional but highly recommended). Instructions for enabling GPU support are here: for Theano and for TensorFlow.
- pydot and graphviz (optional, if you want to plot the model)
- HDF5 and h5py (optional, if you use the model saving/loading functions)
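Before running the program, the optional dependencies listed above can be probed from Python. This is a minimal sketch that uses only the standard library; the helper name `missing_packages` and the import names `pydot` and `h5py` are assumptions made for illustration and are not part of this repository.

```python
import importlib.util


def missing_packages(names):
    """Return the top-level module names in `names` that cannot be found."""
    return [name for name in names if importlib.util.find_spec(name) is None]


# Assumed import names of the optional packages.
optional = ["pydot", "h5py"]
for name in missing_packages(optional):
    print("optional package not found:", name)
```

A missing entry only means that model plotting or model saving/loading may be unavailable; the core requirements still have to be checked separately.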
A series of Recurrent Neural Networks tutorials:
- Part 1 - Introduction to RNNs
- Part 2 - Implementing a RNN with Python, Numpy and Theano
- Part 3 - Backpropagation Through Time and Vanishing Gradients
- Part 4 - Implementing a GRU/LSTM RNN with Python and Theano
Two great resources on LSTMs: Understanding LSTM Networks by Christopher Olah and Understanding LSTM and its diagrams by Shi Yan.
The best post on Andrej Karpathy's blog about sequence prediction with RNNs: The Unreasonable Effectiveness of Recurrent Neural Networks.
A more in-depth resource on RNNs: Chapter 10 - Sequence Modeling: Recurrent and Recursive Nets of the book Deep Learning (MIT Press).
- Uni-directional RNN model with two LSTM layers:
- Bi-directional RNN model with one LSTM layer:
- Naive Bayes model:
naive_bayes.py is a simple Naive Bayes model used for comparison.
- Training Set
- Validation Set
- Test Set
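The text does not say how the three sets are produced. As one possible illustration, a sequence can be cut into contiguous training, validation, and test parts so that its order is preserved. The function name `split_sequence` and the 80/10/10 fractions are assumptions, not values taken from this project.

```python
def split_sequence(sequence, train_frac=0.8, val_frac=0.1):
    """Split a sequence into contiguous training, validation and test parts."""
    if not 0 < train_frac < 1 or not 0 <= val_frac < 1:
        raise ValueError("fractions must lie between 0 and 1")
    if train_frac + val_frac >= 1:
        raise ValueError("fractions must leave room for a test set")
    n = len(sequence)
    train_end = int(n * train_frac)
    val_end = train_end + int(n * val_frac)
    # Slicing keeps the original order, so no item appears in two sets.
    return sequence[:train_end], sequence[train_end:val_end], sequence[val_end:]


train, val, test = split_sequence(list(range(100)))
print(len(train), len(val), len(test))  # 80 10 10
```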
hyperas may help with this. It is "a very simple convenience wrapper around hyperopt for fast prototyping with keras models" and is used for hyper-parameter optimization. An example can be found here.
Two good resources:
- CHAPTER 3: Improving the way neural networks learn by Michael Nielsen
- Neural Networks Part 2: Setting up the Data and the Loss and Neural Networks Part 3: Learning and Evaluation from the Stanford class CS231n: Convolutional Neural Networks for Visual Recognition
Considerations:
- Batch Size: how many streams of data are processed in parallel at one time.
- Samples per epoch and batches per epoch: how many samples or batches are considered per epoch. Based on some of my experiments: (i) the more samples there are, the higher the accuracy and the lower the loss at the stable stage; (ii) the more batches there are (the integer ratio #samples / batch_size), the higher the accuracy and the lower the loss at the stable stage, and the fewer iterations it takes to reach the same loss/accuracy value.
- Sentence Length: according to char-rnn:
The length of each data stream, which is also the limit at which the gradients can propagate backwards in time. For example, if seq_length is 20, then the gradient signal will never backpropagate more than 20 time steps, and the model might not find dependencies longer than this length in number of characters.
This is effectively the limit of the model's long-term memory.
Thus, if you have a very difficult dataset where there are a lot of long-term dependencies, you will want to increase this setting.
- Offset during sampling: the offset is the start index used when sampling X_train and y_train from the original sequence. It can be a fixed value or a random value between 0 and step-1.
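The interplay of sentence length, step, and offset can be sketched as a sliding window over the sequence. This is a minimal illustration in plain Python, not the project's actual sampling code; the function name `make_windows` and its parameters are hypothetical.

```python
import random


def make_windows(sequence, seq_length, step, offset=None, rng=None):
    """Cut `sequence` into (input window, next item) training pairs.

    offset=None draws a random start index in [0, step - 1].
    """
    if offset is None:
        rng = rng or random.Random()
        offset = rng.randrange(step)
    if not 0 <= offset < step:
        raise ValueError("offset must lie in [0, step - 1]")
    X, y = [], []
    for i in range(offset, len(sequence) - seq_length, step):
        X.append(sequence[i:i + seq_length])  # seq_length consecutive items
        y.append(sequence[i + seq_length])    # the item that follows them
    return X, y


X_train, y_train = make_windows("abcdefghij", seq_length=3, step=2, offset=1)
print(X_train)  # ['bcd', 'def', 'fgh']
print(y_train)  # ['e', 'g', 'i']
```

A random offset shifts where the windows start, so repeated sampling passes see differently aligned windows of the same sequence.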
- Data size vs. total number of parameters:
  - #layers: the number of layers; here it is suggested to always use a num_layers of either 2 or 3.
  - Layer size: the number of units per layer.
  According to char-rnn, the two important quantities to keep track of here are:
  - The total number of parameters in your model.
  - The size of your dataset.
  These two should be about the same order of magnitude.
How to calculate the number of parameters in an RNN? For example, consider one LSTM layer:
- if it has a layer size of H=512;
- if the vocabulary size is C=3000 (the number of unique classes);
- the layer will have three parameter matrices: U with dimension (H, C)=(512, 3000), V with dimension (C, H)=(3000, 512), and W with dimension (H, H)=(512, 512);
- the total number of parameters for one layer will be 2HC + H^2, which is 3,334,144 in this case.
That is more than 3 million parameters for only one layer! Note that this count describes a simple (vanilla) RNN layer together with its output projection V; an LSTM layer keeps separate input and recurrent weights for its three gates and its cell candidate, so its U and W parts are about four times larger.
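The count above can be checked with a few lines of Python. The first function reproduces the 2HC + H^2 formula from the text. The second is an assumption added for comparison: the parameter count of a standard LSTM layer without peephole connections, which has four weight blocks (three gates plus the cell candidate), each with an H x C input matrix, an H x H recurrent matrix, and a bias of length H, and which excludes the output projection V. Both function names are hypothetical.

```python
def simple_rnn_params(H, C):
    """Parameters of U (H x C), V (C x H) and W (H x H); biases ignored."""
    return 2 * H * C + H * H


def lstm_layer_params(H, C):
    """Parameters of one standard LSTM layer with input size C and H units.

    Four blocks, each with an input matrix, a recurrent matrix and a bias.
    """
    return 4 * (H * C + H * H + H)


print(simple_rnn_params(512, 3000))  # 3334144
print(lstm_layer_params(512, 3000))  # 7194624
```

Either way, the total should be compared against the size of the dataset, as recommended above.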
- Learning Rate: this value sets the step size of gradient descent and so influences both the speed and the quality of learning. The larger the learning rate, the faster the network trains; the smaller it is, the more accurate the training tends to be. According to LSTM: A Search Space Odyssey [1]:
  The learning rate is by far the most important hyperparameter. Based on their suggestion, when searching for a good learning rate for the LSTM, it is sufficient to do a coarse search by starting with a high value (e.g. 1.0) and dividing it by ten until performance stops increasing.
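The coarse search described above can be sketched as a short loop. This is only an illustration: `evaluate` is a hypothetical user-supplied callable that trains briefly with a given learning rate and returns a score where higher is better, and `toy_score` is a made-up stand-in so the example runs without a model.

```python
import math


def coarse_lr_search(evaluate, start=1.0, factor=10.0, max_steps=8):
    """Divide the learning rate by `factor` until the score stops increasing."""
    best_lr, best_score = start, evaluate(start)
    lr = start
    for _ in range(max_steps - 1):
        lr /= factor
        score = evaluate(lr)
        if score <= best_score:  # performance stopped increasing
            break
        best_lr, best_score = lr, score
    return best_lr, best_score


def toy_score(lr):
    """Made-up score that peaks at a learning rate of 1e-3."""
    return -abs(math.log10(lr) + 3.0)


best_lr, best_score = coarse_lr_search(toy_score)
print(best_lr)
```

In practice each call to `evaluate` is a full (possibly shortened) training run, so the number of steps is kept small.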
- Dropout: a float between 0 and 1 indicating the fraction of the hidden layer's outputs that are ignored when feeding the next layer. It is a powerful regularization method, mainly used to avoid overfitting. If your model is overfitting, try increasing the dropout value.
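To make the idea concrete, here is a small sketch of dropout on a list of activations. It uses the common "inverted dropout" formulation, in which the surviving values are rescaled during training; this is an illustration in plain Python under that assumption, not the Keras implementation, and the function name `dropout` is hypothetical.

```python
import random


def dropout(activations, rate, rng=None):
    """Zero each value with probability `rate`, rescale the rest.

    Survivors are multiplied by 1 / (1 - rate) so that the expected sum of
    the output matches the sum of the input.
    """
    if not 0.0 <= rate < 1.0:
        raise ValueError("rate must be in [0, 1)")
    rng = rng or random.Random()
    keep_scale = 1.0 / (1.0 - rate)
    return [0.0 if rng.random() < rate else a * keep_scale for a in activations]


hidden = [1.0] * 10
print(dropout(hidden, rate=0.5, rng=random.Random(0)))
```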
- Sampling function (temperature): the temperature parameter divides the predicted log probabilities before the softmax, so a lower temperature causes the model to make more likely, but also more boring and conservative, predictions. Higher temperatures cause the model to take more chances and increase the diversity of results, but at the cost of more mistakes.
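The effect of the temperature can be sketched in plain Python: divide the log probabilities by the temperature, renormalize with a softmax, and sample from the result. The function names `apply_temperature` and `sample_index` are hypothetical, and the sketch assumes that every input probability is strictly positive.

```python
import math
import random


def apply_temperature(probs, temperature=1.0):
    """Divide log-probabilities by the temperature and renormalize."""
    logits = [math.log(p) / temperature for p in probs]
    peak = max(logits)  # subtract the maximum for numerical stability
    exps = [math.exp(value - peak) for value in logits]
    total = sum(exps)
    return [e / total for e in exps]


def sample_index(probs, temperature=1.0, rng=None):
    """Draw one class index from the temperature-scaled distribution."""
    rng = rng or random.Random()
    scaled = apply_temperature(probs, temperature)
    return rng.choices(range(len(scaled)), weights=scaled, k=1)[0]


probs = [0.6, 0.3, 0.1]
print(apply_temperature(probs, 0.5))  # sharper: the top class gains mass
print(apply_temperature(probs, 2.0))  # flatter: the classes move closer
print(sample_index(probs, temperature=0.5, rng=random.Random(0)))
```

A temperature of 1.0 leaves the distribution unchanged, which makes it a convenient default.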
- Loss function: categorical_crossentropy
- Optimizer: RMSprop; you can also try other options such as plain SGD, Adagrad, and Adam.
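For reference, the categorical cross-entropy named above can be written out for a single sample. This is a minimal plain-Python sketch of the formula, not the Keras implementation; the function name and the `eps` guard against log(0) are assumptions.

```python
import math


def categorical_crossentropy(y_true, y_pred, eps=1e-12):
    """Cross-entropy between a one-hot target and predicted probabilities."""
    return -sum(t * math.log(max(p, eps)) for t, p in zip(y_true, y_pred))


y_true = [0.0, 1.0, 0.0]
good = categorical_crossentropy(y_true, [0.1, 0.8, 0.1])
bad = categorical_crossentropy(y_true, [0.7, 0.2, 0.1])
print(good, bad)  # the confident correct prediction has the lower loss
```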