Modeling Stock Market Sentiment with LSTMs and TensorFlow

In this tutorial, we will build a Long Short-Term Memory (LSTM) network to predict stock market sentiment from comments about the market.

Setup

We will use the following libraries for our analysis:

  • numpy - numerical computing library used to work with our data
  • pandas - data analysis library used to read in our data from csv
  • tensorflow - deep learning framework used for modeling

We will also be using Python's Counter object to count our vocabulary items, and we have a utils module that abstracts away many of the details of our data processing. Please read through utils.py to get a better understanding of how the data is preprocessed for analysis.


In [1]:
import numpy as np
import pandas as pd
import tensorflow as tf
import utils as utl
from collections import Counter

Processing Data

We will train the model using messages tagged with $SPY, the S&P 500 ETF, from StockTwits.com. StockTwits is a social media network for traders and investors to share their views about the stock market. When a user posts a message, they tag the relevant stock ticker ($SPY in our case) and have the option to tag the message with their sentiment – “bullish” if they believe the stock will go up and “bearish” if they believe the stock will go down.

Our dataset consists of approximately 100,000 messages posted in 2017 that are tagged with $SPY where the user indicated their sentiment. Before we get to our LSTM Network we have to perform some processing on our data to get it ready for modeling.

Read and View Data

First we read in our data using pandas and pull out the message and sentiment data into numpy arrays. Let's also take a look at a few samples to get familiar with the data set.


In [2]:
# read data from csv file
data = pd.read_csv("data/StockTwits_SPY_Sentiment_2017.gz",
                   encoding="utf-8",
                   compression="gzip",
                   index_col=0)

# get messages and sentiment labels
messages = data.message.values
labels = data.sentiment.values

# View sample of messages with sentiment

for i in range(10):
    print("Messages: {}...".format(messages[i]),
          "Sentiment: {}".format(labels[i]))


Messages: $SPY crazy day so far!... Sentiment: bearish
Messages: $SPY Will make a new ATH this week. Watch it!... Sentiment: bullish
Messages: $SPY $DJIA white elephant in room is $AAPL. Up 14% since election. Strong headwinds w/Trump trade & Strong dollar. How many 7's do you see?... Sentiment: bearish
Messages: $SPY blocks above. We break above them We should push to double top... Sentiment: bullish
Messages: $SPY Nothing happening in the market today, guess I'll go to the store and spend some $.... Sentiment: bearish
Messages: $SPY What an easy call. Good jobs report: good economy, markets go up.  Bad jobs report: no more rate hikes, markets go up.  Win-win.... Sentiment: bullish
Messages: $SPY BS market.... Sentiment: bullish
Messages: $SPY this rally all the cheerleaders were screaming about this morning is pretty weak. I keep adding 2 my short at all spikes... Sentiment: bearish
Messages: $SPY Dollar ripping higher!... Sentiment: bearish
Messages: $SPY no reason to go down !... Sentiment: bullish

Preprocess Messages

Working with raw text data often requires preprocessing the text in some fashion to normalize for context. In our case we want to normalize known unique "entities" that appear within messages and carry similar contextual meaning for sentiment analysis. This means we replace references to specific stock tickers, user names, URL links, and numbers with special tokens identifying each "entity". We will also lowercase everything and remove punctuation.


In [3]:
messages = np.array([utl.preprocess_ST_message(message) for message in messages])
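
The details are abstracted away in utils.py, but a minimal sketch of the kind of regex-based normalization preprocess_ST_message performs might look like this (a hypothetical reimplementation for illustration, not the actual helper):

import re

def preprocess_ST_message(text):
    """Hypothetical sketch: lowercase, replace entities with tokens, strip punctuation."""
    text = text.lower()
    text = re.sub(r'\$[a-z]+', '<TICKER>', text)    # stock tickers like $spy
    text = re.sub(r'@\w+', '<USER>', text)          # user mentions
    text = re.sub(r'https?://\S+', '<LINK>', text)  # URL links
    text = re.sub(r'\d+', '<NUMBER>', text)         # numbers
    text = re.sub(r'[^\w<>\s]', '', text)           # drop remaining punctuation
    return re.sub(r'\s+', ' ', text).strip()        # collapse whitespace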

Generate Vocab to Index Mapping

To work with raw text we need an encoding from words to numbers for our algorithm to work with. The first step is to collect our full vocabulary and create a mapping of each word to a unique index. We will use this word-to-index mapping shortly to prep our messages for analysis.

Note that in practice we may want to include only the vocabulary from our training set, to account for the fact that we will likely see new words in the wild when assessing results on our validation and test sets. Here, for simplicity and demonstration purposes, we use our entire data set.


In [4]:
messages[0]


Out[4]:
'<TICKER> crazy day so far'

In [5]:
full_lexicon = " ".join(messages).split()
vocab_to_int, int_to_vocab = utl.create_lookup_tables(full_lexicon)
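
A plausible sketch of create_lookup_tables, assuming it ranks words by frequency using the Counter we imported earlier (the real implementation lives in utils.py):

from collections import Counter

def create_lookup_tables(words):
    """Hypothetical sketch: map each word to an integer index (and back), most frequent first."""
    word_counts = Counter(words)
    sorted_vocab = sorted(word_counts, key=word_counts.get, reverse=True)
    # start indices at 1 so that 0 can be reserved for padding
    vocab_to_int = {word: ii for ii, word in enumerate(sorted_vocab, 1)}
    int_to_vocab = {ii: word for word, ii in vocab_to_int.items()}
    return vocab_to_int, int_to_vocab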

Check Message Lengths

We will also want to get a sense of the distribution of the lengths of our inputs, so we check the longest and average message lengths (the messages are still raw strings at this point, so these counts measure characters). We will need to make our input length uniform to feed the data into our model, so later we will have some decisions to make about possibly truncating some of the longer messages. We also notice that one message has no content remaining after preprocessing, so we remove it from our data set.


In [6]:
messages_lens = Counter([len(x) for x in messages])  # messages are still strings here, so len() counts characters
print("Zero-length messages: {}".format(messages_lens[0]))
print("Maximum message length: {}".format(max(messages_lens)))
print("Average message length: {}".format(np.mean([len(x) for x in messages])))


Zero-length messages: 1
Maximum message length: 244
Average message length: 78.21856920395598

In [7]:
messages, labels = utl.drop_empty_messages(messages, labels)
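
For completeness, a short sketch of what utl.drop_empty_messages might do (hypothetical, assuming numpy arrays in and out):

def drop_empty_messages(messages, labels):
    """Hypothetical sketch: drop messages (and their labels) that are empty after preprocessing."""
    keep = [i for i, m in enumerate(messages) if len(m) > 0]
    return messages[keep], labels[keep]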

Encode Messages and Labels

Earlier we mentioned that we need to "translate" our text to numbers for our algorithm to take as inputs. We call this translation an encoding. We encode our messages as sequences of numbers, where each number is the word's index from the mapping we made earlier. The phrase "I am bullish" would now look something like [1, 234, 5345], where each number is the index for the respective word in the message. For our sentiment labels we simply encode "bearish" as 0 and "bullish" as 1.


In [8]:
messages = utl.encode_ST_messages(messages, vocab_to_int)
labels = utl.encode_ST_labels(labels)
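
The encoding helpers could plausibly look like the following sketches (hypothetical; the actual versions are in utils.py):

import numpy as np

def encode_ST_messages(messages, vocab_to_int):
    """Hypothetical sketch: turn each message into a list of word indices."""
    return [[vocab_to_int[word] for word in message.split()] for message in messages]

def encode_ST_labels(labels):
    """Hypothetical sketch: encode 'bullish' as 1 and 'bearish' as 0."""
    return np.array([1 if sentiment == 'bullish' else 0 for sentiment in labels])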

Pad Messages

The last thing we need to do is make our message inputs the same length. LSTMs can usually handle sequence inputs up to around 500 items in length, so we won't truncate any of the messages here; instead we zero-pad everything to the maximum length of 244 found above. We use left padding, which pads any message shorter than 244 with 0s at the beginning. So our encoded "I am bullish" message goes from [1, 234, 5345] (length 3) to [0, 0, 0, 0, 0, 0, ... , 0, 0, 1, 234, 5345] (length 244).


In [9]:
messages2 = utl.zero_pad_messages(messages, seq_len=244)
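
A minimal sketch of left zero-padding (hypothetical; it also truncates anything longer than seq_len):

import numpy as np

def zero_pad_messages(messages, seq_len):
    """Hypothetical sketch: left-pad each encoded message with zeros up to seq_len."""
    padded = np.zeros((len(messages), seq_len), dtype=int)
    for i, message in enumerate(messages):
        tail = message[:seq_len]                # truncate if longer than seq_len
        padded[i, seq_len - len(tail):] = tail  # zeros on the left, words on the right
    return padded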

In [10]:
# From here on we depart from the sentiment task to sanity-check the network on
# multi-dimensional numeric sequences: take the last five word indices of each
# padded message as a toy sequence, and use the final word index as a stand-in
# "label" (note that this overwrites the sentiment labels).
mess = [i[-6:-1] for i in messages2]
labels = [i[-1] for i in messages2]

In [11]:
BIG_N = 1600  # number of toy sequences to keep

# Pair each word index with its square root so that every timestep carries a
# 2-dimensional input: shapes become (n_sequences, len_sequence, dim_input).
X = [[i for i in zip(mess[j], np.sqrt(mess[j]))] for j in range(BIG_N)]
labels = [labels[j] for j in range(BIG_N)]

some_2d_sequences = np.array(X).astype(float)
some_2d_labels = np.array(labels).astype(int)

In [12]:
print('shape: n_sequences, len_sequence, dim_input', some_2d_sequences.shape)
print('shape labels: (n_sequences,)', some_2d_labels.shape)
some_2d_labels[100]  # a "label" is now just a word index, not 0/1 sentiment


shape: n_sequences, len_sequence, dim_input (1600, 5, 2)
shape labels: (n_sequences,) (1600,)
Out[12]:
544

Train, Test, Validation Split

The last thing we do is split our data into training, validation, and test sets and observe the size of each set.


In [13]:
train_x, val_x, test_x, train_y, val_y, test_y = utl.train_val_test_split(some_2d_sequences, some_2d_labels, split_frac=0.80)

print("Data Set Size")
print("Train set: \t\t{}".format(train_x.shape), 
      "\nValidation set: \t{}".format(val_x.shape),
      "\nTest set: \t\t{}".format(test_x.shape))


Data Set Size
Train set: 		(1280, 5, 2) 
Validation set: 	(160, 5, 2) 
Test set: 		(160, 5, 2)
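
Given the 80/10/10 split observed above, utl.train_val_test_split plausibly splits off the training fraction and divides the remainder evenly between validation and test, along these lines (hypothetical sketch):

def train_val_test_split(x, y, split_frac):
    """Hypothetical sketch: split_frac to train, the rest split 50/50 into val and test."""
    n_train = int(len(x) * split_frac)
    train_x, train_y = x[:n_train], y[:n_train]
    rest_x, rest_y = x[n_train:], y[n_train:]
    half = len(rest_x) // 2
    val_x, val_y = rest_x[:half], rest_y[:half]
    test_x, test_y = rest_x[half:], rest_y[half:]
    return train_x, val_x, test_x, train_y, val_y, test_y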

In [14]:
###### JUST SOME CHALKBOARD STUFF
# A quick scratch cell to verify what tf.nn.embedding_lookup actually returns.
foo = np.array([[1, 2, 3], [4, 4, 5]])  # two "messages" of three word indices each
embedding = tf.Variable(tf.random_uniform((7, 4), -1, 1))  # vocab size 7, embedding size 4
embed = tf.nn.embedding_lookup(embedding, foo)

# setup the variable initialisation
init_op = tf.global_variables_initializer()
with tf.Session() as sess:
    # initialise the variables
    sess.run(init_op)
    # compute the output of the graph; a_out holds the looked-up embeddings
    a_out = sess.run(embed)
    print("train_x is: ", train_x[0:2], train_x[0:2].shape)
    print("foo is: ", foo, foo.shape)
    print("SHAPE a is {}".format(a_out.shape))
    print("Variable a is {}".format(a_out))

# This confirms that embedding_lookup turns a sequence of word indices into an
# n-dimensional series, with n = embedding_size; so an LSTM should handle
# n-dimensional time series inputs the same way.


train_x is:  [[[0.00000000e+00 0.00000000e+00]
  [1.00000000e+00 1.00000000e+00]
  [2.96000000e+02 1.72046505e+01]
  [2.00000000e+00 1.41421356e+00]
  [3.40000000e+01 5.83095189e+00]]

 [[7.60000000e+01 8.71779789e+00]
  [1.00000000e+00 1.00000000e+00]
  [1.16300000e+03 3.41027858e+01]
  [6.19000000e+03 7.86765531e+01]
  [1.46300000e+03 3.82491830e+01]]] (2, 5, 2)
foo is:  [[1 2 3]
 [4 4 5]] (2, 3)
SHAPE a is (2, 3, 4)
Variable a is [[[ 0.8473737   0.76758385  0.74345636 -0.24620223]
  [ 0.96984005 -0.10625911  0.5138743   0.97139716]
  [ 0.8453195  -0.61370397 -0.00463629 -0.00317121]]

 [[-0.5286336  -0.8768945   0.12945795  0.07023263]
  [-0.5286336  -0.8768945   0.12945795  0.07023263]
  [ 0.08950806 -0.48774958  0.19337893  0.569     ]]]

Building and Training our LSTM Network

In this section we will define a number of functions that will construct the items in our network. We will then use these functions to build and train our network.

Model Inputs

Here we simply define a function to build TensorFlow placeholders for our message sequences, our labels, and a variable called keep probability associated with dropout (we will talk more about this later). Since our toy inputs carry 2-dimensional features at each timestep rather than word indices, the inputs placeholder gets a final feature dimension of n_dims.


In [15]:
n_dims = 2 # here, we test for 2 dimensions!
def model_inputs():
    """
    Create the model inputs
    """
    inputs_ = tf.placeholder(tf.float32, [None, None, n_dims], name='inputs')
    labels_ = tf.placeholder(tf.int32, [None, None], name='labels')
    keep_prob_ = tf.placeholder(tf.float32, name='keep_prob')
    
    return inputs_, labels_, keep_prob_

Embedding Layer

In TensorFlow, word embeddings are represented as a vocabulary size x embedding size matrix whose weights are learned during training. The embedding lookup is then just a simple lookup into the embedding matrix based on the index of the current word. Because our inputs here are already numeric 2-dimensional sequences, our "embedding layer" simply passes the inputs through unchanged; the commented-out code shows the standard word-embedding lookup.


In [16]:
def build_embedding_layer(inputs_):
    """
    Create the embedding layer. Our inputs are already numeric sequences,
    so we pass them through unchanged; the standard word-embedding lookup
    is left commented out for reference.
    """
#     embedding = tf.Variable(tf.random_uniform((vocab_size, embed_size), -1, 1))
#     embed = tf.nn.embedding_lookup(embedding, inputs_)
    return inputs_

LSTM Layers

TensorFlow makes it extremely easy to build LSTM Layers and stack them on top of each other. We represent each LSTM layer as a BasicLSTMCell and keep these in a list to stack them together later. Here we will define a list with our LSTM layer sizes and the number of layers.

We then wrap each of these LSTM layers in a DropoutWrapper. Dropout is a regularization technique used in neural networks in which any individual node has a probability of “dropping out” of the network during a given iteration of learning. This makes the model more generalizable by ensuring that it is not too dependent on any given nodes.

Finally, we stack these layers using a MultiRNNCell, generate a zero initial state, and connect our stacked LSTM layers to our inputs using dynamic_rnn. Here we track the output and the final state of the LSTM cell, which we will need to pass between mini-batches during training.


In [17]:
def build_lstm_layers(lstm_sizes, embed, keep_prob_, batch_size):
    """
    Create the LSTM layers
    """
    lstms = [tf.contrib.rnn.BasicLSTMCell(size) for size in lstm_sizes]
    # Add dropout to the cell
    drops = [tf.contrib.rnn.DropoutWrapper(lstm, output_keep_prob=keep_prob_) for lstm in lstms]
    # Stack up multiple LSTM layers, for deep learning
    cell = tf.contrib.rnn.MultiRNNCell(drops)
    # Getting an initial state of all zeros
    initial_state = cell.zero_state(batch_size, tf.float32)
    
    lstm_outputs, final_state = tf.nn.dynamic_rnn(cell, embed, initial_state=initial_state)
    
    return initial_state, lstm_outputs, cell, final_state

Loss Function and Optimizer

First, we get our predictions by passing the final output of the LSTM layers through a sigmoid activation via a TensorFlow fully connected layer. We only want the final output for making predictions, so we pull it out using [:, -1] indexing on our LSTM outputs before applying the sigmoid. We then pass these predictions to our mean squared error loss function and use the Adadelta optimizer to minimize the loss.


In [18]:
def build_cost_fn_and_opt(lstm_outputs, labels_, learning_rate):
    """
    Create the Loss function and Optimizer
    """
    predictions = tf.contrib.layers.fully_connected(lstm_outputs[:, -1], 1, activation_fn=tf.sigmoid)
    loss = tf.losses.mean_squared_error(labels_, predictions)
    optimizer = tf.train.AdadeltaOptimizer(learning_rate).minimize(loss)
    
    return predictions, loss, optimizer

Accuracy

Finally, we define our accuracy metric for assessing the model performance across our training, validation and test sets. Even though accuracy is just a calculation based on results, everything in TensorFlow is part of a Computation Graph. Therefore, we need to define our loss and accuracy nodes in the context of the rest of our network graph.


In [19]:
def build_accuracy(predictions, labels_):
    """
    Create accuracy
    """
    correct_pred = tf.equal(tf.cast(tf.round(predictions), tf.int32), labels_)
    accuracy = tf.reduce_mean(tf.cast(correct_pred, tf.float32))
    
    return accuracy

Training

We are finally ready to build and train our LSTM network! First, we call each of the functions we have defined to construct the network. Then we define a Saver so we can write our model to disk and load it for future use. Finally, we open a TensorFlow Session to train the model over a predefined number of epochs using mini-batches. At the end of each epoch we print the loss, training accuracy, and validation accuracy to monitor the results as we train.


In [20]:
def build_and_train_network(lstm_sizes, epochs, batch_size,
                            learning_rate, keep_prob, train_x, val_x, train_y, val_y):
    
    inputs_, labels_, keep_prob_ = model_inputs()
    embed = build_embedding_layer(inputs_)
    initial_state, lstm_outputs, lstm_cell, final_state = build_lstm_layers(lstm_sizes, embed, keep_prob_, batch_size)
    predictions, loss, optimizer = build_cost_fn_and_opt(lstm_outputs, labels_, learning_rate)
    accuracy = build_accuracy(predictions, labels_)
    
    saver = tf.train.Saver()
    
    with tf.Session() as sess:
        
        sess.run(tf.global_variables_initializer())
        n_batches = len(train_x)//batch_size
        for e in range(epochs):
            state = sess.run(initial_state)
            
            train_acc = []
            for ii, (x, y) in enumerate(utl.get_batches(train_x, train_y, batch_size), 1):
                feed = {inputs_: x,
                        labels_: y[:, None],
                        keep_prob_: keep_prob,
                        initial_state: state}
                loss_, state, _,  batch_acc = sess.run([loss, final_state, optimizer, accuracy], feed_dict=feed)
                train_acc.append(batch_acc)
                
                if (ii + 1) % n_batches == 0:
                    
                    val_acc = []
                    val_state = sess.run(lstm_cell.zero_state(batch_size, tf.float32))
                    for xx, yy in utl.get_batches(val_x, val_y, batch_size):
                        feed = {inputs_: xx,
                                labels_: yy[:, None],
                                keep_prob_: 1,
                                initial_state: val_state}
                        val_batch_acc, val_state = sess.run([accuracy, final_state], feed_dict=feed)
                        val_acc.append(val_batch_acc)
                    
                    print("Epoch: {}/{}...".format(e+1, epochs),
                          "Batch: {}/{}...".format(ii+1, n_batches),
                          "Train Loss: {:.3f}...".format(loss_),
                          "Train Accruacy: {:.3f}...".format(np.mean(train_acc)),
                          "Val Accuracy: {:.3f}".format(np.mean(val_acc)))
    
        saver.save(sess, "checkpoints/sentiment.ckpt")
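
The batching helper used above, utl.get_batches, presumably yields full mini-batches and drops any leftover tail smaller than batch_size (consistent with n_batches = len(train_x)//batch_size in the training loop); a hypothetical sketch:

def get_batches(x, y, batch_size):
    """Hypothetical sketch: yield (x, y) mini-batches, dropping the leftover tail."""
    n_batches = len(x) // batch_size
    x, y = x[:n_batches * batch_size], y[:n_batches * batch_size]
    for start in range(0, len(x), batch_size):
        yield x[start:start + batch_size], y[start:start + batch_size]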

Next we define our model hyperparameters. We will build a 2-layer LSTM network with hidden layer sizes of 8 and 4 respectively, and train over 4 epochs with mini-batches of size 16. Since our inputs are already 2-dimensional numeric sequences, no embedding layer is needed. We will use an initial learning rate of 0.1, though our Adadelta optimizer will adapt this over time, and a keep probability of 0.5.


In [23]:
# Define Inputs and Hyperparameters
lstm_sizes = [8, 4]
# vocab_size = len(vocab_to_int) + 1 #add one for padding
# embed_size = 30
epochs = 4
batch_size = 16
learning_rate = 0.1
keep_prob = 0.5

and now we train!


In [24]:
with tf.Graph().as_default():
    build_and_train_network(lstm_sizes, epochs, batch_size,
                            learning_rate, keep_prob, train_x, val_x, train_y, val_y)


Epoch: 1/4... Batch: 80/80... Train Loss: 2229614.750... Train Accuracy: 0.100... Val Accuracy: 0.100
Epoch: 2/4... Batch: 80/80... Train Loss: 2229649.250... Train Accuracy: 0.112... Val Accuracy: 0.112
Epoch: 3/4... Batch: 80/80... Train Loss: 2229560.750... Train Accuracy: 0.110... Val Accuracy: 0.112
Epoch: 4/4... Batch: 80/80... Train Loss: 2229586.000... Train Accuracy: 0.115... Val Accuracy: 0.112

Testing our Network

The last thing we want to do is check the model accuracy on our test data to make sure it is in line with expectations. We build the computational graph just as before; however, instead of training, we restore our saved model from the checkpoint directory and run our test data through the model.


In [18]:
def test_network(model_dir, batch_size, test_x, test_y):
    
    inputs_, labels_, keep_prob_ = model_inputs()
    embed = build_embedding_layer(inputs_)
    initial_state, lstm_outputs, lstm_cell, final_state = build_lstm_layers(lstm_sizes, embed, keep_prob_, batch_size)
    predictions, loss, optimizer = build_cost_fn_and_opt(lstm_outputs, labels_, learning_rate)
    accuracy = build_accuracy(predictions, labels_)
    
    saver = tf.train.Saver()
    
    test_acc = []
    with tf.Session() as sess:
        saver.restore(sess, tf.train.latest_checkpoint(model_dir))
        test_state = sess.run(lstm_cell.zero_state(batch_size, tf.float32))
        for ii, (x, y) in enumerate(utl.get_batches(test_x, test_y, batch_size), 1):
            feed = {inputs_: x,
                    labels_: y[:, None],
                    keep_prob_: 1,
                    initial_state: test_state}
            batch_acc, test_state = sess.run([accuracy, final_state], feed_dict=feed)
            test_acc.append(batch_acc)
        print("Test Accuracy: {:.3f}".format(np.mean(test_acc)))

In [19]:
with tf.Graph().as_default():
    test_network('checkpoints', batch_size, test_x, test_y)


INFO:tensorflow:Restoring parameters from checkpoints/sentiment.ckpt
Test Accuracy: 0.717