Practical LSTM example

Adventures in Overfitting

The Unreasonable Effectiveness of LSTM…at overfitting

Use case: a custom news feed - StreetEYE.com


In [2]:
import os
from IPython.display import display, Image
display(Image('./se.png', width=900))


  • Like Hacker News and other aggregation sites/apps
  • Step 1: Follow a lot of people on social media
    • Who are influential (high centrality)
      • For instance, one can build a social graph with the Twitter API and rank accounts by centrality (see the sketch after this list)
    • Who frequently share financial news stories
      • That are timely
      • That are topical
      • That tend to go viral since they are timely, topical, and shared by influential people
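
A minimal sketch of the centrality idea, assuming a list of (follower, followed) edges already pulled from the Twitter API; the edges, accounts, and choice of PageRank here are illustrative assumptions, not the actual StreetEYE pipeline:

import networkx as nx

# hypothetical "follower -> followed" edges collected via the Twitter API
edges = [
    ('alice', 'ReformedBroker'),
    ('bob', 'ReformedBroker'),
    ('alice', 'valuewalk'),
    ('ReformedBroker', 'valuewalk'),
]

G = nx.DiGraph(edges)
# PageRank as a simple influence/centrality score; high scores = accounts worth following
centrality = nx.pagerank(G)
for account, score in sorted(centrality.items(), key=lambda kv: -kv[1]):
    print(account, round(score, 3))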

Social graph (force-directed layout)


In [3]:
display(Image('./socialgraph.png', width=656))


  • Step 2: Collect data on which stories were relevant+popular
    • Implicitly/algorithmically, by what people read
    • Explicitly, by upvoting/tagging stories that are relevant and popular
    • Bottom line: get a big corpus of headlines, with labels
  • Looks like a supervised learning problem:

    • I have a corpus that's labeled and classified
    • I want to automatically classify stories, as they are discovered, as probably relevant+popular or not
  • Stories labeled relevant+popular are typically

    • From a reputable source: more likely Bloomberg/FT than TMZ/Tumblr/YouTube
    • Shared by a lot of people
    • Headline words are topical, market-related (Janet Yellen, not Janet Jackson)
  • Step 3: Dump a data file

    • Label - frontpage or not
    • Domain e.g. bloomberg.com: 1-hot vector of ~100 most frequent (frontpage) domains
    • Who shared it: 1-hot vector of ~1000 most frequent (frontpage) Twitter accounts, blogs, etc., e.g. https://twitter.com/valuewalk http://thereformedbroker.com/

      • And catchall count(*) of other sharers not in top 1000
    • Text of headline

      • Stem it with nltk, e.g. doing, does, did -> do (see the preprocessing sketch after this list)
        • Why? Socher says it doesn't matter
        • But the first attempt was a Naive Bayes/bag-of-words model with frequent/significant words
        • So the pipeline already exists
        • Can't hurt
        • Socher has more data/server farms than I do
      • Concatenate the most frequent ngrams, e.g. Federal Reserve -> federal_reserve
      • Skip stopwords: 'the', 'a', etc.
    • To make the classes less unbalanced, downsample the very common case of a headline shared by only 1 or 2 people that never becomes popular/relevant
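
A rough sketch of that headline preprocessing (join frequent ngrams, drop stopwords, stem with nltk); the ngram table and stopword list are stand-ins and the exact pipeline details are assumptions:

from nltk.stem.porter import PorterStemmer

stemmer = PorterStemmer()
# small stand-in stopword list; nltk's stopword corpus is another option
STOPWORDS = {'the', 'a', 'an', 'to', 'in', 'of', 'and', 'is'}
# most frequent ngrams to concatenate into single tokens (stand-in list)
NGRAMS = {('federal', 'reserve'): 'federal_reserve'}

def clean_headline(headline):
    tokens = headline.lower().split()
    # concatenate most frequent ngrams, e.g. federal reserve -> federal_reserve
    out, i = [], 0
    while i < len(tokens):
        if tuple(tokens[i:i + 2]) in NGRAMS:
            out.append(NGRAMS[tuple(tokens[i:i + 2])])
            i += 2
        else:
            out.append(tokens[i])
            i += 1
    # skip stopwords and stem the rest (doing, does, did -> do); leave joined ngrams as-is
    return ' '.join(w if '_' in w else stemmer.stem(w) for w in out if w not in STOPWORDS)

clean_headline("What will happen to the Federal Reserve in 2017")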

In [2]:
row = [1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,
 0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,
 0,0,0,0,0,0,1,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,
 0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,
 0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,
 0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,
 0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,
 0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,
 0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,
 0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,
 0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,
 0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,
 0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,
 0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,
 0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,
 0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,
 0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,
 0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,
 0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,
 0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,
 0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,
 0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,
 0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,
 0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,
 0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,
 0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,
 0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,
 0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,
 0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,
 0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,
 0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,
 0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,
 0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,
 0,0,0,0,0,0,0,1,
 'what go happen 2017']
len(row)


Out[2]:
1824
  • Step 4: word2vec
    • Why word2vec, not GloVe? Because I have a working word2vec from Udacity; GloVe seems simpler and better per Socher, but I couldn't find / didn't write a good GloVe implementation (is it in the Socher materials?)
    • Dump the entire headline corpus (the last text field of each row) into one long document
    • Take the top 10,000 words
    • Initialize using the Google News pretrained vectors
    • Train further for an hour or two on my own corpus (a gensim sketch follows this list)
    • Works pretty well - look up 'jpmorgan', 'yellen', etc. using the TensorFlow embeddings projector (this is very cool if you haven't played around with it before)
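
A minimal sketch of this step with gensim's word2vec (gensim 4 API); the original used a TensorFlow word2vec from the Udacity exercises and seeded it with the Google News vectors, which is omitted here, and 'headlines.txt' is a hypothetical dump with one preprocessed headline per line:

from gensim.models import Word2Vec

# one preprocessed headline per line in a hypothetical dump file
sentences = [line.split() for line in open('headlines.txt')]

w2v = Word2Vec(sentences, vector_size=300, window=5, min_count=5, workers=4)
# sanity check: 'yellen' should land near other central-bank terms if training worked
print(w2v.wv.most_similar('yellen', topn=10))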

In [6]:
display(Image('./projector.png', width=656))


Step 5: Run a feedforward NN; each observation is the 1-hot vectors plus averaged embeddings

  • Label
  • Giant 1-hot vector, ~1800 columns
  • As a first attempt, I average all the word embeddings in each headline
  • Now I have ~2100 columns and a label
  • Run this through a neural net (a feature-assembly sketch follows; the model is in the cell after it)
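
A sketch of how one observation might be assembled; onehot_features, headline_tokens, and the gensim model w2v are assumed names standing in for the actual pipeline:

import numpy as np

def make_row(onehot_features, headline_tokens, w2v, dim=300):
    # average the embeddings of the headline words that are in the vocabulary
    vecs = [w2v.wv[t] for t in headline_tokens if t in w2v.wv]
    avg = np.mean(vecs, axis=0) if vecs else np.zeros(dim)
    # ~1800 one-hot metadata columns + 300 embedding columns -> ~2100 features
    return np.concatenate([np.asarray(onehot_features, dtype=np.float32), avg])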

In [ ]:
# function to generate model
# (f_score is a custom metric and NUM_FEATURES the input width, both defined earlier
#  in the notebook; imports repeated here so the cell stands alone)

from keras.models import Sequential
from keras.layers import Dense, Dropout
from keras.regularizers import l1

def create_model(nn_size=30, nn_reg_penalty=0.0, nn_dropout=(1.0/3.0)):
    # create model
    model = Sequential()

    # single hidden layer: relu activation, L1 weight penalty
    model.add(Dense(nn_size,
                    activation='relu',
                    kernel_initializer='TruncatedNormal',
                    kernel_regularizer=l1(nn_reg_penalty),
                    input_shape=(NUM_FEATURES,)
                   ))

    model.add(Dropout(nn_dropout))

    # sigmoid output for the binary frontpage/not-frontpage label
    model.add(Dense(1,
                    activation='sigmoid',
                    kernel_initializer='TruncatedNormal',
                    kernel_regularizer=l1(nn_reg_penalty)
                   ))

    model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy', f_score])
    print(model.summary())
    return model
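
The f_score metric above is defined elsewhere in the notebook; a plausible reconstruction of a batch-wise F-beta metric (beta=0.667, i.e. weighting precision more than recall) using the Keras backend might look like the following — an assumption, not the author's exact code:

from keras import backend as K

def f_score(y_true, y_pred, beta=0.667):
    # hypothetical reconstruction: precision/recall computed on the current batch
    y_pred = K.round(K.clip(y_pred, 0, 1))
    tp = K.sum(y_true * y_pred)
    precision = tp / (K.sum(y_pred) + K.epsilon())
    recall = tp / (K.sum(y_true) + K.epsilon())
    bb = beta ** 2
    return (1 + bb) * precision * recall / (bb * precision + recall + K.epsilon())
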
  • 1 hidden layer, relu activation
  • Hyperparameters:
    • NN size
    • L1 regularization penalty
    • Dropout
  • Do a grid search (sketch below)
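
The grid search can be as simple as nested loops over the three hyperparameters; a sketch, assuming train/xval splits X_train, y_train, X_val, y_val already exist:

best = None
for nn_size in [16, 32, 64, 128]:
    for nn_reg_penalty in [0.0, 0.0001, 0.001]:
        for nn_dropout in [0.25, 1.0 / 3.0, 0.5]:
            model = create_model(nn_size, nn_reg_penalty, nn_dropout)
            hist = model.fit(X_train, y_train,
                             validation_data=(X_val, y_val),
                             epochs=200, batch_size=256, verbose=0)
            val_loss = min(hist.history['val_loss'])
            if best is None or val_loss < best[0]:
                best = (val_loss, nn_size, nn_reg_penalty, nn_dropout)
print(best)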

In [11]:
display(Image('./grid.png', width=368))



Pretty good result:

_________________________________________________________________
Layer (type)                 Output Shape              Param #   
=================================================================
dense_129 (Dense)            (None, 64)                135872    
_________________________________________________________________
dropout_65 (Dropout)         (None, 64)                0         
_________________________________________________________________
dense_130 (Dense)            (None, 1)                 65        
=================================================================
Total params: 135,937.0
Trainable params: 135,937.0
Non-trainable params: 0.0
_________________________________________________________________
None
12:38:45 Starting
12:38:45 epoch 0 of 200
Train on 130321 samples, validate on 43440 samples
Epoch 1/1
130321/130321 [==============================] - 9s - loss: 0.5374 - acc: 0.9660 - f_score: 0.0077 - val_loss: 0.2127 - val_acc: 0.9724 - val_f_score: 0.0000e+00
...
13:05:05 epoch 199 of 200
Train on 130321 samples, validate on 43440 samples
Epoch 1/1
130321/130321 [==============================] - 7s - loss: 0.0774 - acc: 0.9852 - f_score: 0.6018 - val_loss: 0.0730 - val_acc: 0.9864 - val_f_score: 0.6486
Best Xval loss epoch 133, value 0.072672
NN units 64
Reg_penalty 0.00010000
Dropout 0.5000
Final Train Accuracy 0.988, Train F1 0.760, f_score 0.809 (beta 0.667)
[[126403   1266]
 [   253   2399]]
Final Xval Accuracy 0.987, Xval F1 0.725, f_score 0.767 (beta 0.667)
[[42100   438]
 [  140   762]]
Raw score 2 0.01431860


Test Accuracy 0.987, Test F1 0.731
[[42132   418]
 [  136   754]]

Motivation for LSTM

  • Pretty good result but…
  • Ideally I would like to train end to end, not pipeline-train the embeddings and the NN separately in a waterfall
  • Extend the existing architecture into an end-to-end network?
  • Embeddings -> neural network would need a layer that takes a headline and averages up its embedding vectors. Maybe there's a way? I don't know
  • The headline is a variable-length stream, which seems like a job for an LSTM
  • An LSTM can process a variable-length stream of tokens rather than an average (which may blur useful information), and it can use word order
  • Create a data dump:
1,domain_otherdomain,subsource_howardlindzon,subsource_jyarow,subsource_ReformedBroker,subsource_NickatFP,subsource_mathewi,subsource_othersubsource,subsource_LongShortTrader,subsource_DKThomp,subsource_Justin_B_Smith,source_Abnormal_Returns,what,go,happen,2017
    • label
    • domain, encoded as a word token
    • sources, encoded as word tokens
    • headline tokens
  • Effectively, turn the data row with its 1-hot vectors into a "sentence" of word tokens
  • Dump all the tokens into a big file (actually much smaller than the 1-hot vectors), train them with word2vec (possibly a redundant step); a tokenization sketch follows
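
A sketch of turning those token "sentences" into padded integer sequences for the Embedding layer in the model below, using the Keras tokenizer (the 120-token length and 10,000-word vocabulary match the figures above):

from keras.preprocessing.text import Tokenizer
from keras.preprocessing.sequence import pad_sequences

MAX_LENGTH = 120      # maximum tokens per row (metadata tokens + headline tokens)
VOCAB_SIZE = 10000    # keep the most frequent tokens

# token "sentence" for one row, label stripped (example row from above)
texts = ['domain_otherdomain subsource_howardlindzon source_Abnormal_Returns what go happen 2017']

tokenizer = Tokenizer(num_words=VOCAB_SIZE)
tokenizer.fit_on_texts(texts)
X = pad_sequences(tokenizer.texts_to_sequences(texts), maxlen=MAX_LENGTH)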

  • Create an LSTM model like this:

Layer (type)                 Output Shape              Param #   
=================================================================
embedding_1 (Embedding)      (None, 120, 300)          3000300   
_________________________________________________________________
lstm_1 (LSTM)                (None, 128)               219648    
_________________________________________________________________
dropout_1 (Dropout)          (None, 128)               0         
_________________________________________________________________
dense_1 (Dense)              (None, 1)                 129       
=================================================================
Total params: 3,220,077.0
Trainable params: 3,220,077.0
Non-trainable params: 0.0

In [ ]:
# function to generate model
# (dictionary, embedding_vector_length, embedding_matrix, and MAX_LENGTH are defined
#  earlier in the notebook; imports repeated here so the cell stands alone)

from keras.models import Sequential
from keras.layers import Embedding, LSTM, Dense, Dropout
from keras.regularizers import l1

def create_model(lstm_size=30, lstm_reg_penalty=0.0, sigmoid_dropout=(1.0/3.0), sigmoid_reg_penalty=0.0001):
    # create model
    model = Sequential()

    # embedding layer initialized from the word2vec vectors, fine-tuned during training
    model.add(Embedding(len(dictionary) + 1,
                        embedding_vector_length,
                        weights=[embedding_matrix],
                        input_length=MAX_LENGTH,
                        trainable=True))

    # LSTM with lstm_size units
    model.add(LSTM(lstm_size,
                   kernel_regularizer=l1(lstm_reg_penalty)))
    model.add(Dropout(sigmoid_dropout))

    # sigmoid output for the binary label
    model.add(Dense(1,
                    activation='sigmoid',
                    kernel_initializer='TruncatedNormal',
                    kernel_regularizer=l1(sigmoid_reg_penalty)))
    model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
    print(model.summary())
    return model
  • Hyperparameters:

    • LSTM network size
    • LSTM L1 regularization (didn't find it to be helpful)
    • Sigmoid dropout
    • Sigmoid L1 regularization
  • Do a grid search; it generally overfits after very few epochs (near-perfect accuracy in training, mediocre performance in xval/test), e.g.:

    Train Accuracy 0.999, Train F1 0.975, f_score 0.981 (beta 0.667)
    [[127430    144]
    [    30   3447]]
    LSTM units 16
    LSTM reg_penalty 0.00000000
    Sigmoid dropout 0.5000
    Sigmoid reg_penalty 0.00003000
    Xval Accuracy 0.981, Xval F1 0.627, f_score 0.670 (beta 0.667)
    [[42188   590]
    [  223   683]]
  • Instead, set the embeddings trainable=False

    • Still overfits; it just takes a little longer, a few more epochs
  • Finally

    • Run some epochs with embeddings trainable=True, pick model with minimum xval loss
    • Go back one epoch to hopefully avoid a model that's already overfitted
    • Freeze embeddings layer
      • set trainable = False
      • recompile
      • train further
      • pick model with minimum xval loss
    • Also doesn't work as well as the NN (a bit of a shock to me). Even if I stop training the embeddings before xval loss reaches a minimum, the training f-score is still > 0.9, indicating overfitting. Further training, after freezing the embeddings and only training the LSTM and sigmoid layers, usually makes xval loss worse (!?) than before freezing the embeddings. I could understand if training the whole model end to end reached a lower minimum xval loss, but not how training a subset of the parameters could make xval loss worse than before freezing the embeddings and resuming training.
  • Pretty good result, but not as good as the NN (a sketch of the freeze-and-continue recipe follows)
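
A minimal sketch of that freeze-and-continue recipe; epoch counts and data names are placeholders, and in practice each phase would checkpoint and keep the epoch with the minimum xval loss:

# phase 1: train end to end with trainable embeddings
model = create_model(lstm_size=16)
model.fit(X_train, y_train, validation_data=(X_val, y_val), epochs=5, batch_size=256)

# phase 2: freeze the embedding layer, recompile (required for the change to take effect),
# and keep training only the LSTM and sigmoid layers
model.layers[0].trainable = False
model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
model.fit(X_train, y_train, validation_data=(X_val, y_val), epochs=20, batch_size=256)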

  • As an aside, there is a mixed network in the keras examples: https://keras.io/getting-started/functional-api-guide/#multi-input-and-multi-output-models
  • Feed the headline embeddings only to the LSTM
  • Take the output of the LSTM and concatenate it with the one-hot vectors
  • Feed the resulting vector, combining a low-dimensional LSTM digest of the headline with the metadata, to an NN or sigmoid layer
  • This would take e.g. the order of sharing out of the equation
  • Tried this… inferior result so far, even though it is similar to the NN, just substituting the LSTM digest for the average of the headline embeddings; it should be more predictive, since it trains the embeddings and LSTM against the labels (a functional-API sketch follows)
  • So it goes… ¯\_(ツ)_/¯
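
A rough sketch of that mixed model with the Keras functional API; the LSTM size, dropout, and metadata width are stand-ins, and dictionary, embedding_matrix, embedding_vector_length, and MAX_LENGTH are from the earlier cells:

from keras.models import Model
from keras.layers import Input, Embedding, LSTM, Dense, Dropout, concatenate

# headline branch: embeddings -> LSTM digest
headline_in = Input(shape=(MAX_LENGTH,), name='headline')
x = Embedding(len(dictionary) + 1, embedding_vector_length,
              weights=[embedding_matrix], trainable=True)(headline_in)
x = LSTM(16)(x)

# metadata branch: the ~1800 one-hot columns, fed in directly
meta_in = Input(shape=(1800,), name='metadata')

# combine the LSTM digest with the metadata and classify
merged = Dropout(0.5)(concatenate([x, meta_in]))
out = Dense(1, activation='sigmoid')(merged)

model = Model(inputs=[headline_in, meta_in], outputs=out)
model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])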

Takeaways

  • LSTM works
    • Fits very well to training data pretty quickly
    • Gets a pretty good result
  • Overfitting is the issue here - very high training f-score
  • The good thing about LSTM is it uses the entire variable length input, including word order.
  • The bad thing is … it uses the entire variable length input, including word order. In this case I think it's able to associate sequences of words with specific rows and labels in a way that doesn't generalize. The 'implicit dimensionality' of an ordered sequence of tokens is very large and you run out of degrees of freedom.
  • L1 regularization not helpful
  • Can reduce the network size, but that seems to worsen the result
  • It's key (as always) to stop training when xval loss stops improving (see the callback sketch below)
  • Feedforward NN: better result and a much simpler, faster model
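
In Keras that is typically handled with callbacks; a sketch, assuming the model and data splits from above:

from keras.callbacks import EarlyStopping, ModelCheckpoint

callbacks = [
    # stop once xval loss has not improved for a few epochs
    EarlyStopping(monitor='val_loss', patience=5),
    # keep the weights from the best epoch seen so far
    ModelCheckpoint('best_model.h5', monitor='val_loss', save_best_only=True),
]
model.fit(X_train, y_train, validation_data=(X_val, y_val),
          epochs=200, batch_size=256, callbacks=callbacks)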
