# Sentiment Analysis with an RNN

In this notebook, you'll implement a recurrent neural network that performs sentiment analysis. Using an RNN rather than a feedforward network is more accurate since we can include information about the sequence of words. Here we'll use a dataset of movie reviews, accompanied by sentiment labels.

The architecture for this network is shown below.

Here, we'll pass in words to an embedding layer. We need an embedding layer because we have tens of thousands of words, so we need a more efficient representation for our input data than one-hot encoded vectors. You should have seen this before in the word2vec lesson. You can actually train up an embedding with word2vec and use it here, but it's good enough to just have an embedding layer and let the network learn the embedding table on its own.
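To make that concrete, here's a minimal sketch (not part of the notebook's code, with made-up toy sizes) of what an embedding layer does: it's just a trainable table of vectors indexed by word id, which is mathematically the same as multiplying a one-hot vector by a weight matrix, only without materializing the one-hot vectors.

import numpy as np

vocab_size, embed_dim = 10, 4                              # toy sizes; the real vocab here is ~74k words
embedding_table = np.random.randn(vocab_size, embed_dim)   # in the network this table is learned

word_ids = np.array([3, 7, 7, 1])                          # a short "review" as integer word ids
embedded = embedding_table[word_ids]                       # lookup -> shape (4, embed_dim)

# equivalent but wasteful one-hot formulation
one_hot = np.eye(vocab_size)[word_ids]                     # shape (4, vocab_size)
assert np.allclose(one_hot @ embedding_table, embedded)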

From the embedding layer, the new representations will be passed to LSTM cells. These add recurrent connections to the network so we can include information about the sequence of words in the data. Finally, the LSTM cells feed into a sigmoid output layer. We're using a sigmoid because we're trying to predict whether the text has positive or negative sentiment, so the output layer is just a single unit with a sigmoid activation function.

We don't care about the sigmoid outputs except for the very last one; we can ignore the rest. We'll calculate the cost from the output of the last step and the training label.
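As a rough sketch of the architecture just described (layer sizes here are placeholders; the real model is built cell by cell below), the whole thing boils down to three Keras layers, with `return_sequences=False` so that only the last time step's output reaches the sigmoid unit:

from keras.models import Sequential
from keras.layers import Embedding, LSTM, Dense

sketch = Sequential()
sketch.add(Embedding(input_dim=10000, output_dim=300, input_length=200))  # word ids -> dense vectors
sketch.add(LSTM(256, return_sequences=False))   # keep only the final time step's output
sketch.add(Dense(1, activation='sigmoid'))      # single unit: probability of positive sentiment
sketch.compile(loss='binary_crossentropy', optimizer='adam')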


In [1]:
import numpy as np
import tensorflow as tf

In [2]:
with open('../sentiment-network/reviews.txt', 'r') as f:
    reviews = f.read()
with open('../sentiment-network/labels.txt', 'r') as f:
    labels = f.read()

reviews[:300]


## Data preprocessing

The first step when building a neural network model is getting your data into the proper form to feed into the network. Since we're using embedding layers, we'll need to encode each word with an integer. We'll also want to clean it up a bit.

You can see an example of the reviews data above. We'll want to get rid of those periods. Also, you might notice that the reviews are delimited with newlines `\n`. To deal with those, I'm going to split the text into each review using `\n` as the delimiter. Then I can combine all the reviews back together into one big string.

First, let's remove all punctuation. Then get all the text without the newlines and split it into individual words.


Out[2]:
'bromwell high is a cartoon comedy . it ran at the same time as some other programs about school life  such as  teachers  . my   years in the teaching profession lead me to believe that bromwell high  s satire is much closer to reality than is  teachers  . the scramble to survive financially  the ins'

In [3]:
from string import punctuation
all_text = ''.join([c for c in reviews if c not in punctuation])
reviews = all_text.split('\n')

# list of strings, one item per review
print(reviews[:3])

# one big string of all words
all_text = ' '.join(reviews)
words = all_text.split()


['bromwell high is a cartoon comedy  it ran at the same time as some other programs about school life  such as  teachers   my   years in the teaching profession lead me to believe that bromwell high  s satire is much closer to reality than is  teachers   the scramble to survive financially  the insightful students who can see right through their pathetic teachers  pomp  the pettiness of the whole situation  all remind me of the schools i knew and their students  when i saw the episode in which a student repeatedly tried to burn down the school  i immediately recalled          at           high  a classic line inspector i  m here to sack one of your teachers  student welcome to bromwell high  i expect that many adults of my age think that bromwell high is far fetched  what a pity that it isn  t   ', 'story of a man who has unnatural feelings for a pig  starts out with a opening scene that is a terrific example of absurd comedy  a formal orchestra audience is turned into an insane  violent mob by the crazy chantings of it  s singers  unfortunately it stays absurd the whole time with no general narrative eventually making it just too off putting  even those from the era should be turned off  the cryptic dialogue would make shakespeare seem easy to a third grader  on a technical level it  s better than you might think with some good cinematography by future great vilmos zsigmond  future stars sally kirkland and frederic forrest can be seen briefly   ', 'homelessness  or houselessness as george carlin stated  has been an issue for years but never a plan to help those on the street that were once considered human who did everything from going to school  work  or vote for the matter  most people think of the homeless as just a lost cause while worrying about things such as racism  the war on iraq  pressuring kids to succeed  technology  the elections  inflation  or worrying if they  ll be next to end up on the streets   br    br   but what if you were given a bet to live on the streets for a month without the luxuries you once had from a home  the entertainment sets  a bathroom  pictures on the wall  a computer  and everything you once treasure to see what it  s like to be homeless  that is goddard bolt  s lesson   br    br   mel brooks  who directs  who stars as bolt plays a rich man who has everything in the world until deciding to make a bet with a sissy rival  jeffery tambor  to see if he can live in the streets for thirty days without the luxuries if bolt succeeds  he can do what he wants with a future project of making more buildings  the bet  s on where bolt is thrown on the street with a bracelet on his leg to monitor his every move where he can  t step off the sidewalk  he  s given the nickname pepto by a vagrant after it  s written on his forehead where bolt meets other characters including a woman by the name of molly  lesley ann warren  an ex  dancer who got divorce before losing her home  and her pals sailor  howard morris  and fumes  teddy wilson  who are already used to the streets  they  re survivors  bolt isn  t  he  s not used to reaching mutual agreements like he once did when being rich where it  s fight or flight  kill or be killed   br    br   while the love connection between molly and bolt wasn  t necessary to plot  i found  life stinks  to be one of mel brooks  observant films where prior to being a comedy  it shows a tender side compared to his slapstick work such as blazing saddles  young frankenstein  or spaceballs for the matter  to show what it  s like having 
something valuable before losing it the next day or on the other hand making a stupid bet like all rich people do when they don  t know what to do with their money  maybe they should give it to the homeless instead of using it like monopoly money   br    br   or maybe this film will inspire you to help others   ']

In [4]:
print(all_text[:2000])

# list of all words, one item per word (len = 6020196)
print(words[:100])
print(len(words))


bromwell high is a cartoon comedy  it ran at the same time as some other programs about school life  such as  teachers   my   years in the teaching profession lead me to believe that bromwell high  s satire is much closer to reality than is  teachers   the scramble to survive financially  the insightful students who can see right through their pathetic teachers  pomp  the pettiness of the whole situation  all remind me of the schools i knew and their students  when i saw the episode in which a student repeatedly tried to burn down the school  i immediately recalled          at           high  a classic line inspector i  m here to sack one of your teachers  student welcome to bromwell high  i expect that many adults of my age think that bromwell high is far fetched  what a pity that it isn  t    story of a man who has unnatural feelings for a pig  starts out with a opening scene that is a terrific example of absurd comedy  a formal orchestra audience is turned into an insane  violent mob by the crazy chantings of it  s singers  unfortunately it stays absurd the whole time with no general narrative eventually making it just too off putting  even those from the era should be turned off  the cryptic dialogue would make shakespeare seem easy to a third grader  on a technical level it  s better than you might think with some good cinematography by future great vilmos zsigmond  future stars sally kirkland and frederic forrest can be seen briefly    homelessness  or houselessness as george carlin stated  has been an issue for years but never a plan to help those on the street that were once considered human who did everything from going to school  work  or vote for the matter  most people think of the homeless as just a lost cause while worrying about things such as racism  the war on iraq  pressuring kids to succeed  technology  the elections  inflation  or worrying if they  ll be next to end up on the streets   br    br   but what if you were given a bet to live on the st
['bromwell', 'high', 'is', 'a', 'cartoon', 'comedy', 'it', 'ran', 'at', 'the', 'same', 'time', 'as', 'some', 'other', 'programs', 'about', 'school', 'life', 'such', 'as', 'teachers', 'my', 'years', 'in', 'the', 'teaching', 'profession', 'lead', 'me', 'to', 'believe', 'that', 'bromwell', 'high', 's', 'satire', 'is', 'much', 'closer', 'to', 'reality', 'than', 'is', 'teachers', 'the', 'scramble', 'to', 'survive', 'financially', 'the', 'insightful', 'students', 'who', 'can', 'see', 'right', 'through', 'their', 'pathetic', 'teachers', 'pomp', 'the', 'pettiness', 'of', 'the', 'whole', 'situation', 'all', 'remind', 'me', 'of', 'the', 'schools', 'i', 'knew', 'and', 'their', 'students', 'when', 'i', 'saw', 'the', 'episode', 'in', 'which', 'a', 'student', 'repeatedly', 'tried', 'to', 'burn', 'down', 'the', 'school', 'i', 'immediately', 'recalled', 'at', 'high']
6020196

In [5]:
# Create your dictionary that maps vocab words to integers here
# word set, len=74072
vocab_set = set(words)
print(len(vocab_set))

# dict mapping vocab words to ints; values start at 1, leaving 0 for padding
vocab_to_int = {word : i+1 for i, word in enumerate(vocab_set)}

# Convert the reviews to integers, same shape as the reviews list, but with integers
# list of reviews, one item per review, with each word replaced by its integer id
reviews_ints = []
for review in reviews:
    words_in_one_review = review.split()
    one_review_int = []
    for word in words_in_one_review:
        word_int = vocab_to_int[word]
        one_review_int.append(word_int)
    reviews_ints.append(one_review_int)
    
print(reviews_ints[0:2])
print(len(reviews_ints))


74072
[[68529, 5510, 18022, 21997, 28466, 58796, 60701, 32113, 30822, 475, 42856, 10834, 26453, 22145, 13933, 35650, 57687, 4209, 4292, 60115, 26453, 18611, 64775, 43320, 11159, 475, 44869, 23619, 38146, 7031, 3484, 17667, 23693, 68529, 5510, 59484, 29246, 18022, 12899, 14878, 3484, 43136, 37605, 18022, 18611, 475, 7172, 3484, 73693, 403, 475, 42157, 57230, 34202, 57663, 25044, 59721, 40620, 66946, 28205, 18611, 63534, 475, 58852, 41537, 475, 39908, 60798, 42064, 33508, 7031, 41537, 475, 7524, 69997, 71159, 50273, 66946, 57230, 20193, 69997, 45854, 475, 24869, 11159, 12627, 21997, 17453, 44287, 50518, 3484, 45430, 41709, 475, 4209, 69997, 38771, 72309, 30822, 5510, 21997, 5381, 58747, 55177, 69997, 49841, 13042, 3484, 11812, 57244, 41537, 9354, 18611, 17453, 51146, 3484, 68529, 5510, 69997, 37448, 23693, 26501, 54364, 41537, 64775, 40038, 66356, 23693, 68529, 5510, 18022, 11478, 17950, 62882, 21997, 5927, 23693, 60701, 36178, 46316], [63145, 41537, 21997, 35484, 34202, 9367, 4600, 69172, 23843, 21997, 64290, 13009, 47171, 52935, 21997, 29503, 46465, 23693, 18022, 21997, 3731, 30155, 41537, 11277, 58796, 21997, 49485, 17196, 56933, 18022, 68619, 49368, 53603, 26618, 19865, 64215, 74051, 475, 2467, 3443, 41537, 60701, 59484, 56840, 62422, 60701, 59132, 11277, 475, 39908, 10834, 52935, 53989, 2533, 70413, 51460, 56878, 60701, 8972, 65279, 51366, 66111, 13575, 62880, 37707, 475, 68498, 49165, 1238, 68619, 51366, 475, 49607, 32043, 40861, 53694, 10042, 4201, 49869, 3484, 21997, 73629, 6835, 17454, 21997, 30347, 49120, 60701, 59484, 18461, 37605, 18137, 34432, 66356, 52935, 22145, 60719, 50661, 74051, 40528, 53615, 20633, 15500, 40528, 54237, 29925, 27990, 50273, 422, 11076, 57663, 1238, 42466, 37422]]
25001

In [6]:
# split labels
labels = labels.split('\n')
len(labels)


Out[6]:
25001

In [7]:
from collections import Counter
review_lens = Counter([len(x) for x in reviews_ints])
print("Zero-length reviews: {}".format(review_lens[0]))
print("Maximum review length: {}".format(max(review_lens)))


Zero-length reviews: 1
Maximum review length: 2514

In [8]:
# Filter out that review with 0 length
for i, e in enumerate(reviews_ints):
    if(len(e) == 0):
        print(i)
        reviews_ints.remove(e)
        del labels[i]
        break

# make sure reviews and labels are still the same size
print(len(reviews_ints))
print(len(labels))


25000
25000
25000

In [9]:
# truncate reviews longer than seq_len
# left-pad reviews shorter than seq_len with 0s

seq_len = 200
# 2-D np array, shape=(num_reviews, seq_len)
features = np.zeros(shape=(len(reviews_ints), seq_len))
for i, e in enumerate(reviews_ints):
#    e = np.asarray(e)
#    print(i, e[0:3])
    if(len(e) >= seq_len):
        features[i,:] = e[0:seq_len]
    else:
        features[i,seq_len-len(e):] = e
        
print(features[:10,:100])
print(features.shape)


[[     0.      0.      0.      0.      0.      0.      0.      0.      0.
       0.      0.      0.      0.      0.      0.      0.      0.      0.
       0.      0.      0.      0.      0.      0.      0.      0.      0.
       0.      0.      0.      0.      0.      0.      0.      0.      0.
       0.      0.      0.      0.      0.      0.      0.      0.      0.
       0.      0.      0.      0.      0.      0.      0.      0.      0.
       0.      0.      0.      0.      0.      0.  68529.   5510.  18022.
   21997.  28466.  58796.  60701.  32113.  30822.    475.  42856.  10834.
   26453.  22145.  13933.  35650.  57687.   4209.   4292.  60115.  26453.
   18611.  64775.  43320.  11159.    475.  44869.  23619.  38146.   7031.
    3484.  17667.  23693.  68529.   5510.  59484.  29246.  18022.  12899.
   14878.]
 [     0.      0.      0.      0.      0.      0.      0.      0.      0.
       0.      0.      0.      0.      0.      0.      0.      0.      0.
       0.      0.      0.      0.      0.      0.      0.      0.      0.
       0.      0.      0.      0.      0.      0.      0.      0.      0.
       0.      0.      0.      0.      0.      0.      0.      0.      0.
       0.      0.      0.      0.      0.      0.      0.      0.      0.
       0.      0.      0.      0.      0.      0.      0.      0.      0.
       0.      0.      0.      0.      0.      0.      0.      0.      0.
       0.      0.      0.      0.      0.      0.      0.      0.      0.
       0.      0.      0.      0.      0.  63145.  41537.  21997.  35484.
   34202.   9367.   4600.  69172.  23843.  21997.  64290.  13009.  47171.
   52935.]
 [ 29156.  30378.  44599.  26453.  19600.  72077.  58306.   9367.   2696.
   53603.  19944.  23843.  43320.  41422.  28513.  21997.  22161.   3484.
   34089.  62880.  17454.    475.  30789.  23693.  43706.  73856.  29379.
   20873.  34202.  74022.  18693.  37707.  47389.   3484.   4209.  18254.
   30378.  33119.  23843.    475.  69173.  45625.  21751.  66356.  41537.
     475.   5296.  26453.   8972.  21997.  45594.  10471.  68865.  59847.
   57687.  23130.  60115.  26453.  26630.    475.  73921.  17454.  12031.
    2910.  67483.   3484.  58232.   5107.    475.  63177.  56477.  30378.
   59847.  33803.  57475.  56018.   1238.  33613.   3484.  60180.  13395.
   17454.    475.  63099.  24052.  24052.  41422.  62882.  33803.  18137.
   43706.  45715.  21997.  69552.   3484.  36225.  17454.    475.  63099.
   23843.]
 [ 44576.  13009.  26453.  21997.  72337.  27634.  26017.   4538.  18022.
   34325.  13395.  52935.  18476.  45370.  60115.  51474.   3484.  36348.
   59395.  44516.  47905.  50428.  25747.  34202.  18022.  47625.  61373.
   21997.  41057.  41537.  20767.  59484.   3484.  70754.  33558.  11159.
   67749.  41537.  60701.   7972.  45154.   3484.    475.  13708.  26453.
   21997.  71992.  45271.  17454.  17797.  18022.  47905.  55211.  44473.
   72348.  51702.  36880.  49010.    475.  26017.  47035.  25713.  51366.
   26453.  39868.  41422.  16074.  44381.    475.   4538.  18022.  24521.
   12367.  74051.    475.  17533.  63600.  59083.  67649.  10778.  70754.
   30239.  36433.  59484.  67603.  32137.  47632.   3992.  11427.  58418.
   34202.  73078.    475.  74010.  62856.  47171.  52935.   3387.  18288.
   57475.]
 [     0.      0.      0.      0.      0.      0.      0.      0.      0.
       0.      0.      0.      0.      0.      0.      0.      0.      0.
       0.      0.      0.      0.      0.      0.      0.      0.      0.
       0.      0.      0.      0.      0.      0.      0.      0.      0.
       0.      0.      0.      0.      0.      0.      0.      0.      0.
       0.      0.      0.      0.      0.  36277.  33734.   2461.  74051.
   63474.  27591.  56351.  37487.  46431.  72426.  45898.  69997.  43515.
    1986.  42466.  50273.  67663.  50521.  11159.  37075.   6352.  46914.
   23651.   3484.  67772.    475.  47918.  17454.  65538.  18022.  21997.
    5381.  26453.  60719.  26453.  33643.  11159.  31487.  62144.    475.
   18358.  17454.  45867.  18022.  45271.    268.  38444.   7972.  67977.
   41537.]
 [     0.      0.      0.      0.      0.      0.      0.      0.      0.
       0.      0.      0.      0.      0.      0.      0.      0.      0.
       0.      0.      0.      0.      0.      0.      0.      0.      0.
       0.      0.      0.      0.      0.      0.      0.      0.      0.
       0.      0.      0.      0.      0.      0.      0.      0.      0.
       0.      0.      0.      0.      0.      0.      0.      0.      0.
   43636.   3593.   9312.  30656.  69997.  15142.  46316.  57346.  64775.
   55552.  17454.  30822.   9169.  67292.  17454.    475.  32440.  41537.
     475.  31308.  52638.  43636.  65741.  14640.   3484.  15453.  41537.
   16661.  20193.   9396.   7045.    475.  11375.  52935.  36880.  31308.
   35484.  13575.    475.  71473.  50521.  33323.  45207.  26453.   7972.
   45457.]
 [     0.      0.      0.      0.      0.      0.      0.      0.      0.
       0.      0.      0.      0.      0.      0.      0.      0.      0.
       0.      0.      0.      0.      0.      0.      0.      0.      0.
       0.      0.      0.      0.      0.      0.      0.      0.      0.
       0.      0.      0.      0.      0.      0.      0.      0.      0.
       0.      0.      0.      0.      0.      0.      0.      0.      0.
       0.      0.      0.      0.      0.      0.      0.      0.      0.
       0.      0.      0.      0.      0.      0.      0.      0.      0.
       0.      0.      0.      0.  43636.  18022.  12008.    475.  45625.
   69281.   3593.  51404.    475.  73979.  58731.   4462.  66142.  65859.
   60701.  31608.  23818.  50312.  21997.   5043.  41255.  41537.  29156.
   40596.]
 [     0.      0.      0.      0.      0.      0.      0.      0.      0.
       0.      0.      0.      0.      0.      0.      0.      0.      0.
       0.      0.      0.      0.      0.      0.      0.      0.      0.
       0.      0.      0.      0.      0.      0.      0.      0.      0.
       0.      0.      0.      0.      0.      0.      0.      0.      0.
       0.      0.  40782.  18124.  69997.  34603.  43636.  18022.  70707.
    3484.   1238.  53603.   3048.   3593.  41422.  19073.  57475.  49165.
   43515.  22329.  47171.  10384.  30822.    475.  51649.  62096.  21751.
   28823.  16981.  66946.  23092.  47171.  50273.  23818.  17865.  59786.
     475.  46465.  26635.  50273.  48485.  39090.  29181.  31382.  43636.
   63145.  18022.  65279.  45845.   3484.  17865.    475.  12487.  41537.
   21997.]
 [     0.      0.      0.      0.      0.      0.      0.      0.      0.
       0.      0.      0.      0.      0.      0.      0.      0.      0.
       0.      0.      0.      0.      0.      0.      0.      0.      0.
       0.      0.      0.      0.      0.      0.      0.      0.      0.
       0.      0.      0.      0.      0.      0.      0.      0.      0.
       0.      0.      0.      0.      0.      0.      0.      0.      0.
       0.      0.      0.      0.      0.      0.      0.      0.      0.
       0.      0.      0.      0.      0.      0.      0.      0.      0.
       0.      0.      0.      0.      0.      0.      0.  43636.  18022.
   23818.    475.  28522.   9316.  73979.   3593.  60701.  29181.  12899.
   65213.  50285.  37605.  45625.  41537.  70754.  50301.  50273.  63450.
   63830.]
 [ 20193.  69997.  29181.  31005.  64775.  70475.  51953.   7031.  26337.
    3484.    475.  48402.   3484.  25044.  43176.  60701.  29181.  57244.
   41537.  26501.  50301.  69997.  56532.  52935.  64775.  70475.  41422.
   43636.  29181.    475.  33802.  57244.  11015.  48836.  47171.  41537.
   63653.   2566.  69997.  63830.  28513.  42466.  43176.  12636.   8972.
   59943.  50273.  69997.  28823.  43515.  13334.  47171.    475.  59989.
   41537.  64775.   4292.  14391.  60701.  62882.  21997.  28627.  17462.
   50273.  14448.  51316.   7866.  41537.  59484.  40645.  50273.  25332.
   22184.  17168.  16265.  18022.  57244.  41537.  64775.  30270.  58052.
   41422.  43176.  18022.  74051.  11478.    475.  40710.   7866.  41537.
   62816.  41537.  70754.  38030.  11159.    475.  58616.  23646.  41537.
   34242.]]
(25000, 200)

In [10]:
# construct labels array with 0,1
dict_label = {'positive':1, 'negative':0}
labels_int = np.asarray([dict_label[word] for word in labels])
labels_int[:10]


Out[10]:
array([1, 0, 1, 0, 1, 0, 1, 0, 1, 0])

In [11]:
# split the data set into train, validation, and test sets

split_frac = 0.8

train_end_index = int(len(features) * split_frac)
train_x, val_x = features[: train_end_index, :], features[train_end_index :, :]
train_y, val_y = labels_int[: train_end_index], labels_int[train_end_index :]

valid_end_index = train_end_index + int(len(val_x) * 0.5)
val_x, test_x = features[train_end_index : valid_end_index, :], features[valid_end_index :, :]
val_y, test_y = labels_int[train_end_index : valid_end_index], labels_int[valid_end_index :]

print("\t\t\tFeature Shapes:")
print("Train set: \t\t{}".format(train_x.shape), 
      "\nValidation set: \t{}".format(val_x.shape),
      "\nTest set: \t\t{}".format(test_x.shape))
print("\t\t\tlabels")
print("Train set: \t\t{}".format(train_y.shape), 
      "\nValidation set: \t{}".format(val_y.shape),
      "\nTest set: \t\t{}".format(test_y.shape))


			Feature Shapes:
Train set: 		(20000, 200) 
Validation set: 	(2500, 200) 
Test set: 		(2500, 200)
			labels
Train set: 		(20000,) 
Validation set: 	(2500,) 
Test set: 		(2500,)

In [35]:
# parameters
lstm_size = 256
lstm_layers = 1
dropout_rate = 0.5
batch_size = 500
learning_rate = 0.0005
epochs = 3

# vocab size
n_words = len(vocab_to_int)
# Size of the embedding vectors (number of units in the embedding layer)
embed_size = 300

In [36]:
from keras.layers.embeddings import Embedding
from keras.models import Sequential
from keras.layers.recurrent import LSTM
from keras.layers import Dense, Dropout

# model
model = Sequential()

# Embedding layer
# input_dim: vocab_size + 1
# output_dim: the embed_size of output
# input_length: seq_len
model.add(Embedding(input_dim=n_words+1, output_dim=embed_size, input_length=seq_len))

# the model will take as input an integer matrix of size (batch, input_length).
# the largest integer (i.e. word index) in the input should be no larger than input_dim.
# now model.output_shape == (None, input_length, output_dim), where None is the batch dimension.

In [ ]:
## unit test, test embedding layer --------------------
## (batch, input_length)
#batch_size = 32
#input_array = np.random.randint(n_words, size=(batch_size, seq_len))
#
#model.compile('rmsprop', 'mse')
#output_array = model.predict(input_array)
#print(output_array.shape)
#assert output_array.shape == (batch_size, seq_len, embed_size)
## unit test end --------------------

In [37]:
#model.add(LSTM(
#    input_dim=embed_size,
#    output_dim=lstm_size,
#    return_sequences=True))
model.add(LSTM(
    input_dim=embed_size,
    output_dim=lstm_size,
    return_sequences=False))
        
#model.add(LSTM(
#    lstm_size,
#    return_sequences=False))
model.add(Dropout(dropout_rate))

#model.add(Dense(20))
model.add(Dense(output_dim=1, activation='sigmoid'))

#model.compile(loss='mse', optimizer='rmsprop', metrics=['accuracy'])
#model.compile(loss='binary_crossentropy', optimizer='rmsprop', metrics=['accuracy'])
#model.compile(loss='mse', optimizer='adam', metrics=['accuracy'])
model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])


C:\Develop\Anaconda3\envs\py35\lib\site-packages\ipykernel_launcher.py:21: UserWarning: The `input_dim` and `input_length` arguments in recurrent layers are deprecated. Use `input_shape` instead.
C:\Develop\Anaconda3\envs\py35\lib\site-packages\ipykernel_launcher.py:21: UserWarning: Update your `LSTM` call to the Keras 2 API: `LSTM(input_shape=(None, 300..., return_sequences=False, units=256)`
C:\Develop\Anaconda3\envs\py35\lib\site-packages\ipykernel_launcher.py:29: UserWarning: Update your `Dense` call to the Keras 2 API: `Dense(activation="sigmoid", units=1)`

In [38]:
# training
history = model.fit(
    train_x,
    train_y,
    batch_size=batch_size,
    nb_epoch=epochs,
    validation_data=(val_x, val_y))


C:\Develop\Anaconda3\envs\py35\lib\site-packages\keras\models.py:837: UserWarning: The `nb_epoch` argument in `fit` has been renamed `epochs`.
  warnings.warn('The `nb_epoch` argument in `fit` '
Train on 20000 samples, validate on 2500 samples
Epoch 1/3
20000/20000 [==============================] - 43s - loss: 0.7218 - acc: 0.5787 - val_loss: 0.6731 - val_acc: 0.5888
Epoch 2/3
20000/20000 [==============================] - 39s - loss: 0.5662 - acc: 0.7220 - val_loss: 0.4716 - val_acc: 0.7852
Epoch 3/3
20000/20000 [==============================] - 39s - loss: 0.2557 - acc: 0.9000 - val_loss: 0.4888 - val_acc: 0.8048

In [39]:
# summarize loss and accuracy

# keys available in the training history
print(history.history.keys())

import matplotlib.pyplot as plt
# summarize history for loss
plt.figure()
plt.plot(history.history['loss'])
plt.plot(history.history['val_loss'])
plt.title('model loss')
plt.ylabel('loss')
plt.xlabel('epoch')
plt.legend(['train', 'valid'], loc='upper left')
plt.show()

# summarize history for accuracy
plt.figure()
plt.plot(history.history['acc'])
plt.plot(history.history['val_acc'])
plt.title('model accuracy')
plt.ylabel('accuracy')
plt.xlabel('epoch')
plt.legend(['train', 'valid'], loc='upper left')


dict_keys(['val_loss', 'loss', 'acc', 'val_acc'])
Out[39]:
<matplotlib.legend.Legend at 0x1c488edc160>

In [40]:
# evaluate on test data
#pred_test = model.predict(test_x)
scores = model.evaluate(test_x, test_y, verbose=0)
print("Accuracy: %.2f%%" % (scores[1]*100))


Accuracy: 81.32%
