Dense Sentiment Classifier

In this notebook, we build a dense neural net to classify IMDB movie reviews by their sentiment.

Load dependencies


In [1]:
import keras
from keras.datasets import imdb
from keras.preprocessing.sequence import pad_sequences
from keras.models import Sequential
from keras.layers import Dense, Flatten, Dropout
from keras.layers import Embedding # new!
from keras.callbacks import ModelCheckpoint # new! 
import os # new! 
from sklearn.metrics import roc_auc_score, roc_curve # new!
import pandas as pd
import matplotlib.pyplot as plt # new!
%matplotlib inline


Using TensorFlow backend.

Set hyperparameters


In [2]:
# output directory name:
output_dir = 'model_output/dense'

# training:
epochs = 4
batch_size = 128

# vector-space embedding: 
n_dim = 64
n_unique_words = 5000 # as per Maas et al. (2011); may not be optimal
n_words_to_skip = 50 # ditto
max_review_length = 100
pad_type = trunc_type = 'pre'

# neural network architecture: 
n_dense = 64
dropout = 0.5

Load data

For a given data set:

  • the Keras text utilities quickly preprocess natural language and convert it into an integer index
  • the keras.preprocessing.text.Tokenizer class may do everything you need in one line:
    • tokenize into words or characters
    • num_words: cap on the number of unique tokens
    • filter out punctuation
    • convert to lower case
    • convert words to an integer index
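The steps above can be sketched in plain Python. This is a minimal illustration of what the Tokenizer does internally, not the actual Keras implementation: lower-case, strip punctuation, rank words by frequency, and map each word to its rank (1 = most common), optionally capped at num_words.

```python
import re
from collections import Counter

def fit_word_index(texts, num_words=None):
    """Rank words by frequency and assign integer indices (1 = most common).
    A rough sketch of keras.preprocessing.text.Tokenizer's fitting step."""
    counts = Counter()
    for text in texts:
        # lower-case, strip punctuation (keeping apostrophes), split on whitespace
        counts.update(re.sub(r"[^\w\s']", " ", text.lower()).split())
    ranked = [w for w, _ in counts.most_common(num_words)]
    return {w: i + 1 for i, w in enumerate(ranked)}

def texts_to_sequences(texts, word_index):
    """Convert each text into a list of integer indices, dropping unknown words."""
    return [[word_index[w]
             for w in re.sub(r"[^\w\s']", " ", t.lower()).split()
             if w in word_index]
            for t in texts]

docs = ["A fine film.", "A fine, fine cast!"]
idx = fit_word_index(docs)
print(texts_to_sequences(docs, idx))  # 'fine' occurs most often, so it gets index 1
```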

In [3]:
(x_train, y_train), (x_valid, y_valid) = imdb.load_data(num_words=n_unique_words, skip_top=n_words_to_skip)

In [4]:
x_train[0:6] # 0 is reserved for padding; 1 marks the start of a review; 2 is unknown (here also covering the 50 most frequent words, which we skip); 4 is the most common word, etc.


Out[4]:
array([ [2, 2, 2, 2, 2, 530, 973, 1622, 1385, 65, 458, 4468, 66, 3941, 2, 173, 2, 256, 2, 2, 100, 2, 838, 112, 50, 670, 2, 2, 2, 480, 284, 2, 150, 2, 172, 112, 167, 2, 336, 385, 2, 2, 172, 4536, 1111, 2, 546, 2, 2, 447, 2, 192, 50, 2, 2, 147, 2025, 2, 2, 2, 2, 1920, 4613, 469, 2, 2, 71, 87, 2, 2, 2, 530, 2, 76, 2, 2, 1247, 2, 2, 2, 515, 2, 2, 2, 626, 2, 2, 2, 62, 386, 2, 2, 316, 2, 106, 2, 2, 2223, 2, 2, 480, 66, 3785, 2, 2, 130, 2, 2, 2, 619, 2, 2, 124, 51, 2, 135, 2, 2, 1415, 2, 2, 2, 2, 215, 2, 77, 52, 2, 2, 407, 2, 82, 2, 2, 2, 107, 117, 2, 2, 256, 2, 2, 2, 3766, 2, 723, 2, 71, 2, 530, 476, 2, 400, 317, 2, 2, 2, 2, 1029, 2, 104, 88, 2, 381, 2, 297, 98, 2, 2071, 56, 2, 141, 2, 194, 2, 2, 2, 226, 2, 2, 134, 476, 2, 480, 2, 144, 2, 2, 2, 51, 2, 2, 224, 92, 2, 104, 2, 226, 65, 2, 2, 1334, 88, 2, 2, 283, 2, 2, 4472, 113, 103, 2, 2, 2, 2, 2, 178, 2],
       [2, 194, 1153, 194, 2, 78, 228, 2, 2, 1463, 4369, 2, 134, 2, 2, 715, 2, 118, 1634, 2, 394, 2, 2, 119, 954, 189, 102, 2, 207, 110, 3103, 2, 2, 69, 188, 2, 2, 2, 2, 2, 249, 126, 93, 2, 114, 2, 2300, 1523, 2, 647, 2, 116, 2, 2, 2, 2, 229, 2, 340, 1322, 2, 118, 2, 2, 130, 4901, 2, 2, 1002, 2, 89, 2, 952, 2, 2, 2, 455, 2, 2, 2, 2, 1543, 1905, 398, 2, 1649, 2, 2, 2, 163, 2, 3215, 2, 2, 1153, 2, 194, 775, 2, 2, 2, 349, 2637, 148, 605, 2, 2, 2, 123, 125, 68, 2, 2, 2, 349, 165, 4362, 98, 2, 2, 228, 2, 2, 2, 1157, 2, 299, 120, 2, 120, 174, 2, 220, 175, 136, 50, 2, 4373, 228, 2, 2, 2, 656, 245, 2350, 2, 2, 2, 131, 152, 491, 2, 2, 2, 2, 1212, 2, 2, 2, 371, 78, 2, 625, 64, 1382, 2, 2, 168, 145, 2, 2, 1690, 2, 2, 2, 1355, 2, 2, 2, 52, 154, 462, 2, 89, 78, 285, 2, 145, 95],
       [2, 2, 2, 2, 2, 2, 2, 2, 249, 108, 2, 2, 2, 54, 61, 369, 2, 71, 149, 2, 2, 112, 2, 2401, 311, 2, 2, 3711, 2, 75, 2, 1829, 296, 2, 86, 320, 2, 534, 2, 263, 4821, 1301, 2, 1873, 2, 89, 78, 2, 66, 2, 2, 360, 2, 2, 58, 316, 334, 2, 2, 1716, 2, 645, 662, 2, 257, 85, 1200, 2, 1228, 2578, 83, 68, 3912, 2, 2, 165, 1539, 278, 2, 69, 2, 780, 2, 106, 2, 2, 1338, 2, 2, 2, 2, 215, 2, 610, 2, 2, 87, 326, 2, 2300, 2, 2, 2, 2, 272, 2, 57, 2, 2, 2, 2, 2, 2, 2307, 51, 2, 170, 2, 595, 116, 595, 1352, 2, 191, 79, 638, 89, 2, 2, 2, 2, 106, 607, 624, 2, 534, 2, 227, 2, 129, 113],
       [2, 2, 2, 2, 2, 2804, 2, 2040, 432, 111, 153, 103, 2, 1494, 2, 70, 131, 67, 2, 61, 2, 744, 2, 3715, 761, 61, 2, 452, 2, 2, 985, 2, 2, 59, 166, 2, 105, 216, 1239, 2, 1797, 2, 2, 2, 2, 744, 2413, 2, 2, 2, 687, 2, 2, 2, 2, 2, 3693, 2, 2, 2, 121, 59, 456, 2, 2, 2, 265, 2, 575, 111, 153, 159, 59, 2, 1447, 2, 2, 586, 482, 2, 2, 96, 59, 716, 2, 2, 172, 65, 2, 579, 2, 2, 2, 1615, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 464, 2, 314, 2, 2, 2, 719, 605, 2, 2, 202, 2, 310, 2, 3772, 3501, 2, 2722, 58, 2, 2, 537, 2116, 180, 2, 2, 413, 173, 2, 263, 112, 2, 152, 377, 2, 537, 263, 846, 579, 178, 54, 75, 71, 476, 2, 413, 263, 2504, 182, 2, 2, 75, 2306, 922, 2, 279, 131, 2895, 2, 2867, 2, 2, 2, 921, 2, 192, 2, 1219, 3890, 2, 2, 217, 4122, 1710, 537, 2, 1236, 2, 736, 2, 2, 61, 403, 2, 2, 2, 61, 4494, 2, 2, 4494, 159, 90, 263, 2311, 4319, 309, 2, 178, 2, 82, 4319, 2, 65, 2, 2, 145, 143, 2, 2, 2, 537, 746, 537, 537, 2, 2, 2, 2, 594, 2, 2, 94, 2, 3987, 2, 2, 2, 2, 538, 2, 1795, 246, 2, 2, 2, 2, 635, 2, 2, 51, 408, 2, 94, 318, 1382, 2, 2, 2, 2683, 936, 2, 2, 2, 2, 2, 2, 2, 1885, 2, 1118, 2, 80, 126, 842, 2, 2, 2, 2, 4726, 2, 4494, 2, 1550, 3633, 159, 2, 341, 2, 2733, 2, 4185, 173, 2, 90, 2, 2, 2, 2, 2, 1784, 86, 1117, 2, 3261, 2, 2, 2, 2, 2, 2, 2841, 2, 2, 1010, 2, 793, 2, 2, 1386, 1830, 2, 2, 246, 50, 2, 2, 2750, 1944, 746, 90, 2, 2, 2, 124, 2, 882, 2, 882, 496, 2, 2, 2213, 537, 121, 127, 1219, 130, 2, 2, 494, 2, 124, 2, 882, 496, 2, 341, 2, 2, 846, 2, 2, 2, 2, 1906, 2, 97, 2, 236, 2, 1311, 2, 2, 2, 2, 2, 2, 2, 91, 2, 3987, 70, 2, 882, 2, 579, 2, 2, 2, 2, 2, 537, 2, 2, 2, 2, 65, 2, 537, 75, 2, 1775, 3353, 2, 1846, 2, 2, 2, 154, 2, 2, 518, 53, 2, 2, 2, 3211, 882, 2, 399, 2, 75, 257, 3807, 2, 2, 2, 2, 456, 2, 65, 2, 2, 205, 113, 2, 2, 2, 2, 2, 2, 2, 242, 2, 91, 1202, 2, 2, 2070, 307, 2, 2, 2, 126, 93, 2, 2, 2, 188, 1076, 3222, 2, 2, 2, 2, 2348, 537, 2, 53, 537, 2, 82, 2, 2, 2, 2, 2, 280, 2, 219, 2, 2, 431, 758, 859, 2, 953, 1052, 2, 2, 2, 2, 94, 2, 2, 238, 60, 2, 2, 2, 804, 2, 2, 2, 2, 132, 
2, 67, 2, 2, 2, 2, 283, 2, 2, 2, 2, 2, 242, 955, 2, 2, 279, 2, 2, 2, 1685, 195, 2, 238, 60, 796, 2, 2, 671, 2, 2804, 2, 2, 559, 154, 888, 2, 726, 50, 2, 2, 2, 2, 566, 2, 579, 2, 64, 2574],
       [2, 249, 1323, 2, 61, 113, 2, 2, 2, 1637, 2, 2, 56, 2, 2401, 2, 457, 88, 2, 2626, 1400, 2, 3171, 2, 70, 79, 2, 706, 919, 2, 2, 355, 340, 355, 1696, 96, 143, 2, 2, 2, 289, 2, 61, 369, 71, 2359, 2, 2, 2, 131, 2073, 249, 114, 249, 229, 249, 2, 2, 2, 126, 110, 2, 473, 2, 569, 61, 419, 56, 429, 2, 1513, 2, 2, 534, 95, 474, 570, 2, 2, 124, 138, 88, 2, 421, 1543, 52, 725, 2, 61, 419, 2, 2, 1571, 2, 1543, 2, 2, 2, 2, 2, 296, 2, 3524, 2, 2, 421, 128, 74, 233, 334, 207, 126, 224, 2, 562, 298, 2167, 1272, 2, 2601, 2, 516, 988, 2, 2, 79, 120, 2, 595, 2, 784, 2, 3171, 2, 165, 170, 143, 2, 2, 2, 2, 2, 226, 251, 2, 61, 113],
       [2, 778, 128, 74, 2, 630, 163, 2, 2, 1766, 2, 1051, 2, 2, 85, 156, 2, 2, 148, 139, 121, 664, 665, 2, 2, 1361, 173, 2, 749, 2, 2, 3804, 2, 2, 226, 65, 2, 2, 127, 2, 2, 2, 2]], dtype=object)

In [5]:
for x in x_train[0:6]:
    print(len(x))


218
189
141
550
147
43

In [6]:
y_train[0:6]


Out[6]:
array([1, 0, 0, 1, 0, 0])

In [7]:
len(x_train), len(x_valid)


Out[7]:
(25000, 25000)

Restore words from index


In [8]:
word_index = keras.datasets.imdb.get_word_index()
word_index = {k:(v+3) for k,v in word_index.items()}
word_index["PAD"] = 0
word_index["START"] = 1
word_index["UNK"] = 2
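The shift by 3 exists because imdb.load_data reserves the low indices: 0 for padding, 1 for the start-of-review marker, and 2 for out-of-vocabulary words. A toy round trip (with a hypothetical three-word vocabulary, not the real IMDB ranks) makes the convention concrete:

```python
# Hypothetical word ranks, standing in for keras.datasets.imdb.get_word_index()
raw_rank = {'the': 1, 'and': 2, 'film': 3}

# Shift ranks up by 3 so 0, 1, 2 can encode the reserved tokens, as in In [8]
word_index = {w: r + 3 for w, r in raw_rank.items()}
word_index["PAD"], word_index["START"], word_index["UNK"] = 0, 1, 2
index_word = {v: k for k, v in word_index.items()}

encoded = [1, 4, 6]  # START, 'the' (rank 1 -> 4), 'film' (rank 3 -> 6)
decoded = ' '.join(index_word[i] for i in encoded)
print(decoded)  # START the film
```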

In [9]:
word_index


Out[9]:
{'fawn': 34704,
 'tsukino': 52009,
 'nunnery': 52010,
 'sonja': 16819,
 'vani': 63954,
 'woods': 1411,
 'spiders': 16118,
 'hanging': 2348,
 'woody': 2292,
 'trawling': 52011,
 ...}

In [10]:
index_word = {v:k for k,v in word_index.items()}

In [11]:
x_train[0]


Out[11]:
[2, 2, 2, 2, 2, 530, 973, 1622, 1385, 65, 458, 4468, 66, 3941, 2, 173, 2, 256, 2, 2, 100, ..., 4472, 113, 103, 2, 2, 2, 2, 2, 178, 2]

In [12]:
' '.join(index_word[id] for id in x_train[0])


Out[12]:
"UNK UNK UNK UNK UNK brilliant casting location scenery story direction everyone's really suited UNK part UNK played UNK UNK could UNK imagine being there robert UNK UNK UNK amazing actor UNK now UNK same being director UNK father came UNK UNK same scottish island UNK myself UNK UNK loved UNK fact there UNK UNK real connection UNK UNK UNK UNK witty remarks throughout UNK UNK were great UNK UNK UNK brilliant UNK much UNK UNK bought UNK UNK UNK soon UNK UNK UNK released UNK UNK UNK would recommend UNK UNK everyone UNK watch UNK UNK fly UNK UNK amazing really cried UNK UNK end UNK UNK UNK sad UNK UNK know what UNK say UNK UNK cry UNK UNK UNK UNK must UNK been good UNK UNK definitely UNK also UNK UNK UNK two little UNK UNK played UNK UNK UNK norman UNK paul UNK were UNK brilliant children UNK often left UNK UNK UNK UNK list UNK think because UNK stars UNK play them UNK grown up UNK such UNK big UNK UNK UNK whole UNK UNK these children UNK amazing UNK should UNK UNK UNK what UNK UNK done don't UNK think UNK whole story UNK UNK lovely because UNK UNK true UNK UNK someone's life after UNK UNK UNK UNK UNK us UNK"

In [13]:
(all_x_train,_),(all_x_valid,_) = imdb.load_data()

In [14]:
' '.join(index_word[id] for id in all_x_train[0])


Out[14]:
"START this film was just brilliant casting location scenery story direction everyone's really suited the part they played and you could just imagine being there robert redford's is an amazing actor and now the same being director norman's father came from the same scottish island as myself so i loved the fact there was a real connection with this film the witty remarks throughout the film were great it was just brilliant so much that i bought the film as soon as it was released for retail and would recommend it to everyone to watch and the fly fishing was amazing really cried at the end it was so sad and you know what they say if you cry at a film it must have been good and this definitely was also congratulations to the two little boy's that played the part's of norman and paul they were just brilliant children are often left out of the praising list i think because the stars that play them all grown up are such a big profile for the whole film but these children are amazing and should be praised for what they have done don't you think the whole story was so lovely because it was true and was someone's life after all that was shared with us all"

Preprocess data


In [15]:
x_train = pad_sequences(x_train, maxlen=max_review_length, padding=pad_type, truncating=trunc_type, value=0)
x_valid = pad_sequences(x_valid, maxlen=max_review_length, padding=pad_type, truncating=trunc_type, value=0)
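Because pad_type and trunc_type are both 'pre', every review is cut down to (or padded out to) its last max_review_length tokens, with zeros added on the left. A minimal pure-Python sketch of that behavior (not the actual pad_sequences implementation):

```python
def pad_pre(seqs, maxlen, value=0):
    """Sketch of pad_sequences with padding='pre', truncating='pre':
    keep the LAST maxlen tokens and left-pad short sequences with `value`."""
    out = []
    for s in seqs:
        s = s[-maxlen:]                              # 'pre' truncation drops the front
        out.append([value] * (maxlen - len(s)) + s)  # 'pre' padding goes on the left
    return out

print(pad_pre([[5, 6, 7, 8], [9]], maxlen=3))  # [[6, 7, 8], [0, 0, 9]]
```

This is why the first padded review in Out[16] starts at 1415: it originally had 218 tokens, and only the last 100 survive.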

In [16]:
x_train[0:6]


Out[16]:
array([[1415,    2,    2,    2,    2,  215,    2,   77,   52,    2,    2,
         407,    2,   82,    2,    2,    2,  107,  117,    2,    2,  256,
           2,    2,    2, 3766,    2,  723,    2,   71,    2,  530,  476,
           2,  400,  317,    2,    2,    2,    2, 1029,    2,  104,   88,
           2,  381,    2,  297,   98,    2, 2071,   56,    2,  141,    2,
         194,    2,    2,    2,  226,    2,    2,  134,  476,    2,  480,
           2,  144,    2,    2,    2,   51,    2,    2,  224,   92,    2,
         104,    2,  226,   65,    2,    2, 1334,   88,    2,    2,  283,
           2,    2, 4472,  113,  103,    2,    2,    2,    2,    2,  178,
           2],
       [ 163,    2, 3215,    2,    2, 1153,    2,  194,  775,    2,    2,
           2,  349, 2637,  148,  605,    2,    2,    2,  123,  125,   68,
           2,    2,    2,  349,  165, 4362,   98,    2,    2,  228,    2,
           2,    2, 1157,    2,  299,  120,    2,  120,  174,    2,  220,
         175,  136,   50,    2, 4373,  228,    2,    2,    2,  656,  245,
        2350,    2,    2,    2,  131,  152,  491,    2,    2,    2,    2,
        1212,    2,    2,    2,  371,   78,    2,  625,   64, 1382,    2,
           2,  168,  145,    2,    2, 1690,    2,    2,    2, 1355,    2,
           2,    2,   52,  154,  462,    2,   89,   78,  285,    2,  145,
          95],
       [1301,    2, 1873,    2,   89,   78,    2,   66,    2,    2,  360,
           2,    2,   58,  316,  334,    2,    2, 1716,    2,  645,  662,
           2,  257,   85, 1200,    2, 1228, 2578,   83,   68, 3912,    2,
           2,  165, 1539,  278,    2,   69,    2,  780,    2,  106,    2,
           2, 1338,    2,    2,    2,    2,  215,    2,  610,    2,    2,
          87,  326,    2, 2300,    2,    2,    2,    2,  272,    2,   57,
           2,    2,    2,    2,    2,    2, 2307,   51,    2,  170,    2,
         595,  116,  595, 1352,    2,  191,   79,  638,   89,    2,    2,
           2,    2,  106,  607,  624,    2,  534,    2,  227,    2,  129,
         113],
       [   2,    2,    2,  188, 1076, 3222,    2,    2,    2,    2, 2348,
         537,    2,   53,  537,    2,   82,    2,    2,    2,    2,    2,
         280,    2,  219,    2,    2,  431,  758,  859,    2,  953, 1052,
           2,    2,    2,    2,   94,    2,    2,  238,   60,    2,    2,
           2,  804,    2,    2,    2,    2,  132,    2,   67,    2,    2,
           2,    2,  283,    2,    2,    2,    2,    2,  242,  955,    2,
           2,  279,    2,    2,    2, 1685,  195,    2,  238,   60,  796,
           2,    2,  671,    2, 2804,    2,    2,  559,  154,  888,    2,
         726,   50,    2,    2,    2,    2,  566,    2,  579,    2,   64,
        2574],
       [   2,    2,  131, 2073,  249,  114,  249,  229,  249,    2,    2,
           2,  126,  110,    2,  473,    2,  569,   61,  419,   56,  429,
           2, 1513,    2,    2,  534,   95,  474,  570,    2,    2,  124,
         138,   88,    2,  421, 1543,   52,  725,    2,   61,  419,    2,
           2, 1571,    2, 1543,    2,    2,    2,    2,    2,  296,    2,
        3524,    2,    2,  421,  128,   74,  233,  334,  207,  126,  224,
           2,  562,  298, 2167, 1272,    2, 2601,    2,  516,  988,    2,
           2,   79,  120,    2,  595,    2,  784,    2, 3171,    2,  165,
         170,  143,    2,    2,    2,    2,    2,  226,  251,    2,   61,
         113],
       [   0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
           0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
           0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
           0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
           0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
           0,    0,    2,  778,  128,   74,    2,  630,  163,    2,    2,
        1766,    2, 1051,    2,    2,   85,  156,    2,    2,  148,  139,
         121,  664,  665,    2,    2, 1361,  173,    2,  749,    2,    2,
        3804,    2,    2,  226,   65,    2,    2,  127,    2,    2,    2,
           2]], dtype=int32)

In [17]:
for x in x_train[0:6]:
    print(len(x))


100
100
100
100
100
100
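The `index_word` lookup used in the next cell (and `all_x_valid`, an unpadded copy of the validation reviews) was built in earlier cells omitted here. A sketch of the `index_word` construction, using a hypothetical three-word index in place of the real `keras.datasets.imdb.get_word_index()` dictionary:

```python
# hypothetical mini word index standing in for keras.datasets.imdb.get_word_index()
word_index = {'the': 1, 'and': 2, 'film': 19}
# shift every index up by 3 to make room for the reserved tokens,
# matching the index_from=3 default of imdb.load_data()
word_index = {word: index + 3 for word, index in word_index.items()}
word_index['PAD'], word_index['START'], word_index['UNK'] = 0, 1, 2
index_word = {index: word for word, index in word_index.items()}
print(index_word[0], index_word[22])  # → PAD film
```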

In [18]:
' '.join(index_word[id] for id in x_train[0])


Out[18]:
"cry UNK UNK UNK UNK must UNK been good UNK UNK definitely UNK also UNK UNK UNK two little UNK UNK played UNK UNK UNK norman UNK paul UNK were UNK brilliant children UNK often left UNK UNK UNK UNK list UNK think because UNK stars UNK play them UNK grown up UNK such UNK big UNK UNK UNK whole UNK UNK these children UNK amazing UNK should UNK UNK UNK what UNK UNK done don't UNK think UNK whole story UNK UNK lovely because UNK UNK true UNK UNK someone's life after UNK UNK UNK UNK UNK us UNK"

In [19]:
' '.join(index_word[id] for id in x_train[5])


Out[19]:
'PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD UNK begins better than UNK ends funny UNK UNK russian UNK crew UNK UNK other actors UNK UNK those scenes where documentary shots UNK UNK spoiler part UNK message UNK UNK contrary UNK UNK whole story UNK UNK does UNK UNK UNK UNK'

Design neural network architecture


In [20]:
model = Sequential()
model.add(Embedding(n_unique_words, n_dim, input_length=max_review_length))
model.add(Flatten())
model.add(Dense(n_dense, activation='relu'))
model.add(Dropout(dropout))
# model.add(Dense(n_dense, activation='relu'))
# model.add(Dropout(dropout))
model.add(Dense(1, activation='sigmoid')) # mathematically equivalent to softmax with two classes

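The comment on the output layer can be checked numerically: a sigmoid over a single logit z yields the same probability as a two-class softmax over the logits (z, 0). A quick sketch:

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def softmax_first(a, b):
    # probability of the first class under a two-class softmax
    ea, eb = math.exp(a), math.exp(b)
    return ea / (ea + eb)

z = 1.7  # an arbitrary logit
print(abs(sigmoid(z) - softmax_first(z, 0.0)) < 1e-12)  # → True
```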
In [21]:
model.summary() # so many parameters!


_________________________________________________________________
Layer (type)                 Output Shape              Param #   
=================================================================
embedding_1 (Embedding)      (None, 100, 64)           320000    
_________________________________________________________________
flatten_1 (Flatten)          (None, 6400)              0         
_________________________________________________________________
dense_1 (Dense)              (None, 64)                409664    
_________________________________________________________________
dropout_1 (Dropout)          (None, 64)                0         
_________________________________________________________________
dense_2 (Dense)              (None, 1)                 65        
=================================================================
Total params: 729,729
Trainable params: 729,729
Non-trainable params: 0
_________________________________________________________________

In [22]:
# embedding layer dimensions and parameters: 
n_dim, n_unique_words, n_dim*n_unique_words


Out[22]:
(64, 5000, 320000)
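Those 320,000 parameters form a trainable lookup table of shape (n_unique_words, n_dim): each token id simply selects a row, with no matrix multiplication involved. A NumPy sketch, with random weights standing in for the learned ones:

```python
import numpy as np

n_unique_words, n_dim = 5000, 64
rng = np.random.default_rng(42)
E = rng.normal(size=(n_unique_words, n_dim))  # the layer's weight matrix

review = np.array([2, 530, 973, 2, 65])  # five token ids
vectors = E[review]                      # one row looked up per token
print(vectors.shape)  # → (5, 64)
```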

In [23]:
# ...flatten:
max_review_length, n_dim, n_dim*max_review_length


Out[23]:
(100, 64, 6400)

In [24]:
# ...dense:
n_dense, n_dim*max_review_length*n_dense + n_dense # weights + biases


Out[24]:
(64, 409664)

In [25]:
# ...and output:
n_dense + 1


Out[25]:
65
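Summing the three per-layer counts above reproduces the `model.summary()` total:

```python
embedding_params = 5000 * 64        # n_unique_words * n_dim
dense_params = 100 * 64 * 64 + 64   # flattened inputs * n_dense, plus biases
output_params = 64 + 1              # n_dense weights plus one bias
total = embedding_params + dense_params + output_params
print(total)  # → 729729
```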

Configure model


In [26]:
model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])

In [27]:
modelcheckpoint = ModelCheckpoint(filepath=output_dir+"/weights.{epoch:02d}.hdf5")

In [28]:
if not os.path.exists(output_dir):
    os.makedirs(output_dir)

Train!


In [29]:
# 84.7% validation accuracy in epoch 2
model.fit(x_train, y_train, batch_size=batch_size, epochs=epochs, verbose=1, validation_data=(x_valid, y_valid), callbacks=[modelcheckpoint])


Train on 25000 samples, validate on 25000 samples
Epoch 1/4
25000/25000 [==============================] - 1s - loss: 0.5402 - acc: 0.7095 - val_loss: 0.3701 - val_acc: 0.8367
Epoch 2/4
25000/25000 [==============================] - 0s - loss: 0.2825 - acc: 0.8881 - val_loss: 0.3459 - val_acc: 0.8470
Epoch 3/4
25000/25000 [==============================] - 1s - loss: 0.1279 - acc: 0.9616 - val_loss: 0.4151 - val_acc: 0.8333
Epoch 4/4
25000/25000 [==============================] - 0s - loss: 0.0303 - acc: 0.9941 - val_loss: 0.5222 - val_acc: 0.8322
Out[29]:
<keras.callbacks.History at 0x7fc3b1192ef0>

Evaluate


In [30]:
model.load_weights(output_dir+"/weights.01.hdf5") # checkpoint filenames are zero-indexed, so weights.01 is the second epoch, the one with the best validation accuracy
model.load_weights(output_dir+"/weights.01.hdf5") # checkpoint filenames are zero-indexed, so weights.01 is the second epoch, the one with the best validation accuracy

In [31]:
y_hat = model.predict_proba(x_valid)


23552/25000 [===========================>..] - ETA: 0s

In [32]:
len(y_hat)


Out[32]:
25000

In [33]:
y_hat[0]


Out[33]:
array([ 0.88548595], dtype=float32)

In [34]:
plt.hist(y_hat)
_ = plt.axvline(x=0.5, color='orange')



In [35]:
pct_auc = roc_auc_score(y_valid, y_hat)*100.0

In [36]:
"{:0.2f}".format(pct_auc)


Out[36]:
'92.88'
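ROC AUC has a handy pairwise interpretation: it is the probability that a randomly chosen positive review is scored above a randomly chosen negative one (ties counting half). A small sketch of that computation on hypothetical toy scores:

```python
from itertools import product

y = [1, 0, 1, 0]               # toy labels
scores = [0.9, 0.2, 0.6, 0.7]  # toy model scores
pos = [s for s, label in zip(scores, y) if label == 1]
neg = [s for s, label in zip(scores, y) if label == 0]
# fraction of (positive, negative) pairs ranked correctly, ties counting half
auc = sum((p > n) + 0.5 * (p == n) for p, n in product(pos, neg)) / (len(pos) * len(neg))
print(auc)  # → 0.75
```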

In [37]:
float_y_hat = [y[0] for y in y_hat] # flatten the (n, 1) predictions to plain floats

In [38]:
ydf = pd.DataFrame(list(zip(float_y_hat, y_valid)), columns=['y_hat', 'y'])

In [39]:
ydf.head(10)


Out[39]:
y_hat y
0 0.885486 1
1 0.094784 1
2 0.825777 1
3 0.882278 1
4 0.316841 1
5 0.546900 0
6 0.041849 0
7 0.047915 0
8 0.955369 1
9 0.541998 1

In [40]:
' '.join(index_word[id] for id in all_x_valid[0])


Out[40]:
'START how his charter evolved as both man and ape was outstanding not to mention the scenery of the film christopher lambert was astonishing as lord of greystoke christopher is the soul to this masterpiece i became so with his performance i could feel my heart pounding the of the movie still moves me to this day his portrayal of john was oscar worthy as he should have been nominated for it'

In [41]:
' '.join(index_word[id] for id in all_x_valid[6])


Out[41]:
"START this movie is horrible you won't believe this hunk of junk is even a movie was better then this and was pretty frigging bad too a bunch of stupid teens crash in a desert find an old run down bungalow and end up fending off horrifically badly stop motion animated spiders pardon my french but the acting was bad as hell the person who wrote this probably didn't even know what a spider is because he had the spiders living in a colony serving an alien queen ripoff queen spider spiders do not live in colonies this movie is a piece of crud at the end the marines suddenly pop out of no where and kill all the spider without even being called if you see a copy of this movie at a video store it in gasoline and throw a match at it"

In [42]:
ydf[(ydf.y == 0) & (ydf.y_hat > 0.9)].head(10)


Out[42]:
y_hat y
25 0.957828 0
143 0.961272 0
150 0.933923 0
204 0.938723 0
258 0.921509 0
283 0.902109 0
304 0.943218 0
489 0.984264 0
542 0.913043 0
644 0.936996 0
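The filter above combines two boolean masks; each comparison must be parenthesized because `&` binds more tightly than `==` and `>`. A toy illustration with hypothetical values:

```python
import pandas as pd

df = pd.DataFrame({'y_hat': [0.96, 0.05, 0.91, 0.40], 'y': [0, 1, 0, 0]})
# negative reviews the model scored confidently positive
confident_wrong = df[(df.y == 0) & (df.y_hat > 0.9)]
print(len(confident_wrong))  # → 2
```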

In [43]:
' '.join(index_word[id] for id in all_x_valid[489])


Out[43]:
"START low budget but still creepy enough to hold your interest in another take off on the familiar frankenstein story this movie is also known as lady frankenstein the alluring frankenstein sara bay fresh from medical school arrives at her father's estate to find that he is still up to his old tricks baron frankenstein joseph cotten is murdered by his own creation and now his daughter decides to carry on the family tradition by creating herself a lover this is closer to being an eerie melodrama than horror flick supporting cast features mickey hargitay paul paul muller and herbert a rainy night could amplify the atmosphere still a fun watch"

In [44]:
ydf[(ydf.y == 1) & (ydf.y_hat < 0.1)].head(10)


Out[44]:
y_hat y
1 0.094784 1
101 0.093522 1
151 0.092449 1
239 0.043815 1
426 0.051367 1
543 0.067008 1
732 0.054247 1
926 0.081515 1
927 0.049135 1
951 0.036430 1

In [45]:
' '.join(index_word[id] for id in all_x_valid[927])


Out[45]:
"START i saw it in europe plex great movie br br this film is an exploration of the spirit and the flesh in modern times protagonist jim kirk drives an unwieldy rv across america stopping often to fill his gas guzzling tank he is middle aged and confused he fuels his thick diabetic body with cups of coffee and radio chatter he is the flesh agitated and sometimes spaced out fairly oblivious to the growing tension around him but feeling it as of discomfort br br the spirit the film through speeches and other sounds as well as what appears and goes by in the visual field the spirit eventually collides with the flesh and kirk goes down unable to comprehend what has happened to him he's been in denial about just how bad things have become due to he of all of us because we are all focused on the needs and desires of our flesh we're all in the same denial and so we like kirk are in danger of going down and being blown away by desert sands just like him"

In [ ]: