Parsing the IMDb dataset


In [1]:
# Based on
# https://github.com/fchollet/deep-learning-with-python-notebooks/blob/master/6.1-using-word-embeddings.ipynb
# https://machinelearningmastery.com/develop-word-embeddings-python-gensim/

In [2]:
import warnings
warnings.filterwarnings('ignore')

In [3]:
%matplotlib inline
%pylab inline


Populating the interactive namespace from numpy and matplotlib

In [4]:
import tensorflow as tf
tf.logging.set_verbosity(tf.logging.ERROR)
print(tf.__version__)


1.8.0

Download data:
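
The dataset itself can be fetched from the Stanford AI page. A minimal download-and-extract sketch, assuming the usual aclImdb_v1.tar.gz archive URL and the local target directory used below:

import os
import tarfile
import urllib.request

# Assumed archive URL and target directory -- adjust to your setup
url = 'http://ai.stanford.edu/~amaas/data/sentiment/aclImdb_v1.tar.gz'
target_dir = 'C:/Users/olive/Development/data'
archive_path = os.path.join(target_dir, 'aclImdb_v1.tar.gz')

if not os.path.exists(os.path.join(target_dir, 'aclImdb')):
    urllib.request.urlretrieve(url, archive_path)      # roughly 80 MB download
    with tarfile.open(archive_path, 'r:gz') as tar:
        tar.extractall(target_dir)                     # creates the aclImdb/ folder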


In [2]:
!ls -l C:/Users/olive/Development/data/aclImdb


total 1720
-rw-r--r-- 1 olive 197609 845980 Apr 12  2011 imdb.vocab
-rw-r--r-- 1 olive 197609 903029 Jun 12  2011 imdbEr.txt
-rw-r--r-- 1 olive 197609   4037 Jun 26  2011 README
drwxr-xr-x 1 olive 197609      0 Apr 12  2011 test
drwxr-xr-x 1 olive 197609      0 Jun 26  2011 train

In [4]:
!cat C:/Users/olive/Development/data/aclImdb/README


Large Movie Review Dataset v1.0

Overview

This dataset contains movie reviews along with their associated binary
sentiment polarity labels. It is intended to serve as a benchmark for
sentiment classification. This document outlines how the dataset was
gathered, and how to use the files provided. 

Dataset 

The core dataset contains 50,000 reviews split evenly into 25k train
and 25k test sets. The overall distribution of labels is balanced (25k
pos and 25k neg). We also include an additional 50,000 unlabeled
documents for unsupervised learning. 

In the entire collection, no more than 30 reviews are allowed for any
given movie because reviews for the same movie tend to have correlated
ratings. Further, the train and test sets contain a disjoint set of
movies, so no significant performance is obtained by memorizing
movie-unique terms and their associated with observed labels.  In the
labeled train/test sets, a negative review has a score <= 4 out of 10,
and a positive review has a score >= 7 out of 10. Thus reviews with
more neutral ratings are not included in the train/test sets. In the
unsupervised set, reviews of any rating are included and there are an
even number of reviews > 5 and <= 5.

Files

There are two top-level directories [train/, test/] corresponding to
the training and test sets. Each contains [pos/, neg/] directories for
the reviews with binary labels positive and negative. Within these
directories, reviews are stored in text files named following the
convention [[id]_[rating].txt] where [id] is a unique id and [rating] is
the star rating for that review on a 1-10 scale. For example, the file
[test/pos/200_8.txt] is the text for a positive-labeled test set
example with unique id 200 and star rating 8/10 from IMDb. The
[train/unsup/] directory has 0 for all ratings because the ratings are
omitted for this portion of the dataset.

We also include the IMDb URLs for each review in a separate
[urls_[pos, neg, unsup].txt] file. A review with unique id 200 will
have its URL on line 200 of this file. Due the ever-changing IMDb, we
are unable to link directly to the review, but only to the movie's
review page.

In addition to the review text files, we include already-tokenized bag
of words (BoW) features that were used in our experiments. These 
are stored in .feat files in the train/test directories. Each .feat
file is in LIBSVM format, an ascii sparse-vector format for labeled
data.  The feature indices in these files start from 0, and the text
tokens corresponding to a feature index is found in [imdb.vocab]. So a
line with 0:7 in a .feat file means the first word in [imdb.vocab]
(the) appears 7 times in that review.

LIBSVM page for details on .feat file format:
http://www.csie.ntu.edu.tw/~cjlin/libsvm/

We also include [imdbEr.txt] which contains the expected rating for
each token in [imdb.vocab] as computed by (Potts, 2011). The expected
rating is a good way to get a sense for the average polarity of a word
in the dataset.

Citing the dataset

When using this dataset please cite our ACL 2011 paper which
introduces it. This paper also contains classification results which
you may want to compare against.


@InProceedings{maas-EtAl:2011:ACL-HLT2011,
  author    = {Maas, Andrew L.  and  Daly, Raymond E.  and  Pham, Peter T.  and  Huang, Dan  and  Ng, Andrew Y.  and  Potts, Christopher},
  title     = {Learning Word Vectors for Sentiment Analysis},
  booktitle = {Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies},
  month     = {June},
  year      = {2011},
  address   = {Portland, Oregon, USA},
  publisher = {Association for Computational Linguistics},
  pages     = {142--150},
  url       = {http://www.aclweb.org/anthology/P11-1015}
}

References

Potts, Christopher. 2011. On the negativity of negation. In Nan Li and
David Lutz, eds., Proceedings of Semantics and Linguistic Theory 20,
636-659.

Contact

For questions/comments/corrections please contact Andrew Maas
amaas@cs.stanford.edu
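
The README above describes the .feat files as LIBSVM-style sparse vectors whose indices point into imdb.vocab (in LIBSVM format the label -- here the star rating -- comes first on each line). A hedged sketch of reading one such line back into (word, count) pairs; the exact .feat file name is an assumption, not taken from the notebook:

# Hypothetical paths -- adjust to your local copy of the dataset
vocab_path = 'C:/Users/olive/Development/data/aclImdb/imdb.vocab'
feat_path = 'C:/Users/olive/Development/data/aclImdb/train/labeledBow.feat'

with open(vocab_path, encoding='UTF-8') as f:
    vocab = [line.strip() for line in f]

with open(feat_path, encoding='UTF-8') as f:
    first_line = f.readline().split()

rating = int(first_line[0])          # leading value is the review's star rating
# the remaining entries are index:count pairs into imdb.vocab
counts = [(vocab[int(idx)], int(cnt))
          for idx, cnt in (tok.split(':') for tok in first_line[1:])]
print(rating, counts[:5])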

In [6]:
# !ls -l C:/Users/olive/Development/data/aclImdb/train

Load the dataset


In [7]:
import os

imdb_dir = 'C:/Users/olive/Development/data/aclImdb'
train_dir = os.path.join(imdb_dir, 'train')

labels = []
texts = []

for label_type in ['neg', 'pos']:
    dir_name = os.path.join(train_dir, label_type)
    for fname in os.listdir(dir_name):
        if fname.endswith('.txt'):
            with open(os.path.join(dir_name, fname), encoding='UTF-8') as f:
                texts.append(f.read())
            labels.append(0 if label_type == 'neg' else 1)

In [8]:
len(texts)


Out[8]:
25000

In [9]:
texts[0]


Out[9]:
"Story of a man who has unnatural feelings for a pig. Starts out with a opening scene that is a terrific example of absurd comedy. A formal orchestra audience is turned into an insane, violent mob by the crazy chantings of it's singers. Unfortunately it stays absurd the WHOLE time with no general narrative eventually making it just too off putting. Even those from the era should be turned off. The cryptic dialogue would make Shakespeare seem easy to a third grader. On a technical level it's better than you might think with some good cinematography by future great Vilmos Zsigmond. Future stars Sally Kirkland and Frederic Forrest can be seen briefly."

In [10]:
# 0: neg, 1: pos
labels[0]


Out[10]:
0

In [11]:
texts[15000]


Out[11]:
"Kate Beckinsale steals the show! Bravo! Too bad Knightly ins't as good looking as Jeremy Northam. Mark Strong did a fabulous job. Bernard Hepton was perfect as Emmas father. I love the end scene (which is an addition to the novel-but well written) when the harvest is in and Knightly dines with his workers and high society friends. Emma must show that she accepts this now. She is a changed woman. That is too much too quick, but OK. I'll buy into it. Samantha Bond plays Emma's ex-governess and confidant. She is wonderful. just as I would have imagined her. I believe that when the UK does a Jane Austen its the best. American versions of English literature are done for money and not for quality. See this one!"

In [12]:
labels[15000]


Out[12]:
1

Encode each review as a sequence of word indices using a vocabulary of the 10,000 most frequent words, then pad/truncate every sequence to exactly 500 entries


In [13]:
from keras.preprocessing.text import Tokenizer
from keras.preprocessing.sequence import pad_sequences
import numpy as np

maxlen = 500  # We will cut reviews after 500 words
max_words = 10000  # We will only consider the top 10,000 words in the dataset

tokenizer = Tokenizer(num_words=max_words)
tokenizer.fit_on_texts(texts)


Using TensorFlow backend.
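
Note that num_words does not shrink word_index: the tokenizer records every word it has seen, and the top-10,000 restriction is only applied later by texts_to_sequences / texts_to_matrix. A quick check (a sketch, not part of the original run):

# word_index holds the full vocabulary; only the most frequent
# max_words - 1 ranks are actually used when converting texts
print('distinct words seen:', len(tokenizer.word_index))
print('words actually used:', max_words)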

In [19]:
# Tokenizer?

In [22]:
tokenizer.word_index


Out[22]:
{'the': 1,
 'and': 2,
 'a': 3,
 'of': 4,
 'to': 5,
 'is': 6,
 'br': 7,
 'in': 8,
 'it': 9,
 'i': 10,
 'this': 11,
 'that': 12,
 'was': 13,
 'as': 14,
 'for': 15,
 'with': 16,
 'movie': 17,
 'but': 18,
 'film': 19,
 'on': 20,
 'not': 21,
 'you': 22,
 'are': 23,
 'his': 24,
 'have': 25,
 'he': 26,
 'be': 27,
 'one': 28,
 'all': 29,
 'at': 30,
 'by': 31,
 'an': 32,
 'they': 33,
 'who': 34,
 'so': 35,
 'from': 36,
 'like': 37,
 'her': 38,
 'or': 39,
 'just': 40,
 'about': 41,
 "it's": 42,
 'out': 43,
 'has': 44,
 'if': 45,
 'some': 46,
 'there': 47,
 'what': 48,
 'good': 49,
 'more': 50,
 'when': 51,
 'very': 52,
 'up': 53,
 'no': 54,
 'time': 55,
 'she': 56,
 'even': 57,
 'my': 58,
 'would': 59,
 'which': 60,
 'only': 61,
 'story': 62,
 'really': 63,
 'see': 64,
 'their': 65,
 'had': 66,
 'can': 67,
 'were': 68,
 'me': 69,
 'well': 70,
 'than': 71,
 'we': 72,
 'much': 73,
 'been': 74,
 'bad': 75,
 'get': 76,
 'will': 77,
 'do': 78,
 'also': 79,
 'into': 80,
 'people': 81,
 'other': 82,
 'first': 83,
 'great': 84,
 'because': 85,
 'how': 86,
 'him': 87,
 'most': 88,
 "don't": 89,
 'made': 90,
 'its': 91,
 'then': 92,
 'way': 93,
 'make': 94,
 'them': 95,
 'too': 96,
 'could': 97,
 'any': 98,
 'movies': 99,
 'after': 100,
 'think': 101,
 'characters': 102,
 'watch': 103,
 'two': 104,
 'films': 105,
 'character': 106,
 'seen': 107,
 'many': 108,
 'being': 109,
 'life': 110,
 'plot': 111,
 'never': 112,
 'acting': 113,
 'little': 114,
 'best': 115,
 'love': 116,
 'over': 117,
 'where': 118,
 'did': 119,
 'show': 120,
 'know': 121,
 'off': 122,
 'ever': 123,
 'does': 124,
 'better': 125,
 'your': 126,
 'end': 127,
 'still': 128,
 'man': 129,
 'here': 130,
 'these': 131,
 'say': 132,
 'scene': 133,
 'while': 134,
 'why': 135,
 'scenes': 136,
 'go': 137,
 'such': 138,
 'something': 139,
 'through': 140,
 'should': 141,
 'back': 142,
 "i'm": 143,
 'real': 144,
 'those': 145,
 'watching': 146,
 'now': 147,
 'though': 148,
 "doesn't": 149,
 'years': 150,
 'old': 151,
 'thing': 152,
 'actors': 153,
 'work': 154,
 '10': 155,
 'before': 156,
 'another': 157,
 "didn't": 158,
 'new': 159,
 'funny': 160,
 'nothing': 161,
 'actually': 162,
 'makes': 163,
 'director': 164,
 'look': 165,
 'find': 166,
 'going': 167,
 'few': 168,
 'same': 169,
 'part': 170,
 'again': 171,
 'every': 172,
 'lot': 173,
 'cast': 174,
 'us': 175,
 'quite': 176,
 'down': 177,
 'want': 178,
 'world': 179,
 'things': 180,
 'pretty': 181,
 'young': 182,
 'seems': 183,
 'around': 184,
 'got': 185,
 'horror': 186,
 'however': 187,
 "can't": 188,
 'fact': 189,
 'take': 190,
 'big': 191,
 'enough': 192,
 'long': 193,
 'thought': 194,
 "that's": 195,
 'both': 196,
 'between': 197,
 'series': 198,
 'give': 199,
 'may': 200,
 'original': 201,
 'action': 202,
 'own': 203,
 "i've": 204,
 'right': 205,
 'without': 206,
 'always': 207,
 'times': 208,
 'comedy': 209,
 'point': 210,
 'gets': 211,
 'must': 212,
 'come': 213,
 'role': 214,
 "isn't": 215,
 'saw': 216,
 'almost': 217,
 'interesting': 218,
 'least': 219,
 'family': 220,
 'done': 221,
 "there's": 222,
 'whole': 223,
 'bit': 224,
 'music': 225,
 'script': 226,
 'far': 227,
 'making': 228,
 'anything': 229,
 'guy': 230,
 'minutes': 231,
 'feel': 232,
 'last': 233,
 'since': 234,
 'might': 235,
 'performance': 236,
 "he's": 237,
 '2': 238,
 'probably': 239,
 'kind': 240,
 'am': 241,
 'away': 242,
 'yet': 243,
 'rather': 244,
 'tv': 245,
 'worst': 246,
 'girl': 247,
 'day': 248,
 'sure': 249,
 'fun': 250,
 'hard': 251,
 'woman': 252,
 'played': 253,
 'each': 254,
 'found': 255,
 'anyone': 256,
 'having': 257,
 'although': 258,
 'especially': 259,
 'our': 260,
 'course': 261,
 'believe': 262,
 'comes': 263,
 'looking': 264,
 'screen': 265,
 'trying': 266,
 'set': 267,
 'goes': 268,
 'looks': 269,
 'place': 270,
 'book': 271,
 'different': 272,
 'put': 273,
 'ending': 274,
 'money': 275,
 'maybe': 276,
 'once': 277,
 'sense': 278,
 'reason': 279,
 'true': 280,
 'actor': 281,
 'everything': 282,
 "wasn't": 283,
 'shows': 284,
 'dvd': 285,
 'three': 286,
 'worth': 287,
 'year': 288,
 'job': 289,
 'main': 290,
 'someone': 291,
 'together': 292,
 'watched': 293,
 'play': 294,
 'plays': 295,
 'american': 296,
 '1': 297,
 'said': 298,
 'effects': 299,
 'later': 300,
 'takes': 301,
 'instead': 302,
 'seem': 303,
 'beautiful': 304,
 'john': 305,
 'himself': 306,
 'version': 307,
 'audience': 308,
 'high': 309,
 'house': 310,
 'night': 311,
 'during': 312,
 'everyone': 313,
 'left': 314,
 'special': 315,
 'seeing': 316,
 'half': 317,
 'excellent': 318,
 'wife': 319,
 'star': 320,
 'shot': 321,
 'war': 322,
 'idea': 323,
 'nice': 324,
 'black': 325,
 'less': 326,
 'mind': 327,
 'simply': 328,
 'read': 329,
 'second': 330,
 'else': 331,
 "you're": 332,
 'father': 333,
 'fan': 334,
 'help': 335,
 'poor': 336,
 'completely': 337,
 'death': 338,
 '3': 339,
 'used': 340,
 'home': 341,
 'either': 342,
 'short': 343,
 'line': 344,
 'given': 345,
 'men': 346,
 'top': 347,
 'dead': 348,
 'budget': 349,
 'try': 350,
 'performances': 351,
 'wrong': 352,
 'classic': 353,
 'boring': 354,
 'enjoy': 355,
 'need': 356,
 'rest': 357,
 'use': 358,
 'hollywood': 359,
 'kids': 360,
 'low': 361,
 'production': 362,
 'until': 363,
 'along': 364,
 'full': 365,
 'friends': 366,
 'camera': 367,
 'truly': 368,
 'women': 369,
 'awful': 370,
 'video': 371,
 'next': 372,
 'tell': 373,
 'remember': 374,
 'couple': 375,
 'stupid': 376,
 'start': 377,
 'stars': 378,
 'perhaps': 379,
 'mean': 380,
 'sex': 381,
 'came': 382,
 'recommend': 383,
 'let': 384,
 'moments': 385,
 'wonderful': 386,
 'episode': 387,
 'understand': 388,
 'small': 389,
 'face': 390,
 'terrible': 391,
 'playing': 392,
 'school': 393,
 'getting': 394,
 'written': 395,
 'often': 396,
 'doing': 397,
 'keep': 398,
 'early': 399,
 'name': 400,
 'perfect': 401,
 'style': 402,
 'human': 403,
 'definitely': 404,
 'gives': 405,
 'others': 406,
 'itself': 407,
 'lines': 408,
 'live': 409,
 'become': 410,
 'dialogue': 411,
 'person': 412,
 'lost': 413,
 'finally': 414,
 'piece': 415,
 'head': 416,
 'felt': 417,
 'case': 418,
 'yes': 419,
 'liked': 420,
 'supposed': 421,
 'title': 422,
 "couldn't": 423,
 'absolutely': 424,
 'white': 425,
 'against': 426,
 'boy': 427,
 'picture': 428,
 'sort': 429,
 'worse': 430,
 'certainly': 431,
 'went': 432,
 'entire': 433,
 'waste': 434,
 'cinema': 435,
 'problem': 436,
 'hope': 437,
 'entertaining': 438,
 "she's": 439,
 'mr': 440,
 'overall': 441,
 'evil': 442,
 'called': 443,
 'loved': 444,
 'based': 445,
 'oh': 446,
 'several': 447,
 'fans': 448,
 'mother': 449,
 'drama': 450,
 'beginning': 451,
 'killer': 452,
 'lives': 453,
 '5': 454,
 'direction': 455,
 'care': 456,
 'becomes': 457,
 'already': 458,
 'example': 459,
 'laugh': 460,
 'friend': 461,
 'dark': 462,
 'under': 463,
 'despite': 464,
 'seemed': 465,
 'throughout': 466,
 '4': 467,
 'turn': 468,
 'unfortunately': 469,
 'wanted': 470,
 "i'd": 471,
 '\x96': 472,
 'children': 473,
 'final': 474,
 'fine': 475,
 'history': 476,
 'amazing': 477,
 'sound': 478,
 'guess': 479,
 'heart': 480,
 'totally': 481,
 'humor': 482,
 'lead': 483,
 'writing': 484,
 'michael': 485,
 'quality': 486,
 "you'll": 487,
 'close': 488,
 'son': 489,
 'guys': 490,
 'wants': 491,
 'works': 492,
 'behind': 493,
 'tries': 494,
 'art': 495,
 'side': 496,
 'game': 497,
 'past': 498,
 'able': 499,
 'b': 500,
 'days': 501,
 'turns': 502,
 'child': 503,
 "they're": 504,
 'hand': 505,
 'flick': 506,
 'enjoyed': 507,
 'act': 508,
 'genre': 509,
 'town': 510,
 'favorite': 511,
 'soon': 512,
 'kill': 513,
 'starts': 514,
 'sometimes': 515,
 'car': 516,
 'gave': 517,
 'run': 518,
 'late': 519,
 'actress': 520,
 'etc': 521,
 'eyes': 522,
 'directed': 523,
 'horrible': 524,
 "won't": 525,
 'brilliant': 526,
 'viewer': 527,
 'parts': 528,
 'themselves': 529,
 'self': 530,
 'hour': 531,
 'expect': 532,
 'thinking': 533,
 'stories': 534,
 'stuff': 535,
 'girls': 536,
 'obviously': 537,
 'blood': 538,
 'decent': 539,
 'city': 540,
 'voice': 541,
 'highly': 542,
 'myself': 543,
 'feeling': 544,
 'fight': 545,
 'except': 546,
 'slow': 547,
 'matter': 548,
 'type': 549,
 'kid': 550,
 'anyway': 551,
 'roles': 552,
 'heard': 553,
 'killed': 554,
 'says': 555,
 'god': 556,
 'age': 557,
 'moment': 558,
 'took': 559,
 'leave': 560,
 'writer': 561,
 'strong': 562,
 'cannot': 563,
 'violence': 564,
 'police': 565,
 'hit': 566,
 'happens': 567,
 'stop': 568,
 'particularly': 569,
 'known': 570,
 'involved': 571,
 'happened': 572,
 'extremely': 573,
 'daughter': 574,
 'obvious': 575,
 'told': 576,
 'chance': 577,
 'living': 578,
 'coming': 579,
 'lack': 580,
 'experience': 581,
 'alone': 582,
 'including': 583,
 "wouldn't": 584,
 'murder': 585,
 'attempt': 586,
 's': 587,
 'james': 588,
 'please': 589,
 'happen': 590,
 'wonder': 591,
 'crap': 592,
 'brother': 593,
 'ago': 594,
 "film's": 595,
 'gore': 596,
 'complete': 597,
 'none': 598,
 'interest': 599,
 'score': 600,
 'group': 601,
 'cut': 602,
 'simple': 603,
 'save': 604,
 'looked': 605,
 'ok': 606,
 'hell': 607,
 'career': 608,
 'number': 609,
 'song': 610,
 'possible': 611,
 'seriously': 612,
 'annoying': 613,
 'exactly': 614,
 'sad': 615,
 'shown': 616,
 'running': 617,
 'serious': 618,
 'musical': 619,
 'yourself': 620,
 'taken': 621,
 'whose': 622,
 'released': 623,
 'cinematography': 624,
 'david': 625,
 'scary': 626,
 'ends': 627,
 'english': 628,
 'hero': 629,
 'usually': 630,
 'hours': 631,
 'reality': 632,
 'opening': 633,
 "i'll": 634,
 'across': 635,
 'light': 636,
 'jokes': 637,
 'today': 638,
 'hilarious': 639,
 'somewhat': 640,
 'usual': 641,
 'body': 642,
 'ridiculous': 643,
 'cool': 644,
 'started': 645,
 'level': 646,
 'view': 647,
 'relationship': 648,
 'change': 649,
 'opinion': 650,
 'happy': 651,
 'middle': 652,
 'taking': 653,
 'wish': 654,
 'finds': 655,
 'husband': 656,
 'order': 657,
 'saying': 658,
 'shots': 659,
 'talking': 660,
 'ones': 661,
 'documentary': 662,
 'huge': 663,
 'novel': 664,
 'mostly': 665,
 'female': 666,
 'robert': 667,
 'power': 668,
 'episodes': 669,
 'room': 670,
 'important': 671,
 'rating': 672,
 'talent': 673,
 'five': 674,
 'major': 675,
 'turned': 676,
 'strange': 677,
 'word': 678,
 'modern': 679,
 'call': 680,
 'apparently': 681,
 'disappointed': 682,
 'single': 683,
 'events': 684,
 'due': 685,
 'four': 686,
 'songs': 687,
 'basically': 688,
 'attention': 689,
 '7': 690,
 'knows': 691,
 'clearly': 692,
 'supporting': 693,
 'knew': 694,
 'non': 695,
 'comic': 696,
 'television': 697,
 'british': 698,
 'fast': 699,
 'earth': 700,
 'country': 701,
 'future': 702,
 'class': 703,
 'cheap': 704,
 'thriller': 705,
 'silly': 706,
 '8': 707,
 'king': 708,
 'problems': 709,
 "aren't": 710,
 'easily': 711,
 'words': 712,
 'tells': 713,
 'jack': 714,
 'miss': 715,
 'local': 716,
 'sequence': 717,
 'entertainment': 718,
 'bring': 719,
 'paul': 720,
 'beyond': 721,
 'upon': 722,
 'whether': 723,
 'predictable': 724,
 'moving': 725,
 'romantic': 726,
 'straight': 727,
 'similar': 728,
 'sets': 729,
 'review': 730,
 'oscar': 731,
 'falls': 732,
 'mystery': 733,
 'enjoyable': 734,
 'appears': 735,
 'talk': 736,
 'rock': 737,
 'needs': 738,
 'george': 739,
 'giving': 740,
 'eye': 741,
 'richard': 742,
 'within': 743,
 'ten': 744,
 'animation': 745,
 'message': 746,
 'near': 747,
 'theater': 748,
 'above': 749,
 'dull': 750,
 'sequel': 751,
 'nearly': 752,
 'theme': 753,
 'points': 754,
 'stand': 755,
 "'": 756,
 'mention': 757,
 'bunch': 758,
 'add': 759,
 'lady': 760,
 'herself': 761,
 'feels': 762,
 'release': 763,
 'red': 764,
 'team': 765,
 'storyline': 766,
 'surprised': 767,
 'ways': 768,
 'using': 769,
 'named': 770,
 "haven't": 771,
 'easy': 772,
 'lots': 773,
 'fantastic': 774,
 'begins': 775,
 'actual': 776,
 'working': 777,
 'effort': 778,
 'york': 779,
 'die': 780,
 'hate': 781,
 'french': 782,
 'tale': 783,
 'minute': 784,
 'stay': 785,
 '9': 786,
 'clear': 787,
 'elements': 788,
 'feature': 789,
 'among': 790,
 'follow': 791,
 'comments': 792,
 're': 793,
 'viewers': 794,
 'avoid': 795,
 'sister': 796,
 'typical': 797,
 'showing': 798,
 'editing': 799,
 'tried': 800,
 "what's": 801,
 'famous': 802,
 'sorry': 803,
 'dialog': 804,
 'fall': 805,
 'check': 806,
 'period': 807,
 'form': 808,
 'season': 809,
 'certain': 810,
 'filmed': 811,
 'weak': 812,
 'soundtrack': 813,
 'means': 814,
 'material': 815,
 'buy': 816,
 'realistic': 817,
 'somehow': 818,
 'figure': 819,
 'crime': 820,
 'gone': 821,
 'doubt': 822,
 'peter': 823,
 'tom': 824,
 'viewing': 825,
 'kept': 826,
 't': 827,
 'general': 828,
 'leads': 829,
 'greatest': 830,
 'space': 831,
 'lame': 832,
 'suspense': 833,
 'dance': 834,
 'brought': 835,
 'imagine': 836,
 'third': 837,
 'atmosphere': 838,
 'hear': 839,
 'particular': 840,
 'whatever': 841,
 'sequences': 842,
 'parents': 843,
 'lee': 844,
 'move': 845,
 'indeed': 846,
 'eventually': 847,
 'rent': 848,
 'learn': 849,
 'de': 850,
 'note': 851,
 'forget': 852,
 'deal': 853,
 'reviews': 854,
 'wait': 855,
 'average': 856,
 'japanese': 857,
 'poorly': 858,
 'sexual': 859,
 'okay': 860,
 'premise': 861,
 'surprise': 862,
 'zombie': 863,
 'believable': 864,
 'stage': 865,
 'sit': 866,
 'possibly': 867,
 "who's": 868,
 'decided': 869,
 'expected': 870,
 "you've": 871,
 'subject': 872,
 'nature': 873,
 'became': 874,
 'difficult': 875,
 'free': 876,
 'screenplay': 877,
 'killing': 878,
 'truth': 879,
 'romance': 880,
 'dr': 881,
 'nor': 882,
 'reading': 883,
 'needed': 884,
 'question': 885,
 'leaves': 886,
 'street': 887,
 '20': 888,
 'meets': 889,
 'hot': 890,
 'unless': 891,
 'begin': 892,
 'baby': 893,
 'credits': 894,
 'otherwise': 895,
 'imdb': 896,
 'superb': 897,
 'write': 898,
 'shame': 899,
 "let's": 900,
 'situation': 901,
 'dramatic': 902,
 'memorable': 903,
 'directors': 904,
 'earlier': 905,
 'badly': 906,
 'meet': 907,
 'open': 908,
 'disney': 909,
 'dog': 910,
 'joe': 911,
 'male': 912,
 'weird': 913,
 'forced': 914,
 'acted': 915,
 'laughs': 916,
 'sci': 917,
 'emotional': 918,
 'older': 919,
 'realize': 920,
 'fi': 921,
 'dream': 922,
 'society': 923,
 'writers': 924,
 'interested': 925,
 'footage': 926,
 'comment': 927,
 'forward': 928,
 'crazy': 929,
 'deep': 930,
 'whom': 931,
 'plus': 932,
 'beauty': 933,
 'america': 934,
 'sounds': 935,
 'fantasy': 936,
 'directing': 937,
 'keeps': 938,
 'development': 939,
 'ask': 940,
 'features': 941,
 'air': 942,
 'quickly': 943,
 'mess': 944,
 'creepy': 945,
 'towards': 946,
 'perfectly': 947,
 'mark': 948,
 'worked': 949,
 'box': 950,
 'cheesy': 951,
 'unique': 952,
 'hands': 953,
 'setting': 954,
 'plenty': 955,
 'previous': 956,
 'brings': 957,
 'result': 958,
 'total': 959,
 'e': 960,
 'effect': 961,
 'incredibly': 962,
 'personal': 963,
 'monster': 964,
 'rate': 965,
 'fire': 966,
 'business': 967,
 'leading': 968,
 'apart': 969,
 'casting': 970,
 'admit': 971,
 'background': 972,
 'powerful': 973,
 'appear': 974,
 'joke': 975,
 'girlfriend': 976,
 'telling': 977,
 'meant': 978,
 'hardly': 979,
 'present': 980,
 'christmas': 981,
 'battle': 982,
 'potential': 983,
 'create': 984,
 'break': 985,
 'bill': 986,
 'pay': 987,
 'masterpiece': 988,
 'gay': 989,
 'return': 990,
 'political': 991,
 'dumb': 992,
 'fails': 993,
 'fighting': 994,
 'various': 995,
 'era': 996,
 'portrayed': 997,
 'co': 998,
 'cop': 999,
 'secret': 1000,
 ...}

In [17]:
# tokenizer.texts_to_matrix?

In [71]:
binary_matrix = tokenizer.texts_to_matrix(texts, mode='binary')
binary_matrix


Out[71]:
array([[0., 1., 1., ..., 0., 0., 0.],
       [0., 1., 0., ..., 0., 0., 0.],
       [0., 1., 1., ..., 0., 0., 0.],
       ...,
       [0., 1., 1., ..., 0., 0., 0.],
       [0., 1., 1., ..., 0., 0., 0.],
       [0., 1., 1., ..., 0., 0., 0.]])

In [74]:
len(binary_matrix)


Out[74]:
25000

In [75]:
len(binary_matrix[0])


Out[75]:
10000

In [72]:
count_matrix = tokenizer.texts_to_matrix(texts, mode='count')
count_matrix


Out[72]:
array([[ 0.,  4.,  1., ...,  0.,  0.,  0.],
       [ 0., 53.,  0., ...,  0.,  0.,  0.],
       [ 0., 10.,  3., ...,  0.,  0.,  0.],
       ...,
       [ 0., 23.,  7., ...,  0.,  0.,  0.],
       [ 0.,  9.,  4., ...,  0.,  0.,  0.],
       [ 0.,  9.,  7., ...,  0.,  0.,  0.]])

In [77]:
len(count_matrix)


Out[77]:
25000

In [79]:
len(count_matrix[0])


Out[79]:
10000

In [81]:
tfidf_matrix = tokenizer.texts_to_matrix(texts, mode='tfidf')
tfidf_matrix


Out[81]:
array([[0.        , 1.66394589, 0.71027668, ..., 0.        , 0.        ,
        0.        ],
       [0.        , 3.4657488 , 0.        , ..., 0.        , 0.        ,
        0.        ],
       [0.        , 2.30286883, 1.49059538, ..., 0.        , 0.        ,
        0.        ],
       ...,
       [0.        , 2.88365037, 2.09241129, ..., 0.        , 0.        ,
        0.        ],
       [0.        , 2.2294017 , 1.69492924, ..., 0.        , 0.        ,
        0.        ],
       [0.        , 2.2294017 , 2.09241129, ..., 0.        , 0.        ,
        0.        ]])
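
To make the three modes concrete, here is a small illustrative example on a toy two-document corpus (not part of the original notebook): 'binary' marks presence, 'count' stores term frequencies, and 'tfidf' down-weights words that occur in many documents. Column 0 is never assigned to a word, which is why every row above starts with a zero.

toy = Tokenizer(num_words=10)
toy.fit_on_texts(['the cat sat', 'the cat sat on the mat'])

# the same document under three different vectorizations
print(toy.texts_to_matrix(['the cat sat on the mat'], mode='binary')[0])
print(toy.texts_to_matrix(['the cat sat on the mat'], mode='count')[0])
print(toy.texts_to_matrix(['the cat sat on the mat'], mode='tfidf')[0])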

In [18]:
# tokenizer.texts_to_sequences?

In [24]:
sequences = tokenizer.texts_to_sequences(texts)

In [25]:
len(sequences)


Out[25]:
25000

In [26]:
len(sequences[0])


Out[26]:
104

In [27]:
sequences[0]


Out[27]:
[62,
 4,
 3,
 129,
 34,
 44,
 7576,
 1414,
 15,
 3,
 4252,
 514,
 43,
 16,
 3,
 633,
 133,
 12,
 6,
 3,
 1301,
 459,
 4,
 1751,
 209,
 3,
 7693,
 308,
 6,
 676,
 80,
 32,
 2137,
 1110,
 3008,
 31,
 1,
 929,
 4,
 42,
 5120,
 469,
 9,
 2665,
 1751,
 1,
 223,
 55,
 16,
 54,
 828,
 1318,
 847,
 228,
 9,
 40,
 96,
 122,
 1484,
 57,
 145,
 36,
 1,
 996,
 141,
 27,
 676,
 122,
 1,
 411,
 59,
 94,
 2278,
 303,
 772,
 5,
 3,
 837,
 20,
 3,
 1755,
 646,
 42,
 125,
 71,
 22,
 235,
 101,
 16,
 46,
 49,
 624,
 31,
 702,
 84,
 702,
 378,
 3493,
 2,
 8422,
 67,
 27,
 107,
 3348]
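
Each integer in a sequence is the rank of a word in word_index, so a sequence can be mapped back to (roughly) the original text; words outside the top 10,000 were simply dropped by texts_to_sequences. A small sketch using only the tokenizer fitted above:

# invert word_index to recover words from their ranks
index_to_word = {index: word for word, index in tokenizer.word_index.items()}
print(' '.join(index_to_word.get(i, '?') for i in sequences[0][:20]))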

In [28]:
data = pad_sequences(sequences, maxlen=maxlen)

labels = np.asarray(labels)
print('Shape of data tensor:', data.shape)
print('Shape of label tensor:', labels.shape)


Shape of data tensor: (25000, 500)
Shape of label tensor: (25000,)

In [29]:
len(data[0])


Out[29]:
500

In [30]:
# by convention, index 0 is reserved for padding (no word)
data[0]


Out[30]:
array([   0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
          0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
          0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
          0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
          0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
          0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
          0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
          0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
          0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
          0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
          0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
          0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
          0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
          0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
          0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
          0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
          0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
          0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
          0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
          0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
          0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
          0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
          0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
          0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
          0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
          0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
          0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
          0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
          0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
          0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
          0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
          0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
          0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
          0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
          0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
          0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
         62,    4,    3,  129,   34,   44, 7576, 1414,   15,    3, 4252,
        514,   43,   16,    3,  633,  133,   12,    6,    3, 1301,  459,
          4, 1751,  209,    3, 7693,  308,    6,  676,   80,   32, 2137,
       1110, 3008,   31,    1,  929,    4,   42, 5120,  469,    9, 2665,
       1751,    1,  223,   55,   16,   54,  828, 1318,  847,  228,    9,
         40,   96,  122, 1484,   57,  145,   36,    1,  996,  141,   27,
        676,  122,    1,  411,   59,   94, 2278,  303,  772,    5,    3,
        837,   20,    3, 1755,  646,   42,  125,   71,   22,  235,  101,
         16,   46,   49,  624,   31,  702,   84,  702,  378, 3493,    2,
       8422,   67,   27,  107, 3348])
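
pad_sequences pads and truncates at the start of each sequence by default, which is why the non-zero indices sit at the end of data[0]. A hedged sketch of the post-padding variant, in case the beginning of each review should be kept instead:

# padding/truncating default to 'pre'; 'post' keeps the start of each review
data_post = pad_sequences(sequences, maxlen=maxlen, padding='post', truncating='post')
print(data_post[0][:10])   # the review text now starts at position 0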

Split the data into training and test sets, and check whether the class distribution stays balanced


In [31]:
from sklearn.model_selection import train_test_split

In [32]:
x_train, x_test, y_train, y_test = train_test_split(data, labels, test_size=0.2, random_state=42)

In [33]:
x_train.shape


Out[33]:
(20000, 500)

In [34]:
y_train.shape


Out[34]:
(20000,)

In [35]:
np.unique(y_train, return_counts=True)


Out[35]:
(array([0, 1]), array([ 9985, 10015], dtype=int64))

In [36]:
np.unique(y_test, return_counts=True)


Out[36]:
(array([0, 1]), array([2515, 2485], dtype=int64))

Use the stratify option to make the split exactly balanced


In [37]:
x_train, x_test, y_train, y_test = train_test_split(data, labels, test_size=0.2, random_state=42, stratify=labels)

In [38]:
np.unique(y_train, return_counts=True)


Out[38]:
(array([0, 1]), array([10000, 10000], dtype=int64))

In [39]:
np.unique(y_test, return_counts=True)


Out[39]:
(array([0, 1]), array([2500, 2500], dtype=int64))

In [ ]: