Data preprocessing

- Download data to the server
- Convert text to sequences
- Configure sequences for an RNN model

Download data to the server

Command line on the server (a Python alternative is sketched after these commands)

Path to data:
    cd /home/ubuntu/data/training/text/sentiment
Download dataset: 
    wget http://ai.stanford.edu/~amaas/data/sentiment/aclImdb_v1.tar.gz
Uncompress it:
    tar -zxvf aclImdb_v1.tar.gz
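
If you prefer to stay inside Python, a minimal sketch of the same download and extraction using only the standard library (assuming Python 3 and the same URL and destination path as above) could look like this:

    # Optional: download and extract the IMDB archive from Python (standard library only)
    import os
    import tarfile
    import urllib.request

    dest = '/home/ubuntu/data/training/text/sentiment'
    url = 'http://ai.stanford.edu/~amaas/data/sentiment/aclImdb_v1.tar.gz'
    archive = os.path.join(dest, 'aclImdb_v1.tar.gz')

    os.makedirs(dest, exist_ok=True)
    if not os.path.exists(archive):
        urllib.request.urlretrieve(url, archive)   # download once
    with tarfile.open(archive, 'r:gz') as tar:
        tar.extractall(dest)                       # creates dest/aclImdb/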

Convert text to sequences

- List all text files
- Read files into Python
- Tokenize
- Create dictionaries to recode
- Recode tokens into ids to create sequences (a toy sketch of these steps follows)
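
Before applying these steps to the full IMDB data, here is a minimal sketch of the recoding idea on a toy corpus. It uses plain `str.split()` as a stand-in for NLTK's `word_tokenize`, and reserves ids 0 and 1 the same way the notebook does (0 for padding / removed words, 1 for unknown words); `toy_corpus` and the other names are illustrative only.

    from collections import Counter

    # Toy corpus standing in for the IMDB reviews
    toy_corpus = ["the movie was great", "the movie was awful", "great acting"]

    # Tokenize (the notebook uses nltk.word_tokenize; split() is enough for a sketch)
    toy_tokens = [s.split() for s in toy_corpus]

    # Most frequent word gets the lowest id; ids 0 and 1 are reserved
    counts = Counter(w for sent in toy_tokens for w in sent)
    toy_dict = {w: i + 2 for i, (w, _) in enumerate(counts.most_common())}

    # Recode tokens into ids; words not in the dictionary map to 1 (UNK)
    toy_seqs = [[toy_dict.get(w, 1) for w in sent] for sent in toy_tokens]
    print(toy_dict)
    print(toy_seqs)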

In [1]:
#Imports and paths
from __future__ import print_function

import numpy as np

data_path='/home/ubuntu/data/training/text/sentiment/aclImdb/'

In [2]:
# Generator of the list of files in a folder and its subfolders
import os
import fnmatch

def gen_find(filepattern, toppath):
    '''
    Generator that recursively yields the files under toppath that match filepattern
    Inputs:
        filepattern(str): Shell-style wildcard pattern (e.g. '*.txt')
        toppath(str): Root path
    '''
    for path, dirlist, filelist in os.walk(toppath):
        for name in fnmatch.filter(filelist, filepattern):
            yield os.path.join(path, name)

#Test
#print(next(gen_find("*.txt", data_path+'train/pos/')))

In [3]:
def read_sentences(path):
    sentences = []
    sentences_list = gen_find("*.txt", path)
    for ff in sentences_list:
        with open(ff, 'r') as f:
            sentences.append(f.readline().strip())
    return sentences        

#Test
print(read_sentences(data_path+'train/pos/')[0:2])


["I read the book and saw the movie. Both excellent. The movie is diamond among coals during this era. Liebman and Selby dominate the screen and communicate the intensity of their characters without flaw. This film should have made them stars. Shame on the studio for not putting everything they had behind this film. It could have easily been a franchise. Release on DVD is a must and a worthy remake would revive this film. Look for it in your TV guide and if you see it listed, no matter how late, watch it. You won't be disappointed. Do yourself another favor - read the book (same title). It'll blow you away. Times have changed dramatically since those days, or at least we like to think they have.", "Let me first state that while I have viewed every episode of StarTrek at least twice, I do not consider myself a Trekker or Trekkie. Those are people who live in their parents basement and attend conventions wearing costumes with pointed rubber ears. I gave this movie a seven casting aside the fiction historical errors. The acting was better than average, but the plot held no surprises. They tried very hard to reverse engineer the technology but still the special effects were just to great a temptation. Now as to the historical errors, if you call them that, the first Capitan to pilot the Enterprise was Commander April, then Capt. Pike, Jim Kirk, etc.. According to a statement made by both Riker and Kirk we dicovered the Klingons and educated them and gave them the technology (that's the reason a prime directive was created) but like I said these are no reason to discredit this fine series. I hope the plots will get deeper, and then special effects can take a backseat."]

In [4]:
print(read_sentences(data_path+'train/neg/')[0:2])


['What Hopkins does succeed at with this effort as writer and director is giving us a sense that we know absolutely no one in the film. However, perhaps therein lies the problem. His movie has a lot of ambition and his intentions were obviously complex and drawn from very deep within, but it\'s so impersonal. There are no characters. We never know who anyone is, thus there is no investment on our part.<br /><br />It could be about a screenwriter intermingle with his own characters. Is it? Maybe. By that I don\'t mean that Slipstream is ambiguous; I mean that there is no telling. Hopkins\'s film is an experiment. On the face of it, one could make the case that it is about a would-be screenwriter, who at the very moment of his meeting with fate, realizes that life is hit and miss, and/or success is blind chance, as he is hurled into a "slipstream" of collisions between points in time, dreams, thoughts, and reality. Nevertheless, it is so unremittingly cerebral that it leaves no room for any hint of emotion, even to the tiny, quite rudimentary extent of allowing us a connection with its characters.<br /><br />I didn\'t think the nippy and flamboyant school of shaky, machine-gun-speed camera-work and editing disengaged me, but reflecting upon the film I am beginning to realize that it had a lot to do with it. There are so many movies of the past decade in which the cuts or camera movement have sound effects as well as other atmosphere-deteriorating technical doodads. I suppose in this case it was justified in that its purpose was to compose the impressionistic responsiveness of dreams. However, I knew barely anything about Slipstream when watching it, and I came out the same way. And I just do not care, because Hopkins made no effort to make us care. There are interactive movies, and there are movies that sit in a rocking chair and knit, unaware of your presence. Slipstream is the latter.', "SPOILERS (ALTHOUGH NONE THAT AREN'T REVEALED IN THE FIRST TWO MINUTES OF THE MOVIE)<br /><br />Robin Williams is actually quite good in this as the friendly, lonely, emotionally stunted loser Sy. He makes a very human, even sympathetic psycho, and really disappears into the character--no small feat for such a recognizable performer. <br /><br />Too bad the rest of the movie is such a waste. The supporting performances (and performers) wouldn't look out of place in a soft-core porno (it doesn't help that every character but Sy is made of 100% cardboard). At times, the director actually seems to be trying to frustrate suspense: we know from the very first moments a) that Sy is a complete whack-job, b) that he survives, and c) that he gets nabbed by the cops at the end. So all we're left to ponder is the hows and the whys, and the answers provided aren't all that interesting.<br /><br />The plot is plodding and contrived, and features some nonsensical moments (for instance, the husband berates his wife for her expensive tastes, even though she seems to spend all her free time at the local discount superstore). About two thirds of the way through, Sy does something so irredeemably stupid that it makes one wonder how much he actually cares about his grand revenge scheme. And the final clichéd explanation of his psychosis, right out of `Peeping Tom,' is a terrible copout.<br /><br />The dialogue is of the absolute worst sort. It's not overwritten, or awkward, or unbelievable, or bad in any other way that could be considered fun, even for bad-movie lovers. 
Instead, every line is purely, hideously functional--it's as if the director handed a plot outline to a newspaper copywriter and said, `Hey, I need a workable script on this--in an hour.' It made me want to scream, honestly.<br /><br />This movie seems to be a throwback to the suburban beware-the-help thrillers of the eighties and nineties (`The Hand That Rocks the Cradle,' e.g.), and while it's certainly unpleasant, it's never really scary. Sy's fetishism occasionally makes you feel uncomfortable, but on its own that's not enough to make the film work. In the end, lack of craftsmanship from everyone involved, except Robin Williams, sinks this one. 3 out of 10."]

In [5]:
def tokenize(sentences):
    '''
    Tokenize a list of sentences with NLTK's word_tokenize
    '''
    from nltk import word_tokenize
    print('Tokenizing...')
    tokens = []
    for sentence in sentences:
        tokens += [word_tokenize(sentence)]
    print('Done!')

    return tokens

print(tokenize(read_sentences(data_path+'train/pos/')[0:2]))


Tokenizing...
Done!
[['I', 'read', 'the', 'book', 'and', 'saw', 'the', 'movie', '.', 'Both', 'excellent', '.', 'The', 'movie', 'is', 'diamond', 'among', 'coals', 'during', 'this', 'era', '.', 'Liebman', 'and', 'Selby', 'dominate', 'the', 'screen', 'and', 'communicate', 'the', 'intensity', 'of', 'their', 'characters', 'without', 'flaw', '.', 'This', 'film', 'should', 'have', 'made', 'them', 'stars', '.', 'Shame', 'on', 'the', 'studio', 'for', 'not', 'putting', 'everything', 'they', 'had', 'behind', 'this', 'film', '.', 'It', 'could', 'have', 'easily', 'been', 'a', 'franchise', '.', 'Release', 'on', 'DVD', 'is', 'a', 'must', 'and', 'a', 'worthy', 'remake', 'would', 'revive', 'this', 'film', '.', 'Look', 'for', 'it', 'in', 'your', 'TV', 'guide', 'and', 'if', 'you', 'see', 'it', 'listed', ',', 'no', 'matter', 'how', 'late', ',', 'watch', 'it', '.', 'You', 'wo', "n't", 'be', 'disappointed', '.', 'Do', 'yourself', 'another', 'favor', '-', 'read', 'the', 'book', '(', 'same', 'title', ')', '.', 'It', "'ll", 'blow', 'you', 'away', '.', 'Times', 'have', 'changed', 'dramatically', 'since', 'those', 'days', ',', 'or', 'at', 'least', 'we', 'like', 'to', 'think', 'they', 'have', '.'], ['Let', 'me', 'first', 'state', 'that', 'while', 'I', 'have', 'viewed', 'every', 'episode', 'of', 'StarTrek', 'at', 'least', 'twice', ',', 'I', 'do', 'not', 'consider', 'myself', 'a', 'Trekker', 'or', 'Trekkie', '.', 'Those', 'are', 'people', 'who', 'live', 'in', 'their', 'parents', 'basement', 'and', 'attend', 'conventions', 'wearing', 'costumes', 'with', 'pointed', 'rubber', 'ears', '.', 'I', 'gave', 'this', 'movie', 'a', 'seven', 'casting', 'aside', 'the', 'fiction', 'historical', 'errors', '.', 'The', 'acting', 'was', 'better', 'than', 'average', ',', 'but', 'the', 'plot', 'held', 'no', 'surprises', '.', 'They', 'tried', 'very', 'hard', 'to', 'reverse', 'engineer', 'the', 'technology', 'but', 'still', 'the', 'special', 'effects', 'were', 'just', 'to', 'great', 'a', 'temptation', '.', 'Now', 'as', 'to', 'the', 'historical', 'errors', ',', 'if', 'you', 'call', 'them', 'that', ',', 'the', 'first', 'Capitan', 'to', 'pilot', 'the', 'Enterprise', 'was', 'Commander', 'April', ',', 'then', 'Capt', '.', 'Pike', ',', 'Jim', 'Kirk', ',', 'etc..', 'According', 'to', 'a', 'statement', 'made', 'by', 'both', 'Riker', 'and', 'Kirk', 'we', 'dicovered', 'the', 'Klingons', 'and', 'educated', 'them', 'and', 'gave', 'them', 'the', 'technology', '(', 'that', "'s", 'the', 'reason', 'a', 'prime', 'directive', 'was', 'created', ')', 'but', 'like', 'I', 'said', 'these', 'are', 'no', 'reason', 'to', 'discredit', 'this', 'fine', 'series', '.', 'I', 'hope', 'the', 'plots', 'will', 'get', 'deeper', ',', 'and', 'then', 'special', 'effects', 'can', 'take', 'a', 'backseat', '.']]

In [6]:
from sklearn.utils import shuffle

In [7]:
sentences_trn_pos = tokenize(read_sentences(data_path+'train/pos/'))
sentences_trn_neg = tokenize(read_sentences(data_path+'train/neg/'))
sentences_trn = sentences_trn_pos + sentences_trn_neg

y_train_trn = [1] * len(sentences_trn_pos) + [0] * len(sentences_trn_neg)
X_trn, y_trn = shuffle(sentences_trn, y_train_trn)

len(X_trn)


Tokenizing...
Done!
Tokenizing...
Done!
Out[7]:
25000

In [8]:
sentences_pos = tokenize(read_sentences(data_path+'test/pos/'))
sentences_neg = tokenize(read_sentences(data_path+'test/neg/'))
sentences_tst = sentences_pos + sentences_neg

y_test = [1] * len(sentences_pos) + [0] * len(sentences_neg)
X_tst, y_tst = shuffle(sentences_tst, y_test)

len(X_tst)


Tokenizing...
Done!
Tokenizing...
Done!
Out[8]:
25000

In [9]:
# Save tokenization with numpy
np.save('/tmp/sentiment_X_trn.npy', X_trn)
np.save('/tmp/sentiment_X_tst.npy', X_tst)
np.save('/tmp/sentiment_y_trn.npy', y_trn)
np.save('/tmp/sentiment_y_tst.npy', y_tst)

In [10]:
# Load tokenization with numpy
X_trn = np.load('/tmp/sentiment_X_trn.npy')
X_tst = np.load('/tmp/sentiment_X_tst.npy')
y_trn = np.load('/tmp/sentiment_y_trn.npy')
y_tst = np.load('/tmp/sentiment_y_tst.npy')

In [11]:
# Create the dictionary to convert words to numbers. Order it with the most frequent words first
def build_dict(sentences):
    '''
    Build dictionary of train words
    Outputs: 
     - Dictionary of word --> word index
     - Dictionary of word --> word count freq
    '''
    print('Building dictionary..')
    wordcount = dict()
    # For each word in each sentence, accumulate its frequency
    for ss in sentences:
        for w in ss:
            if w not in wordcount:
                wordcount[w] = 1
            else:
                wordcount[w] += 1

    counts = list(wordcount.values()) # List of frequencies
    keys = list(wordcount) #List of words
    
    sorted_idx = reversed(np.argsort(counts))
    
    worddict = dict()
    for idx, ss in enumerate(sorted_idx):
        worddict[keys[ss]] = idx+2  # reserve 0 (padding / removed words) and 1 (UNK)
    print( np.sum(counts), ' total words ', len(keys), ' unique words')

    return worddict, wordcount


worddict, wordcount = build_dict(sentences_trn)

print(worddict['the'], wordcount['the'])


Building dictionary..
7056532  total words  134957  unique words
2 289300

In [12]:
# Convert tokenized sentences into sequences of word ids
def generate_sequence(sentences, dictionary):
    '''
    Convert tokenized text into sequences of integers
    '''
    seqs = [None] * len(sentences)
    for idx, ss in enumerate(sentences):
        seqs[idx] = [dictionary[w] if w in dictionary else 1 for w in ss]

    return seqs

In [13]:
# Create train and test data

#Read train sentences and generate target y
train_x_pos = generate_sequence(sentences_trn_pos, worddict)
train_x_neg = generate_sequence(sentences_trn_neg, worddict)
X_train_full = train_x_pos + train_x_neg
y_train_full = [1] * len(train_x_pos) + [0] * len(train_x_neg)

print(X_train_full[0], y_train_full[0])


[15, 357, 2, 302, 5, 234, 2, 25, 4, 1499, 350, 4, 21, 25, 9, 6486, 847, 32904, 347, 19, 1111, 4, 32836, 5, 31969, 8842, 2, 314, 5, 5938, 2, 3066, 7, 82, 122, 237, 3421, 4, 61, 26, 156, 37, 109, 112, 441, 4, 5224, 33, 2, 1215, 24, 36, 1601, 346, 48, 76, 543, 19, 26, 4, 51, 96, 37, 721, 95, 6, 3245, 4, 24621, 33, 308, 9, 6, 242, 5, 6, 1708, 1022, 66, 10202, 19, 26, 4, 2383, 24, 16, 14, 150, 280, 4863, 5, 78, 34, 84, 16, 3644, 3, 85, 578, 114, 573, 3, 130, 16, 4, 221, 538, 30, 39, 722, 4, 423, 655, 198, 2090, 91, 357, 2, 302, 28, 185, 469, 27, 4, 51, 254, 2978, 34, 271, 4, 4485, 37, 1185, 7415, 274, 172, 566, 3, 54, 43, 233, 100, 50, 8, 121, 48, 37, 4] 1

In [14]:
#Read test sentences and generate target y
sentences_tst_pos = read_sentences(data_path+'test/pos/')
sentences_tst_neg = read_sentences(data_path+'test/neg/')

test_x_pos = generate_sequence(tokenize(sentences_tst_pos), worddict)
test_x_neg = generate_sequence(tokenize(sentences_tst_neg), worddict)
X_test_full = test_x_pos + test_x_neg
y_test_full = [1] * len(test_x_pos) + [0] * len(test_x_neg)

print(X_test_full[0])
print(y_test_full[0])


Tokenizing...
Done!
Tokenizing...
Done!
[19309, 28, 2, 4163, 5163, 49, 32, 4137, 18, 6809, 31, 5, 307, 2159, 2340, 27, 9, 540, 4, 118, 38, 12842, 17343, 28, 1991, 11446, 27, 225, 38, 1317, 5, 28, 809, 32, 25148, 24, 35636, 31, 27, 48582, 104, 4, 472, 3, 165, 2374, 119, 3878, 8, 98, 97, 3, 13214, 9, 575, 5, 56, 52, 1494, 295, 97, 198, 5163, 41, 6844, 48, 830, 14, 146, 5, 7688, 6, 356, 388, 28, 1855, 1, 3, 11318, 21530, 27, 8, 218, 112, 8, 3412, 18, 8515, 8, 98, 46, 29592, 8, 113, 19309, 5, 17343, 174, 101, 202, 69, 12, 13, 10, 11, 12, 13, 10, 11, 137, 189, 148, 93, 16, 1002, 4, 416, 2, 268, 132, 2340, 8, 32, 4137, 18, 6809, 31, 28, 238, 7, 72, 81, 572, 27, 15, 20, 1008, 2, 279, 3, 29, 19, 183, 20, 886, 7, 277, 4, 21, 25, 88, 30, 218, 431, 765, 24, 6, 381, 28, 765, 159, 114, 96, 16, 59, 27, 5, 2, 434, 5, 1193, 35, 183, 200, 182, 4, 478, 68, 35, 6, 184, 2560, 2186, 1599, 1359, 14, 8, 4921, 224, 235, 493, 5, 2, 26, 136, 3088, 780, 4, 21, 25, 111, 56, 6, 184, 209, 692, 36, 273, 14, 6, 235, 25, 159, 6, 1062, 145, 476, 28, 2258, 492, 1, 27, 47, 9, 1107, 5, 36, 275, 24, 956, 5, 6, 717, 14, 72, 1, 56, 38, 4835, 149, 57, 8, 143, 38, 11596, 691, 4, 320, 3450, 56, 6, 356, 2056, 119, 22, 6, 42181, 12, 13, 10, 11, 12, 13, 10, 11, 21, 134, 9, 63, 159, 1, 9, 220, 3, 2560, 2272, 5, 1453, 125, 21530, 88, 30, 37, 94, 8, 60, 29, 2652, 16, 149, 5, 2788, 12887, 28, 2, 619, 7, 19309, 27, 5, 11446, 35, 682, 22, 2, 4163, 7752, 4, 331, 537, 197, 9, 73, 2, 7752, 37, 438, 28, 60, 30, 948, 27, 5, 75, 1614, 24, 6, 5711, 5, 40, 8957, 3, 32, 118, 15, 167, 14669, 7, 5711, 41, 31, 21, 345, 325, 35, 63, 28, 85, 2349, 1831, 164, 27, 5, 19, 9, 42, 7, 2, 184, 235, 127, 8, 1528, 506, 5, 610, 14, 46, 473, 115, 4, 425, 322, 342, 4, 15, 223, 16, 6, 1376, 4]
1

Configure sequences for an RNN model

- Remove words with low frequency
- Truncate / complete sequences to the same length (see the pad_sequences sketch below)
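
Truncation and padding are handled with Keras's `pad_sequences`, which by default pads and truncates at the start of each sequence and fills with 0 (the id reserved above for padding / removed words). A minimal sketch, assuming the same `tensorflow.contrib.keras` import used later in this notebook:

    from tensorflow.contrib.keras import preprocessing

    toy = [[3, 7, 12],             # shorter than maxlen -> zero-padded on the left
           [5, 9, 4, 8, 6, 2]]     # longer than maxlen  -> truncated on the left

    padded = preprocessing.sequence.pad_sequences(toy, maxlen=5)
    print(padded)
    # [[ 0  0  3  7 12]
    #  [ 9  4  8  6  2]]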

In [15]:
#Median length of sentences
print('Median length: ', np.median([len(x) for x in X_test_full]))


Median length:  208.0

In [16]:
max_features = 50000 # Number of most frequent words to keep; less frequent words are recoded to 0
maxlen = 200  # cut texts after this number of words (among top max_features most common words)

In [17]:
# Keep only the max_features most frequent words; recode the rest to 0
def remove_features(x):
    return [[0 if w >= max_features else w for w in sen] for sen in x]

X_train = remove_features(X_train_full)
X_test  = remove_features(X_test_full)
y_train = y_train_full
y_test = y_test_full

print(X_test[1])


[1051, 25, 3, 45, 785, 7, 6, 63, 25, 20, 68, 3, 80, 3, 177, 3, 251, 3, 5, 500, 4, 15, 20, 33, 2, 1547, 7, 86, 2481, 2, 236, 1732, 12, 13, 10, 11, 12, 13, 10, 11, 401, 910, 55, 16, 3, 9, 6, 524, 453, 26, 3, 29, 15, 436, 16, 64, 93, 131, 259, 453, 4187, 12, 13, 10, 11, 12, 13, 10, 11, 0, 1, 310, 1487, 6980, 3, 47, 9, 1992, 23, 38, 12817, 336, 4, 472, 6, 619, 14, 2, 12585, 5, 126, 57, 28806, 12, 13, 10, 11, 12, 13, 10, 11, 4717, 23131, 9, 334, 5, 940, 6, 991, 257, 4, 263, 9, 46, 350, 197, 8, 2, 855, 12, 13, 10, 11, 12, 13, 10, 11, 15, 436, 2, 9542, 7827, 833, 2, 155, 3, 83, 181, 100, 139, 17, 2, 305, 123, 71, 39, 589, 3, 100, 3669, 342, 346, 49, 38, 246, 7, 710, 5, 2218, 90, 846, 9, 2, 536, 59]

In [18]:
from tensorflow.contrib.keras import preprocessing

# Cut or complete the sentences to length = maxlen
print("Pad sequences (samples x time)")

X_train = preprocessing.sequence.pad_sequences(X_train, maxlen=maxlen)
X_test = preprocessing.sequence.pad_sequences(X_test, maxlen=maxlen)

print('X_train shape:', X_train.shape)
print('X_test shape:', X_test.shape)

print(X_test[0])


Pad sequences (samples x time)
X_train shape: (25000, 200)
X_test shape: (25000, 200)
[ 2560  2186  1599  1359    14     8  4921   224   235   493     5     2
    26   136  3088   780     4    21    25   111    56     6   184   209
   692    36   273    14     6   235    25   159     6  1062   145   476
    28  2258   492     1    27    47     9  1107     5    36   275    24
   956     5     6   717    14    72     1    56    38  4835   149    57
     8   143    38 11596   691     4   320  3450    56     6   356  2056
   119    22     6 42181    12    13    10    11    12    13    10    11
    21   134     9    63   159     1     9   220     3  2560  2272     5
  1453   125 21530    88    30    37    94     8    60    29  2652    16
   149     5  2788 12887    28     2   619     7 19309    27     5 11446
    35   682    22     2  4163  7752     4   331   537   197     9    73
     2  7752    37   438    28    60    30   948    27     5    75  1614
    24     6  5711     5    40  8957     3    32   118    15   167 14669
     7  5711    41    31    21   345   325    35    63    28    85  2349
  1831   164    27     5    19     9    42     7     2   184   235   127
     8  1528   506     5   610    14    46   473   115     4   425   322
   342     4    15   223    16     6  1376     4]

In [19]:
# Shuffle data
from sklearn.utils import shuffle
X_train, y_train = shuffle(X_train, y_train, random_state=0)

In [20]:
# Export train and test data
np.save(data_path + 'X_train', X_train)
np.save(data_path + 'y_train', y_train)
np.save(data_path + 'X_test',  X_test)
np.save(data_path + 'y_test',  y_test)

In [21]:
# Export worddict
import pickle

with open(data_path + 'worddict.pickle', 'wb') as pfile:
    pickle.dump(worddict, pfile)
