Problem Set 8 Review & Transfer Learning with word2vec

Import various modules that we need for this notebook (now using Keras 1.0.0)


In [2]:
%pylab inline

import copy

import numpy as np
import pandas as pd
import sys
import os
import re

from keras.models import Sequential
from keras.layers.core import Dense, Dropout, Activation
from keras.optimizers import SGD, RMSprop
from keras.layers.normalization import BatchNormalization
from keras.layers.wrappers import TimeDistributed
from keras.preprocessing.text import Tokenizer
from keras.preprocessing import sequence
from keras.layers.embeddings import Embedding
from keras.layers.recurrent import SimpleRNN, LSTM, GRU

from sklearn.svm import SVC
from sklearn.linear_model import LogisticRegression
from gensim.models import word2vec


Populating the interactive namespace from numpy and matplotlib
WARNING: pylab import has clobbered these variables: ['copy']
`%matplotlib` prevents importing * from pylab and numpy

I. Problem Set 8, Part 1

Let's work through a solution to the first part of problem set 8, where you applied various techniques to the STL-10 dataset.


In [2]:
dir_in = "../../../class_data/stl10/"
X_train = np.genfromtxt(dir_in + 'X_train_new.csv', delimiter=',')
Y_train = np.genfromtxt(dir_in + 'Y_train.csv', delimiter=',')
X_test = np.genfromtxt(dir_in + 'X_test_new.csv', delimiter=',')
Y_test = np.genfromtxt(dir_in + 'Y_test.csv', delimiter=',')
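
As a quick sanity check (the exact number of feature columns depends on how the features were extracted in the problem set), we can print the dimensions of the loaded arrays:

print(X_train.shape, Y_train.shape)   # roughly (5000, n_features) and (5000, 10)
print(X_test.shape, Y_test.shape)     # roughly (8000, n_features) and (8000, 10)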

And construct a flattened (integer-label) version of the response, for the linear model case:


In [3]:
Y_train_flat = np.zeros(Y_train.shape[0])
Y_test_flat  = np.zeros(Y_test.shape[0])
for i in range(10):
    Y_train_flat[Y_train[:,i] == 1] = i
    Y_test_flat[Y_test[:,i] == 1]   = i
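
Since each row of Y_train is one-hot encoded, the same flattening can be written in one line with argmax; this is just an equivalent shortcut:

Y_train_flat = Y_train.argmax(axis=1)
Y_test_flat  = Y_test.argmax(axis=1)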

(1) neural network

We now build and evaluate a neural network.


In [4]:
model = Sequential()

model.add(Dense(1024, input_shape = (X_train.shape[1],)))
model.add(Activation("relu"))
model.add(BatchNormalization())
model.add(Dropout(0.5))

model.add(Dense(1024))
model.add(Activation("relu"))
model.add(BatchNormalization())
model.add(Dropout(0.5))

model.add(Dense(1024))
model.add(Activation("relu"))
model.add(BatchNormalization())
model.add(Dropout(0.5))

model.add(Dense(10))
model.add(Activation('softmax'))

rms = RMSprop()
model.compile(loss='categorical_crossentropy', optimizer=rms,
              metrics=['accuracy'])

In [5]:
model.fit(X_train, Y_train, batch_size=32, nb_epoch=5, verbose=1)


Epoch 1/5
5000/5000 [==============================] - 20s - loss: 0.6610 - acc: 0.8184    
Epoch 2/5
5000/5000 [==============================] - 20s - loss: 0.3637 - acc: 0.9078    
Epoch 3/5
5000/5000 [==============================] - 19s - loss: 0.2322 - acc: 0.9314    
Epoch 4/5
5000/5000 [==============================] - 19s - loss: 0.1665 - acc: 0.9536    
Epoch 5/5
5000/5000 [==============================] - 20s - loss: 0.1230 - acc: 0.9646    
Out[5]:
<keras.callbacks.History at 0x11f7b9748>

In [6]:
test_rate = model.evaluate(X_test, Y_test)[1]
print("Test classification rate %0.05f" % test_rate)


8000/8000 [==============================] - 3s     
Test classification rate 0.92525

(2) support vector machine

And now, a basic linear support vector machine.


In [7]:
svc_obj = SVC(kernel='linear', C=1)
svc_obj.fit(X_train, Y_train_flat)

pred = svc_obj.predict(X_test)
pd.crosstab(pred, Y_test_flat)  # confusion matrix (not displayed here)
c_rate = sum(pred == Y_test_flat) / len(pred)
print("Test classification rate %0.05f" % c_rate)


Test classification rate 0.94088
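
The SVM above uses C=1 without any tuning. If you wanted to search over C, a minimal sketch (assuming sklearn.model_selection.GridSearchCV; in older scikit-learn releases this lives in sklearn.grid_search) would be:

from sklearn.model_selection import GridSearchCV

grid = GridSearchCV(SVC(kernel='linear'), {'C': [0.01, 0.1, 1, 10]}, cv=3)
grid.fit(X_train, Y_train_flat)
print(grid.best_params_, grid.best_score_)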

(3) penalized logistic model

And finally, an L1 penalized model:


In [8]:
lr = LogisticRegression(penalty = 'l1')
lr.fit(X_train, Y_train_flat)

pred = lr.predict(X_test)
pd.crosstab(pred, Y_test_flat)  # confusion matrix (not displayed here)
c_rate = sum(pred == Y_test_flat) / len(pred)
print("Test classification rate %0.05f" % c_rate)


Test classification rate 0.93525

II. Problem Set 8, Part 2

Now, let's read in the Chicago crime dataset and see how well we can get a neural network to perform on it.


In [9]:
dir_in = "../../../class_data/chi_python/"
X_train = np.genfromtxt(dir_in + 'chiCrimeMat_X_train.csv', delimiter=',')
Y_train = np.genfromtxt(dir_in + 'chiCrimeMat_Y_train.csv', delimiter=',')
X_test = np.genfromtxt(dir_in + 'chiCrimeMat_X_test.csv', delimiter=',')
Y_test = np.genfromtxt(dir_in + 'chiCrimeMat_Y_test.csv', delimiter=',')

Now, build a neural network for the model:


In [10]:
model = Sequential()

model.add(Dense(1024, input_shape = (434,)))
model.add(Activation("relu"))
model.add(BatchNormalization())
model.add(Dropout(0.2))

model.add(Dense(1024))
model.add(Activation("relu"))
model.add(BatchNormalization())
model.add(Dropout(0.2))

model.add(Dense(1024))
model.add(Activation("relu"))
model.add(BatchNormalization())
model.add(Dropout(0.2))

model.add(Dense(5))
model.add(Activation('softmax'))

rms = RMSprop()
model.compile(loss='categorical_crossentropy', optimizer=rms,
              metrics=['accuracy'])

In [11]:
# downsample, if need be:
num_sample = X_train.shape[0]

model.fit(X_train[:num_sample], Y_train[:num_sample], batch_size=32,
          nb_epoch=10, verbose=1)


Epoch 1/10
337619/337619 [==============================] - 579s - loss: 0.7757 - acc: 0.7334   
Epoch 2/10
337619/337619 [==============================] - 582s - loss: 0.7163 - acc: 0.7510   
Epoch 3/10
337619/337619 [==============================] - 582s - loss: 0.7024 - acc: 0.7544   
Epoch 4/10
337619/337619 [==============================] - 582s - loss: 0.6930 - acc: 0.7572   
Epoch 5/10
337619/337619 [==============================] - 587s - loss: 0.6873 - acc: 0.7588   
Epoch 6/10
337619/337619 [==============================] - 588s - loss: 0.6818 - acc: 0.7602   
Epoch 7/10
337619/337619 [==============================] - 593s - loss: 0.6775 - acc: 0.7609   
Epoch 8/10
337619/337619 [==============================] - 592s - loss: 0.6754 - acc: 0.7618   
Epoch 9/10
337619/337619 [==============================] - 593s - loss: 0.6723 - acc: 0.7625   
Epoch 10/10
337619/337619 [==============================] - 594s - loss: 0.6707 - acc: 0.7636   
Out[11]:
<keras.callbacks.History at 0x12476ceb8>

In [12]:
test_rate = model.evaluate(X_test, Y_test)[1]
print("Test classification rate %0.05f" % test_rate)


174320/174320 [==============================] - 38s    
Test classification rate 0.76057

III. Transfer Learning IMDB Sentiment analysis

Now, let's use the word2vec embeddings on the IMDB sentiment analysis corpus. This will allow us to use a significantly larger vocabulary of words. I'll start by reading in the IMDB corpus again from the raw text.


In [4]:
path = "../../../class_data/aclImdb/"

ff = [path + "train/pos/" + x for x in os.listdir(path + "train/pos")] + \
     [path + "train/neg/" + x for x in os.listdir(path + "train/neg")] + \
     [path + "test/pos/" + x for x in os.listdir(path + "test/pos")] + \
     [path + "test/neg/" + x for x in os.listdir(path + "test/neg")]

TAG_RE = re.compile(r'<[^>]+>')

def remove_tags(text):
    return TAG_RE.sub('', text)
    
input_label = ([1] * 12500 + [0] * 12500) * 2
input_text  = []

for f in ff:
    with open(f) as fin:
        input_text += [remove_tags(" ".join(fin.readlines()))]
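
The regular expression simply strips anything that looks like an HTML tag, which the raw IMDB reviews contain; for example:

remove_tags("I loved it.<br /><br />Great acting!")   # 'I loved it.Great acting!'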

I'll fit a significantly larger vocabulary this time, as the embeddings are essentially given to us for free.


In [5]:
num_words = 5000
max_len = 400
tok = Tokenizer(num_words)
tok.fit_on_texts(input_text[:25000])

In [6]:
X_train = tok.texts_to_sequences(input_text[:25000])
X_test  = tok.texts_to_sequences(input_text[25000:])
y_train = input_label[:25000]
y_test  = input_label[25000:]

X_train = sequence.pad_sequences(X_train, maxlen=max_len)
X_test  = sequence.pad_sequences(X_test,  maxlen=max_len)
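
After padding and truncating, every review is a length-400 integer sequence, so both matrices have 25,000 rows and 400 columns:

print(X_train.shape, X_test.shape)   # (25000, 400) (25000, 400)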

In [7]:
words = []
for iter in range(num_words):
    words += [key for key,value in tok.word_index.items() if value==iter+1]
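
The loop above rescans word_index once per word; an equivalent and faster way to pull out the num_words most frequent words is to sort the index by rank:

words = [w for w, idx in sorted(tok.word_index.items(), key=lambda kv: kv[1])[:num_words]]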

In [8]:
loc = "/Users/taylor/files/word2vec_python/GoogleNews-vectors-negative300.bin"
w2v = word2vec.Word2Vec.load_word2vec_format(loc, binary=True)
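
The loaded model maps words to 300-dimensional vectors and also supports similarity queries; a quick check (the exact neighbors and scores will vary by gensim version):

print(w2v['movie'].shape)              # (300,)
print(w2v.most_similar('movie', topn=3))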

In [9]:
weights = np.zeros((num_words, 300))
for idx, w in enumerate(words):
    try:
        weights[idx, :] = w2v[w]
    except KeyError:
        pass  # word not in the pretrained vocabulary; leave its row at zero
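
Any word in our 5,000-word vocabulary that is missing from the pretrained model keeps an all-zero row, so we can check the coverage directly from the weight matrix:

n_found = int(np.sum(np.any(weights != 0, axis=1)))
print("%d of %d words have a pretrained vector" % (n_found, num_words))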

In [19]:
model = Sequential()

model.add(Embedding(num_words, 300, input_length=max_len))
model.add(Dropout(0.5))

model.add(GRU(16,activation='relu'))

model.add(Dense(128))
model.add(Dropout(0.5))
model.add(Activation('relu'))

model.add(Dense(1))
model.add(Activation('sigmoid'))

model.layers[0].set_weights([weights])
model.layers[0].trainable = False

model.compile(loss='binary_crossentropy', optimizer='rmsprop', metrics=['accuracy'])

In [20]:
model.fit(X_train, y_train, batch_size=32, nb_epoch=10, verbose=1,
          validation_data=(X_test, y_test))


Train on 25000 samples, validate on 25000 samples
Epoch 1/10
25000/25000 [==============================] - 1789s - loss: 0.6452 - acc: 0.6176 - val_loss: 0.4956 - val_acc: 0.7578
Epoch 2/10
25000/25000 [==============================] - 1681s - loss: 0.4425 - acc: 0.8010 - val_loss: 0.3688 - val_acc: 0.8408
Epoch 3/10
25000/25000 [==============================] - 1937s - loss: 0.3265 - acc: 0.8652 - val_loss: 0.2933 - val_acc: 0.8745
Epoch 4/10
25000/25000 [==============================] - 1374s - loss: 0.2726 - acc: 0.8897 - val_loss: 0.3251 - val_acc: 0.8671
Epoch 5/10
25000/25000 [==============================] - 1329s - loss: 0.2406 - acc: 0.9061 - val_loss: 0.3481 - val_acc: 0.8610
Epoch 6/10
25000/25000 [==============================] - 1299s - loss: 0.2180 - acc: 0.9158 - val_loss: 0.2715 - val_acc: 0.8908
Epoch 7/10
 4416/25000 [====>.........................] - ETA: 1091s - loss: 0.1809 - acc: 0.9305
---------------------------------------------------------------------------
KeyboardInterrupt                         Traceback (most recent call last)
<ipython-input-20-9f0ab85bc1e5> in <module>()
      1 model.fit(X_train, y_train, batch_size=32, nb_epoch=10, verbose=1,
----> 2           validation_data=(X_test, y_test))

/Users/taylor/anaconda3/lib/python3.5/site-packages/keras/models.py in fit(self, x, y, batch_size, nb_epoch, verbose, callbacks, validation_split, validation_data, shuffle, class_weight, sample_weight, **kwargs)
    395                               shuffle=shuffle,
    396                               class_weight=class_weight,
--> 397                               sample_weight=sample_weight)
    398 
    399     def evaluate(self, x, y, batch_size=32, verbose=1,

/Users/taylor/anaconda3/lib/python3.5/site-packages/keras/engine/training.py in fit(self, x, y, batch_size, nb_epoch, verbose, callbacks, validation_split, validation_data, shuffle, class_weight, sample_weight)
   1009                               verbose=verbose, callbacks=callbacks,
   1010                               val_f=val_f, val_ins=val_ins, shuffle=shuffle,
-> 1011                               callback_metrics=callback_metrics)
   1012 
   1013     def evaluate(self, x, y, batch_size=32, verbose=1, sample_weight=None):

/Users/taylor/anaconda3/lib/python3.5/site-packages/keras/engine/training.py in _fit_loop(self, f, ins, out_labels, batch_size, nb_epoch, verbose, callbacks, val_f, val_ins, shuffle, callback_metrics)
    747                 batch_logs['size'] = len(batch_ids)
    748                 callbacks.on_batch_begin(batch_index, batch_logs)
--> 749                 outs = f(ins_batch)
    750                 if type(outs) != list:
    751                     outs = [outs]

/Users/taylor/anaconda3/lib/python3.5/site-packages/keras/backend/theano_backend.py in __call__(self, inputs)
    486     def __call__(self, inputs):
    487         assert type(inputs) in {list, tuple}
--> 488         return self.function(*inputs)
    489 
    490 

/Users/taylor/anaconda3/lib/python3.5/site-packages/theano/compile/function_module.py in __call__(self, *args, **kwargs)
    857         t0_fn = time.time()
    858         try:
--> 859             outputs = self.fn()
    860         except Exception:
    861             if hasattr(self.fn, 'position_of_error'):

/Users/taylor/anaconda3/lib/python3.5/site-packages/theano/gof/op.py in rval(p, i, o, n)
    909         if params is graph.NoParams:
    910             # default arguments are stored in the closure of `rval`
--> 911             def rval(p=p, i=node_input_storage, o=node_output_storage, n=node):
    912                 r = p(n, [x[0] for x in i], o)
    913                 for o in node.outputs:

KeyboardInterrupt: