Convolution1D for Text Classification

This example demonstrates the use of Convolution1D for text classification.

Wait! But ConvNets are all about images and spatial concepts. How can they be applied to text classification?

Introducing...

Translational Invariance

Translational invariance means that a system is "agnostic" with respect to its location in time, space, or some other variable.

Convolutional neural networks preserve spatial structure within images AND can recognize the same features when they appear in new and different positions.

Translational invariance can be used on a one-dimensional sequence of words, such as those from a movie review. The same properties that make the CNN model attractive for learning to recognize objects in images can help to learn structure in paragraphs of words.

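To make this concrete, here is a minimal NumPy sketch (plain NumPy, not Keras, and not part of this notebook's pipeline) showing that a single 1D filter produces the same peak response no matter where its target pattern appears in a sequence:

import numpy as np

pattern = np.array([1.0, 2.0, 1.0])        # the "word group" we want to detect
kernel = pattern                           # a filter tuned to that pattern

seq_a = np.array([0, 0, 1, 2, 1, 0, 0, 0, 0], dtype=float)  # pattern near the start
seq_b = np.array([0, 0, 0, 0, 0, 1, 2, 1, 0], dtype=float)  # same pattern near the end

def conv1d(seq, kernel):
    # valid cross-correlation with stride 1 -- what Conv1D computes per filter
    return np.array([np.dot(seq[i:i + len(kernel)], kernel)
                     for i in range(len(seq) - len(kernel) + 1)])

print(conv1d(seq_a, kernel).max())  # 6.0
print(conv1d(seq_b, kernel).max())  # 6.0 -- same peak response, different position
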
Can you predict the sentiment of a movie review (as either positive or negative)?

Let's look at the feature engineering and training of a CNN model for classifying reviews as positive or negative.


In [46]:
from __future__ import print_function

from keras.preprocessing import sequence
from keras.models import Sequential
from keras.layers import Dense, Dropout, Activation
from keras.layers import Embedding
from keras.layers import Conv1D, GlobalMaxPooling1D
from keras.datasets import imdb

import numpy as np
import matplotlib.pyplot as plt

In [37]:
# set parameters:
max_features = 5000   # vocabulary size: keep only the 5,000 most frequent words
maxlen = 400          # pad or cut each review to 400 words
batch_size = 32
embedding_dims = 50   # dimensionality of the word embeddings
filters = 250         # number of convolutional filters
kernel_size = 3       # each filter spans 3 consecutive words
hidden_dims = 250     # size of the fully connected hidden layer
epochs = 2

The keras.datasets.imdb.load_data() function loads the dataset in a format that is ready for use in neural network and deep learning models.

The words have been replaced by integers that reflect each word's overall frequency in the dataset. Each review is therefore a sequence of integers.

Load Data


In [38]:
(x_train, y_train), (x_test, y_test) = imdb.load_data(num_words=max_features)

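As an optional aside (not part of the original walkthrough), you can map a review's integers back to words. This sketch assumes load_data's default reserved indices (0 = padding, 1 = start-of-review, 2 = out-of-vocabulary), which offset each real word index by 3:

word_index = imdb.get_word_index()                         # word -> frequency rank (starting at 1)
index_to_word = {i + 3: w for w, i in word_index.items()}  # dataset index -> word
index_to_word[0], index_to_word[1], index_to_word[2] = '<pad>', '<start>', '<unk>'

print(' '.join(index_to_word.get(i, '<unk>') for i in x_train[0][:20]))
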
Pad sequences (samples x time)


In [39]:
x_train = sequence.pad_sequences(x_train, maxlen=maxlen)
x_test = sequence.pad_sequences(x_test, maxlen=maxlen)

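To see what padding does, here is a tiny illustrative call (the values are arbitrary). By default, pad_sequences pre-pads short sequences with 0 and pre-truncates long ones:

print(sequence.pad_sequences([[1, 2, 3], [1, 2, 3, 4, 5, 6]], maxlen=5))
# [[0 0 1 2 3]
#  [2 3 4 5 6]]
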
Let's explore the data set a bit:

Each review is now a fixed-length, padded sequence of word indices:


In [9]:
print(x_train)


[[   0    0    0 ...,   19  178   32]
 [   0    0    0 ...,   16  145   95]
 [   0    0    0 ...,    7  129  113]
 ..., 
 [   0    0    0 ...,    4 3586    2]
 [   0    0    0 ...,   12    9   23]
 [   0    0    0 ...,  204  131    9]]

The labels fall into two classes, representing negative and positive sentiment:


In [40]:
print(np.unique(y_train))


[0 1]

In [41]:
print(len(x_train), 'train sequences')
print(len(x_test), 'test sequences')

print('x_train shape:', x_train.shape)
print('x_test shape:', x_test.shape)


25000 train sequences
25000 test sequences
x_train shape: (25000, 400)
x_test shape: (25000, 400)

Let's build a Model


In [8]:
model = Sequential()

# we start off with an efficient embedding layer which maps
# our vocab indices into embedding_dims dimensions
model.add(Embedding(max_features,
                    embedding_dims,
                    input_length=maxlen))
model.add(Dropout(0.2))

# we add a Conv1D layer, which will learn `filters` word-group
# filters, each spanning `kernel_size` words:
model.add(Conv1D(filters,
                 kernel_size,
                 padding='valid',
                 activation='relu',
                 strides=1))
# we use global max pooling over the sequence dimension:
model.add(GlobalMaxPooling1D())

# We add a vanilla hidden layer:
model.add(Dense(hidden_dims))
model.add(Dropout(0.2))
model.add(Activation('relu'))

# We project onto a single unit output layer, and squash it with a sigmoid:
model.add(Dense(1))
model.add(Activation('sigmoid'))


Dropout?

Dropout is a regularization technique used to prevent overfitting. During training it randomly "drops out" (sets to zero) the outputs of a fraction of the units, so the network cannot rely too heavily on any single feature.

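For intuition only, here is a minimal NumPy sketch of (inverted) dropout. Keras' Dropout layer handles this automatically during training and is a no-op at evaluation time:

rng = np.random.RandomState(0)
rate = 0.2                                        # fraction of units to drop
x = np.ones((1, 10))                              # pretend activations
mask = rng.binomial(1, 1.0 - rate, size=x.shape)  # 0 = dropped, 1 = kept
print(x * mask / (1.0 - rate))                    # kept units are scaled up so the expected value is unchanged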

In [42]:
model.compile(loss='binary_crossentropy',
              optimizer='adam',
              metrics=['accuracy'])
model.fit(x_train, y_train,
          batch_size=batch_size,
          epochs=epochs,
          validation_data=(x_test, y_test))


Train on 25000 samples, validate on 25000 samples
Epoch 1/2
25000/25000 [==============================] - 157s - loss: 0.4060 - acc: 0.8000 - val_loss: 0.2950 - val_acc: 0.8745
Epoch 2/2
25000/25000 [==============================] - 157s - loss: 0.2390 - acc: 0.9036 - val_loss: 0.2769 - val_acc: 0.8858
Out[42]:
<keras.callbacks.History at 0x1116d7cc0>

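matplotlib was imported above but not used yet. As an optional sketch, if you assign the History object that fit() returns, you can plot the learning curves (this Keras version logs the metrics under 'acc' and 'val_acc', as in the output above). Note that calling fit again trains the already-trained model for two more epochs:

history = model.fit(x_train, y_train,
                    batch_size=batch_size,
                    epochs=epochs,
                    validation_data=(x_test, y_test))

plt.plot(history.history['acc'], label='train')
plt.plot(history.history['val_acc'], label='validation')
plt.xlabel('epoch')
plt.ylabel('accuracy')
plt.legend()
plt.show()
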
Model Summary


In [43]:
print(model.summary())


_________________________________________________________________
Layer (type)                 Output Shape              Param #   
=================================================================
embedding_1 (Embedding)      (None, 400, 50)           250000    
_________________________________________________________________
dropout_1 (Dropout)          (None, 400, 50)           0         
_________________________________________________________________
conv1d_1 (Conv1D)            (None, 398, 250)          37750     
_________________________________________________________________
global_max_pooling1d_1 (Glob (None, 250)               0         
_________________________________________________________________
dense_1 (Dense)              (None, 250)               62750     
_________________________________________________________________
dropout_2 (Dropout)          (None, 250)               0         
_________________________________________________________________
activation_1 (Activation)    (None, 250)               0         
_________________________________________________________________
dense_2 (Dense)              (None, 1)                 251       
_________________________________________________________________
activation_2 (Activation)    (None, 1)                 0         
=================================================================
Total params: 350,751
Trainable params: 350,751
Non-trainable params: 0
_________________________________________________________________
None

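As a quick sanity check, the parameter counts above follow directly from the hyperparameters set earlier:

print(max_features * embedding_dims)                     # embedding: 5000 * 50          = 250000
print(kernel_size * embedding_dims * filters + filters)  # conv1d:    3 * 50 * 250 + 250 =  37750
print(filters * hidden_dims + hidden_dims)               # dense_1:   250 * 250 + 250    =  62750
print(hidden_dims * 1 + 1)                               # dense_2:   250 * 1 + 1        =    251
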
Let's evaluate our model's performance


In [47]:
scores = model.evaluate(x_test, y_test, verbose=0)

print("Accuracy: %.2f%%" % (scores[1]*100))


Accuracy: 88.58%

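As a final optional sketch, you can also inspect individual predictions; the sigmoid output is the model's probability that a review is positive, which can be thresholded at 0.5:

probs = model.predict(x_test[:5])
print(probs.ravel())                       # probability of positive sentiment per review
print((probs.ravel() > 0.5).astype(int))   # predicted labels
print(y_test[:5])                          # true labels
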
More about the IMDB CNN example in Keras here.
