MultiLayer Perceptron with Python and Theano for Document Classification

In this Python notebook, written for self-teaching purposes, I will develop a MultiLayer Perceptron and use it to train a Bag-of-Words text classifier on the 20 Newsgroups dataset.


In [1]:
import numpy as np
import re
import theano
import theano.tensor as T

from nltk import corpus
from sklearn.metrics import accuracy_score, classification_report
from sklearn.cross_validation import train_test_split
from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction import text

np.random.seed(123)  # For reproducibility

Fetching the 20 Newsgroups dataset and processing it

We continue by fetching the full 20 Newsgroups dataset (all subsets), filtering out English stopwords and converting each document into a Bag-of-Words vector with CountVectorizer. For simplicity, we only consider words formed by letters (no numbers or punctuation symbols), lower-cased, and we keep only the 5000 most frequent words.
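
To make the Bag-of-Words idea concrete, here is a small standalone sketch (not part of the pipeline below; the toy documents and the toy_vectorizer / toy_bow names are just for illustration) of the vocabulary and count matrix that CountVectorizer produces:


In [ ]:
# Toy illustration of the Bag-of-Words representation (not part of the real pipeline)
from sklearn.feature_extraction import text

toy_vectorizer = text.CountVectorizer(analyzer='word')
toy_bow = toy_vectorizer.fit_transform(["the cat sat on the mat", "the dog sat"])

print toy_vectorizer.get_feature_names()  # The learnt vocabulary, alphabetically sorted
print toy_bow.toarray()                   # One row of word counts per document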


In [2]:
def process_newsgroups_document(document):
    # To simplify, we ignore everything that isn't a word
    document = re.sub(r"[^a-zA-Z]", " ", document)

    # We lower-case the document and split it into words
    words = document.lower().split()
    
    # We filter out the stopwords for the English language
    stopwords = set(corpus.stopwords.words("english"))
    document = " ".join([word for word in words if word not in stopwords])

    return document

newsgroups = fetch_20newsgroups(subset='all')
vectorizer = text.CountVectorizer(analyzer='word', preprocessor=process_newsgroups_document, max_features=5000)
newsgroups_dataset = vectorizer.fit_transform(newsgroups.data).todense().astype(theano.config.floatX)
newsgroups_target = newsgroups.target
ng_X_train, ng_X_test, ng_y_train, ng_y_test = train_test_split(newsgroups_dataset, newsgroups_target, test_size=0.2)

# Convert the data to theano shared variables
ng_X_train = theano.shared(ng_X_train, borrow=True)
ng_y_train = theano.shared(ng_y_train, borrow=True)

Neural Network Parameters

We define all the parameters for the neural network.


In [ ]:
N = newsgroups_dataset.shape[0]  # Number of examples in the dataset.
n_input = newsgroups_dataset.shape[1]  # Number of features of the dataset. Input of the Neural Network.
n_output = len(newsgroups.target_names)  # Number of classes in the dataset. Output of the Neural Network.
n_h1 = 2500  # Size of the first layer
n_h2 = 1000  # Size of the second layer
alpha = 0.01  # Learning rate parameter
lambda_reg = 0.01  # Lambda value for regularization
epochs = 500  # Number of epochs for gradient descent
batch_size = 128  # Size of the minibatches to perform sgd
train_batches = ng_X_train.get_value().shape[0] // batch_size  # Number of full minibatches (the last partial batch is dropped)

Theano's Computation Graph

We now have to define the computation graph in Theano.

First we define the symbolic input variables and the shared variables (weight matrices and bias vectors) for the layers. The weight matrices are initialized uniformly in the range ±sqrt(6 / (fan_in + fan_out)) (the Glorot/Xavier initialization, a common choice for tanh units), while the bias vectors start at zero.


In [ ]:
# Stateless variables to handle the input
index = T.lscalar('index')  # Index of a minibatch
X = T.matrix('X')
y = T.lvector('y')

# First layer weight matrix and bias
W1 = theano.shared(
    value=np.random.uniform(
        low=-np.sqrt(6. / (n_input + n_h1)),
        high=np.sqrt(6. / (n_input + n_h1)),
        size=(n_input, n_h1)
    ).astype(theano.config.floatX),
    name='W1',
    borrow=True
)
b1 = theano.shared(
    value=np.zeros((n_h1,), dtype=theano.config.floatX),
    name='b1',
    borrow=True
)

# Second layer weight matrix and bias
W2 = theano.shared(
    value=np.random.uniform(
        low=-np.sqrt(6. / (n_h1 + n_h2)),
        high=np.sqrt(6. / (n_h1 + n_h2)),
        size=(n_h1, n_h2)
    ).astype(theano.config.floatX),
    name='W2',
    borrow=True
)
b2 = theano.shared(
    value=np.zeros((n_h2,), dtype=theano.config.floatX),
    name='b2',
    borrow=True
)

# Output layer weight matrix and bias
W3 = theano.shared(
    value=np.random.uniform(
        low=-np.sqrt(6. / (n_h2 + n_output)),
        high=np.sqrt(6. / (n_h2 + n_output)),
        size=(n_h2, n_output)
    ).astype(theano.config.floatX),
    name='W3',
    borrow=True
)
b3 = theano.shared(
    value=np.zeros((n_output,), dtype=theano.config.floatX),
    name='b3',
    borrow=True
)

Next we define the forward propagation, that is, all the activations of the network.

The activations a1 and a2 are non-linearities (we use tanh, but sigmoid, ReLU or another nonlinearity would also work), while the final activation y_out is the actual output of the network, so it is a Softmax layer.

Finally, y_pred returns the predicted class for each of the N examples by taking the argmax of the softmax output.
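
In equation form, each hidden layer applies an affine transformation followed by the tanh nonlinearity, and the softmax turns the output scores into class probabilities:

$$a_1 = \tanh(X W_1 + b_1), \qquad a_2 = \tanh(a_1 W_2 + b_2), \qquad \mathrm{softmax}(z)_k = \frac{e^{z_k}}{\sum_j e^{z_j}}$$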


In [ ]:
z1 = T.dot(X, W1) + b1  # Size: N x n_h1
a1 = T.tanh(z1)  # Size: N x n_h1

z2 = T.dot(a1, W2) + b2  # Size: N x n_h2
a2 = T.tanh(z2)  # Size: N x n_h2

z3 = T.dot(a2, W3) + b3  # Size: N x n_output
y_out = T.nnet.softmax(z3)  # Size: N x n_output

y_pred = T.argmax(y_out, axis=1) # Size: N

Now that we have our computation graph ready, we only need to define our loss function, for which we use the negative log likelihood (the categorical cross-entropy). We also add an L2 regularization term to reduce overfitting of the network.
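
Written out, with m the number of examples fed to the function, y_i the true class of example i, and ŷ denoting the softmax output (y_out in the code), the loss is:

$$J = -\frac{1}{m} \sum_{i=1}^{m} \log \hat{y}_{i,\, y_i} \;+\; \frac{\lambda}{2N} \left( \lVert W_1 \rVert_F^2 + \lVert W_2 \rVert_F^2 + \lVert W_3 \rVert_F^2 \right)$$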


In [ ]:
# Regularization term to sum in the loss function
loss_reg = 1./N * lambda_reg/2 * (T.sum(T.sqr(W1)) + T.sum(T.sqr(W2)) + T.sum(T.sqr(W3)))

# Loss function
loss = T.nnet.categorical_crossentropy(y_out, y).mean() + loss_reg

We have defined all the steps of our forward propagation; however, we still need to compile the Theano function that actually performs it. Like the variables, all functions in Theano are built from symbolic expressions: we define them by specifying their inputs, their outputs and, optionally, extra information such as updates.


In [ ]:
# Define the functions
forward_propagation = theano.function([X], y_out)
loss_calculation = theano.function([X, y], loss)
predict = theano.function([X], y_pred)

These functions are now callable, as they have been compiled by Theano. Since the weight matrices have been initialized randomly (i.e. we still have to train them), if we try to predict the classes of a couple of instances we will most likely end up with essentially random values.


In [ ]:
# The probabilities for each class given the first 2 examples of the newsgroup dataset
print forward_propagation(newsgroups_dataset[:2])

# The predicted class for each of these 2 examples
print predict(newsgroups_dataset[:2])

# The loss value computed over these 2 examples (most likely a high one)
print loss_calculation(newsgroups_dataset[:2], newsgroups.target[:2])

Now that we have the forward propagation and the loss function all set, we need to train the network so that it classifies the documents better. To do this, we first need the gradients of the loss with respect to the weight matrices and bias vectors. Theano computes them for us symbolically (so we don't have to implement backpropagation by hand).


In [ ]:
dJdW1 = T.grad(loss, wrt=W1)
dJdb1 = T.grad(loss, wrt=b1)
dJdW2 = T.grad(loss, wrt=W2)
dJdb2 = T.grad(loss, wrt=b2)
dJdW3 = T.grad(loss, wrt=W3)
dJdb3 = T.grad(loss, wrt=b3)

As all the weight matrices and bias vectors are defined as Theano shared variables, they can be updated inside functions. We therefore define a gradient_step function to do so.

gradient_step performs minibatch stochastic gradient descent: the givens clause slices one minibatch out of the training set, and the updates apply one gradient descent step to every parameter.
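
In symbols, with alpha the learning rate and J the loss computed on the current minibatch of batch_size examples, every parameter is updated as:

$$\theta \leftarrow \theta - \alpha \, \frac{\partial J}{\partial \theta}, \qquad \theta \in \{W_1, b_1, W_2, b_2, W_3, b_3\}$$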


In [ ]:
updates = [
    (W1, W1 - alpha * dJdW1),  # Update step. W1 = W1 - alpha * dJdW1
    (b1, b1 - alpha * dJdb1),  # Update step. b1 = b1 - alpha * dJdb1
    (W2, W2 - alpha * dJdW2),  # Update step. W2 = W2 - alpha * dJdW2
    (b2, b2 - alpha * dJdb2),  # Update step. b2 = b2 - alpha * dJdb2
    (W3, W3 - alpha * dJdW3),  # Update step. W3 = W3 - alpha * dJdW3
    (b3, b3 - alpha * dJdb3),  # Update step. b3 = b3 - alpha * dJdb3
]

gradient_step = theano.function(
    inputs=[index],
    outputs=loss,
    updates=updates,
    givens={
        X: ng_X_train[index * batch_size: (index + 1) * batch_size],
        y: ng_y_train[index * batch_size: (index + 1) * batch_size]
    }
)

We can now run the optimization using minibatch stochastic gradient descent: for each epoch we loop over all the minibatches and perform one gradient step per minibatch. Every 50 epochs we print the loss on the full training set to monitor progress. (Running plain gradient descent on the full batch at every step would also work, but it is much slower for a dataset of this size.)


In [ ]:
for i in xrange(epochs):  # We train for epochs times
    for mini_batch in xrange(train_batches):
        gradient_step(mini_batch)

    if i % 50 == 0:
        print "Loss for iteration {}: {}".format(
            i, loss_calculation(ng_X_train.get_value(), ng_y_train.get_value())
        )

Once we are done training the model, we evaluate it on the held-out test set and print some results.


In [ ]:
predictions = predict(ng_X_test)

print "Accuracy: {:.3f}".format(accuracy_score(ng_y_test, predictions))

print "Classification report"
print classification_report(ng_y_test, predictions, target_names=newsgroups.target_names)
