Deep learning is the name given to the latest wave of neural network research. Neural networks are a very old topic, dating from the first half of the 20th century, that has recently attained formidable impact in the machine learning community. There is nothing particularly difficult in deep learning. You have already visited all the mathematical principles you need in the first days of the labs of this school. At their core, deep learning models are just functions mapping vector inputs x to vector outputs y, constructed by composing linear and non-linear functions. This composition can be expressed in the form of a computation graph, where each node applies a function to its inputs and passes the result as its output. The parameters of the model are the weights given to the different inputs in nodes applying linear functions. This vaguely resembles synapse strengths in human neural networks, hence the name artificial neural networks.
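As a toy illustration (all names, shapes and values below are made up for this sketch, they are not part of the exercises), a model with one hidden layer is literally a composition of linear maps and elementwise non-linearities, one graph node per line:

import numpy as np

# y = softmax(W2 . sigmoid(W1 . x + b1) + b2)
np.random.seed(0)
x = np.random.randn(4, 1)                          # input vector (4 features)
W1, b1 = np.random.randn(3, 4), np.zeros((3, 1))   # parameters of the linear nodes
W2, b2 = np.random.randn(2, 3), np.zeros((2, 1))
z1 = np.dot(W1, x) + b1                            # linear node
h = 1.0 / (1.0 + np.exp(-z1))                      # sigmoid node
z2 = np.dot(W2, h) + b2                            # linear node
y = np.exp(z2) / np.exp(z2).sum()                  # softmax node
print y.ravel()                                    # two class probabilities, summing to 1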
Due to their compositional nature, gradient methods and the chain rule can be applied to learn the parameters of these models regardless of their complexity. See Section for a refresher on the basic concepts. We will also refer to the gradient learning methods introduced in Section 1.4.4. Today we will focus on feed-forward networks. Tomorrow we will extend today’s class to recurrent neural networks (RNNs).
Some of the changes that led to the surge of deep learning are not only improvements on the existing neural network algorithms, but also the increase in the amount of data available and computing power. In particular, the use of Graphics Processing Units (GPUs) has allowed neural networks to be applied to very large datasets. Working with GPUs is not trivial, as it requires dealing with specialized hardware. Luckily, as is often the case, we are one Python import away from solving this problem. For the particular case of deep learning, there is a growing number of Python toolboxes available that allow you to design custom computation graphs for GPUs, such as Theano or TensorFlow.
In these labs we will be working with Theano. Theano allows us to express computation graphs symbolically in terms of basic algebraic operations. It also automatically computes gradients and produces CUDA-compatible code for GPUs. The exercises are designed to give you a low-level understanding of Theano. If you only want to use pre-designed models, the Keras toolbox provides high-level operations compatible with both Theano and TensorFlow.
In [1]:
import sys
sys.path.append('../../../')
import numpy as np
import lxmls.readers.sentiment_reader as srs
# Load the Amazon sentiment corpus (books domain)
scr = srs.SentimentCorpus("books")
# The MLP code expects one example per column, hence the transpose
train_x = scr.train_X.T
train_y = scr.train_y[:, 0]
test_x = scr.test_X.T
test_y = scr.test_y[:, 0]
Go to lxmls/deep_learning/mlp.py, class NumpyMLP, method grads(), and complete the code of the NumpyMLP class with the backpropagation recursion that we just saw.
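As a reminder, written in the same convention as the code below (examples stored as columns, and the variable ent holding the error e_n at layer n; this is a paraphrase of the recursion, not a verbatim copy of the guide's equations):

\[
e_N = \frac{1}{M}\,(\tilde{z}_N - y), \qquad
e_n = \big(W_{n+1}^\top e_{n+1}\big) \odot \tilde{z}_n \odot (1 - \tilde{z}_n) \quad (n < N),
\]
\[
\nabla_{W_n} F = e_n\, \tilde{z}_{n-1}^\top, \qquad
\nabla_{b_n} F = \sum_{m=1}^{M} e_n^{(m)},
\]

with \tilde{z}_0 = x and y the one-hot encoding of the correct classes.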
In [1]:
def grads(self, x, y):
    """
    Computes the gradients of the network with respect to the cross-entropy
    error cost
    """
    # Run the forward pass and store the activations of each layer
    activations = self.forward(x, all_outputs=True)

    # For each layer, in reverse, store the gradients for each parameter
    nabla_params = [None] * (2*self.n_layers)
    for n in np.arange(self.n_layers-1, -1, -1):

        # Get weights and bias (always in even and odd positions)
        # Note that sometimes we need the weights from the next layer
        W = self.params[2*n]
        b = self.params[2*n+1]
        if n != self.n_layers-1:
            W_next = self.params[2*(n+1)]

        ######################
        # Solving Exercise 5.1
        ######################
        if n < self.n_layers - 1:
            # Backpropagate the error through the next layer's weights and
            # multiply by the derivative of the sigmoid at this layer
            ent = np.dot(W_next.T, ent)
            ent *= activations[n] * (1 - activations[n])
        else:
            # Error at the output layer. NOTE: This assumes a cross-entropy cost
            if self.actvfunc[n] == 'sigmoid':
                ent = (activations[n] - y) / y.shape[0]
            elif self.actvfunc[n] == 'softmax':
                I = index2onehot(y, W.shape[0])
                ent = (activations[n] - I) / y.shape[0]

        # Gradients of the weights and bias of this layer
        nabla_W = np.zeros(W.shape)
        for l in np.arange(ent.shape[1]):
            if n == 0:
                nabla_W += np.outer(ent[:, l], x[:, l])
            else:
                nabla_W += np.outer(ent[:, l], activations[n-1][:, l])
        nabla_b = np.sum(ent, 1, keepdims=True)
        ########################
        # End of the solution 5.1
        ########################

        # Store the gradients
        nabla_params[2*n] = nabla_W
        nabla_params[2*n+1] = nabla_b

    return nabla_params
Once you are done, try different network geometries by increasing the number of layers and the layer sizes, e.g.
In [11]:
# Neural network modules
import lxmls.deep_learning.mlp as dl
import lxmls.deep_learning.sgd as sgd
# Model parameters
geometry = [train_x.shape[0], 20, 2]
actvfunc = ['sigmoid', 'softmax']
# Instantiate model
mlp = dl.NumpyMLP(geometry, actvfunc)
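For instance, a deeper variant could look as follows (the particular geometry and the name mlp_deep are just illustrative, not part of the original exercise); it can be trained with the same SGD_train call shown below:

# A deeper model: two hidden layers of 20 units each
geometry = [train_x.shape[0], 20, 20, 2]
actvfunc = ['sigmoid', 'sigmoid', 'softmax']
mlp_deep = dl.NumpyMLP(geometry, actvfunc)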
You can also test different hyperparameters:
In [12]:
# Model parameters
n_iter = 5
bsize = 5
lrate = 0.01
# Train
sgd.SGD_train(mlp, n_iter, bsize=bsize, lrate=lrate, train_set=(train_x, train_y))
acc_train = sgd.class_acc(mlp.forward(train_x), train_y)[0]
acc_test = sgd.class_acc(mlp.forward(test_x), test_y)[0]
print "MLP (%s) Amazon Sentiment Accuracy train: %f test: %f" % (geometry, acc_train,acc_test)
5.4.4 Some final reflections on Backpropagation

If you are new to neural networks, this is arguably the most important piece of theory you should learn about deep learning. Here are some reflections that you should keep in mind.
5.5.1 An Introduction to Theano

As you may have observed, SGD training of MLPs slows down considerably when we increase the number of layers. One reason for this is that the code we use here is not very optimized; it is meant for you to learn the basic principles. However, even if the code were more optimized, it would still be very slow for reasonable network sizes. The cost of computing each linear layer is proportional to the dimensionality of the previous and current layers, which in most cases will be rather large.
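As a rough back-of-the-envelope sketch of that claim (the count below is only an order-of-magnitude estimate of the arithmetic involved, not a measured time):

# A (d_out x d_in) weight matrix applied to a batch of M examples needs
# roughly d_out * d_in * M multiply-adds
d_in = train_x.shape[0]    # input dimensionality (vocabulary size here)
d_out = 20                 # hidden layer size used in these exercises
M = train_x.shape[1]       # number of training examples
print "approx. %.1e multiply-adds per full-batch pass of layer 1" % (d_out * d_in * M)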
For this reason, most deep learning applications use Graphics Processing Units (GPUs) in their computations. This specialized hardware is normally used to accelerate computer graphics, but it can also be used for general computation-intensive tasks. However, using a GPU requires dealing with specific interfaces and operations. This is where Theano comes in. Theano is a multidimensional symbolic expression Python module with a focus on neural networks. Among other things, it provides symbolic definition of computation graphs, automatic gradient computation, and transparent compilation to CUDA-compatible code for GPUs.
Exercise 5.2 Get in contact with Theano. Learn the difference between a symbolic representation and a function. Start by implementing the first layer of our previous MLP in Numpy
In [13]:
# Numpy code
x = test_x # Test set
W1, b1 = mlp.params[:2] # Weights and bias of first layer
z1 = np.dot(W1, x) + b1 # Linear transformation
tilde_z1 = 1/(1+np.exp(-z1)) # Non-linear transformation
Now we will implement this in Theano. We start by creating the variables over which we will define the operations. For example, the symbolic input is defined as:
In [14]:
# Theano code.
# NOTE: We use an underscore to denote symbolic equivalents of Numpy variables.
# This is not a general Python convention!
import theano
import theano.tensor as T
_x = T.matrix('x')
In [15]:
_W1 = theano.shared(value=W1, name='W1', borrow=True)
_b1 = theano.shared(value=b1, name='b1', borrow=True,broadcastable=(False, True))
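Note the broadcastable=(False, True) argument for the bias. As a rough Numpy analogy of what this flag enables (the shapes below are purely illustrative):

# In Numpy a (3, 1) bias is automatically replicated (broadcast) along the
# second axis when added to a (3, 4) matrix of activations
A = np.ones((3, 4))
b = np.arange(3).reshape(3, 1)
print (A + b).shape   # (3, 4)
# In Theano this replication only happens along dimensions that were declared
# broadcastable, hence broadcastable=(False, True) for the bias _b1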
Important: one of the main differences between Numpy and Theano data is broadcasting. In Numpy, if we sum an array with shape (N, M) and one with shape (1, M), the second array is implicitly copied N times to form an (N, M) matrix. This is known as broadcasting. In Theano this is not automatic: you need to specify which dimensions are broadcastable explicitly. This matters, for example, for the bias, which has to be copied to match the number of examples in the batch. In other cases, such as variables used in recurrent neural networks, broadcast has to be set to False. Broadcasting is one of the typical sources of errors when you start working with Theano, so keep it in mind.

Now let's describe the operations we want to perform on these variables, again only symbolically. This is done by replacing our usual operations with Theano symbolic ones where necessary, e.g. the inner product dot() or the sigmoid. Some operations, like +, are automatically recognized by Theano (operator overloading).
In [16]:
_z1 = T.dot(_W1, _x) + _b1
_tilde_z1 = T.nnet.sigmoid(_z1)
# Keep in mind that naming variables is useful when debugging
_z1.name = 'z1'
_tilde_z1.name = 'tilde_z1'
In [17]:
# Show computation graph
print "\nThis is my symbolic perceptron\n"
theano.printing.debugprint(_tilde_z1)
In [18]:
# Compile
layer1 = theano.function([_x], _tilde_z1)
In [19]:
# Check that Numpy and Theano match
if np.allclose(tilde_z1, layer1(x.astype(theano.config.floatX))):
    print "\nNumpy and Theano Perceptrons are equivalent"
else:
    raise ValueError("Numpy and Theano Perceptrons are different")
Exercise 5.3 Complete the method _forward() inside of lxmls/deep_learning/mlp.py, class TheanoMLP. Note that this is called only once, at the initialization of the class. To debug your implementation, put a breakpoint at the __init__ function call. Hint: this is very similar to NumpyMLP.forward(). You just need to keep track of the symbolic variable representing the output of the network after each layer is applied, and compile the function at the end. After you are finished, instantiate a Theano class and check that the Numpy and Theano forward passes are the same.
In [2]:
def _forward(self, x, all_outputs=False):
    """
    Symbolic forward pass

    all_outputs = True returns the symbolic input and all intermediate
    activations
    """
    # This will store the activations at each layer and the input. This is
    # needed to compute backpropagation
    if all_outputs:
        activations = [x]

    # Input
    tilde_z = x

    ##########################
    # Solution to Exercise 5.3
    ##########################
    for n in range(self.n_layers):

        # Get weights and bias (always in even and odd positions)
        W = self.params[2*n]
        b = self.params[2*n+1]

        # Linear transformation
        z = T.dot(W, tilde_z) + b
        # Naming variables helps debugging, see e.g.
        # theano.printing.debugprint(tilde_z)
        z.name = 'z%d' % (n+1)

        # Non-linear transformation
        if self.actvfunc[n] == "sigmoid":
            tilde_z = T.nnet.sigmoid(z)
        elif self.actvfunc[n] == "softmax":
            tilde_z = T.nnet.softmax(z.T).T
        tilde_z.name = 'tilde_z%d' % (n+1)

        if all_outputs:
            activations.append(tilde_z)
    #################################
    # End of solution to Exercise 5.3
    #################################

    if all_outputs:
        tilde_z = activations

    return tilde_z
In [21]:
mlp_a = dl.NumpyMLP(geometry, actvfunc)
mlp_b = dl.TheanoMLP(geometry, actvfunc)
To help debugging in Theano, it is sometimes useful to switch off the optimizer. This helps Theano point out which part of the Python code generated the error:
In [22]:
theano.config.optimizer='None'
In [24]:
assert np.allclose(mlp_a.forward(test_x), mlp_b.forward(test_x)),"ERROR: Numpy and Theano forward passes differ"
Exercise 5.4 We first see an example that does not use any of the code in TheanoMLP, but rather continues from what you wrote in Ex. 5.2. In that exercise you implemented a sigmoid layer with Theano. To get some values for the weights, we used the first layer of the network trained earlier. Now we are going to use the second layer as well. This assumes that your network in Ex. 5.2 has only two layers, e.g. the recommended geometry (I, 20, 2). Make sure this is the case before starting this exercise.
For the sake of clarity, let's repeat here the part of Ex. 5.2 that we had already completed:
In [25]:
# Get the values from our MLP
W1, b1 = mlp.params[:2] # Weights and bias of first layer
# First layer symbolic variables
_x = T.matrix('x')
_W1 = theano.shared(value=W1, name='W1', borrow=True)
_b1 = theano.shared(value=b1, name='b1', borrow=True, broadcastable=(False, True))
# First layer symbolic expressions
_z1 = T.dot(_W1, _x) + _b1
_tilde_z1 = T.nnet.sigmoid(_z1)
Now we just need to complete this with the second layer, using a softmax non-linearity:
In [26]:
W2, b2 = mlp.params[2:] # Weights and bias of second (and last!) layer
# Second layer symbolic variables
_W2 = theano.shared(value=W2, name='W2', borrow=True)
_b2 = theano.shared(value=b2, name='b2', borrow=True, broadcastable=(False, True))
# Second layer symbolic expressions
_z2 = T.dot(_W2, _tilde_z1) + _b2
# NOTE: T.nnet.softmax() has no axis argument and operates row-wise, but our
# examples are stored as columns; transposing back and forth is a workaround
_tilde_z2 = T.nnet.softmax(_z2.T).T
With this, we could compile a function to obtain the output of the network _tilde_z2 for a given input _x. In this exercise, however, we are interested in obtaining the misclassification cost. This is given in Eq. 5.5. First we are going to need the symbolic variable for the correct output:
In [27]:
_y = T.ivector('y')
The posterior probability of the correct class given the input is obtained by selecting the k(m)-th softmax output, where k(m) is the index of the correct class for the m-th example x^m; the cost is the average minus log of these probabilities. If we want to do this for a vector y containing M different examples, we can write this as:
In [29]:
_F = -T.mean(T.log(_tilde_z2[_y, T.arange(_y.shape[0])]))
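To see what the indexing inside the log is doing, here is a plain Numpy analogue with made-up values:

# Softmax outputs for M=3 examples (columns) and 2 classes (rows)
p = np.array([[0.9, 0.2, 0.6],
              [0.1, 0.8, 0.4]])
y = np.array([0, 1, 0])                               # correct class index per example
print p[y, np.arange(y.shape[0])]                     # [ 0.9  0.8  0.6]
print -np.mean(np.log(p[y, np.arange(y.shape[0])]))   # the cost F for this toy batch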
Now obtaining a function that computes the gradient could not be easier:
In [30]:
_nabla_F = T.grad(_F, _W1)
nabla_F = theano.function([_x, _y], _nabla_F)
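For instance, we can evaluate the compiled gradient on the test data (casting the inputs to the types expected by the symbolic variables, as before):

# Gradient of the average cost with respect to the first layer weights
grad_W1 = nabla_F(test_x.astype(theano.config.floatX), test_y.astype('int32'))
print grad_W1.shape   # same shape as W1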
Exercise 5.4 Define the updates list. This is a list in which each element is a tuple with a parameter and the update rule to be applied to that parameter. In this case we define the plain SGD update rule, but take into account that using more complex update rules, e.g. momentum or Adam, just amounts to replacing the line of the snippet below where the updates list is defined.
In [31]:
W2, b2 = mlp_a.params[2:4]
# Second layer symbolic variables
_W2 = theano.shared(value=W2, name='W2', borrow=True)
_b2 = theano.shared(value=b2, name='b2', borrow=True,
broadcastable=(False, True))
_z2 = T.dot(_W2, _tilde_z1) + _b2
_tilde_z2 = T.nnet.softmax(_z2.T).T
# Ground truth
_y = T.ivector('y')
# Cost
_F = -T.mean(T.log(_tilde_z2[_y, T.arange(_y.shape[0])]))
# Gradient
_nabla_F = T.grad(_F, _W1)
nabla_F = theano.function([_x, _y], _nabla_F)
# Print computation graph
print "\nThis is my softmax classification cost\n"
theano.printing.debugprint(_F)
In [35]:
import time
# Understanding the mini-batch function and givens/updates parameters
# Numpy
geometry = [train_x.shape[0], 20, 2]
actvfunc = ['sigmoid', 'softmax']
mlp_a = dl.NumpyMLP(geometry, actvfunc)
#
init_t = time.clock()
sgd.SGD_train(mlp_a, n_iter, bsize=bsize, lrate=lrate, train_set=(train_x, train_y))
print "\nNumpy version took %2.2f sec" % (time.clock() - init_t)
acc_train = sgd.class_acc(mlp_a.forward(train_x), train_y)[0]
acc_test = sgd.class_acc(mlp_a.forward(test_x), test_y)[0]
print "Amazon Sentiment Accuracy train: %f test: %f\n" % (acc_train, acc_test)
# Theano grads
mlp_b = dl.TheanoMLP(geometry, actvfunc)
init_t = time.clock()
sgd.SGD_train(mlp_b, n_iter, bsize=bsize, lrate=lrate, train_set=(train_x, train_y))
print "\nCompiled gradient version took %2.2f sec" % (time.clock() - init_t)
acc_train = sgd.class_acc(mlp_b.forward(train_x), train_y)[0]
acc_test = sgd.class_acc(mlp_b.forward(test_x), test_y)[0]
print "Amazon Sentiment Accuracy train: %f test: %f\n" % (acc_train, acc_test)
# Theano compiled batch
# Cast data into the types and shapes used in the Theano graph
# IMPORTANT: This is the main source of errors when beginning with Theano
train_x = train_x.astype(theano.config.floatX)
train_y = train_y.astype('int32')
# Model
mlp_c = dl.TheanoMLP(geometry, actvfunc)
# Define the updates list. This is a list of gradient descent updates, in the
# format expected by the theano.function updates parameter: a list of tuples,
# each with a parameter and its update rule
_x = T.matrix('x')
_y = T.ivector('y')
_F = mlp_c._cost(_x, _y)
updates = [(par, par - lrate*T.grad(_F, par)) for par in mlp_c.params]
#
# Define the batch update function. This will return the cost of each batch
# and update the MLP parameters at the same time using updates
batch_up = theano.function([_x, _y], _F, updates=updates)
n_batch = int(np.ceil(float(train_x.shape[1])/bsize))
#
init_t = time.clock()
sgd.SGD_train(mlp_c, n_iter, batch_up=batch_up, n_batch=n_batch, bsize=bsize,
train_set=(train_x, train_y))
print "\nTheano compiled batch update version took %2.2f sec" % (time.clock() - init_t)
init_t = time.clock()
acc_train = sgd.class_acc(mlp_c.forward(train_x), train_y)[0]
acc_test = sgd.class_acc(mlp_c.forward(test_x), test_y)[0]
print "Amazon Sentiment Accuracy train: %f test: %f\n" % (acc_train, acc_test)