LXMLS 2017 - Day 5

Deep Learning I

Deep learning is the name behind the latest wave of neural network research. It is a very old topic, dating from the first half of the 20th century, that has recently attained a formidable impact in the machine learning community. There is nothing particularly difficult in deep learning: you have already visited all the mathematical principles you need in the first days of the labs of this school. At their core, deep learning models are just functions mapping vector inputs x to vector outputs y, constructed by composing linear and non-linear functions. This composition can be expressed in the form of a computation graph, where each node applies a function to its inputs and passes the result as its output. The parameters of the model are the weights given to the different inputs in nodes applying linear functions. This vaguely resembles synapse strengths in human neural networks, hence the name artificial neural networks.
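As a toy illustration of this idea (a minimal sketch with made-up dimensions, not part of the lab code), a two-layer model of this kind can be written in a few lines of Numpy:

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def softmax(z):
    # Subtract the per-column max for numerical stability
    e = np.exp(z - z.max(axis=0, keepdims=True))
    return e / e.sum(axis=0, keepdims=True)

# Made-up sizes: 4 input features, 3 hidden units, 2 output classes
W1, b1 = np.random.randn(3, 4), np.zeros((3, 1))
W2, b2 = np.random.randn(2, 3), np.zeros((2, 1))

x = np.random.randn(4, 1)    # one input, as a column vector
y = softmax(np.dot(W2, sigmoid(np.dot(W1, x) + b1)) + b2)
print y                      # a probability distribution over the 2 classes

The weight matrices and biases are the parameters to be learned; the rest of the class is about how to compute their gradients and train them efficiently.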

Due to their compositional nature, gradient methods and the chain rule can be applied to learn the parameters of these models regardless of their complexity. See Section for a refresher on the basic concepts. We will also refer to the gradient learning methods introduced in Section 1.4.4. Today we will focus on feed-forward networks. Tomorrow we will extend today's class to recurrent neural networks (RNNs).

Some of the changes that led to the surge of deep learning are not only improvements to the existing neural network algorithms, but also the increase in the amount of data available and in computing power. In particular, the use of Graphics Processing Units (GPUs) has allowed neural networks to be applied to very large datasets. Working with GPUs is not trivial, as it requires dealing with specialized hardware. Luckily, as is often the case, we are one Python import away from solving this problem. For the particular case of deep learning, there is a growing number of Python toolboxes available that allow you to design custom computation graphs for GPUs, e.g. Theano or TensorFlow.

In these labs we will be working with Theano. Theano allows us to express computation graphs symbolically in terms of basic algebraic operations. It also automatically computes gradients and produces CUDA-compatible code for GPUs. The exercises are designed to give you a low-level understanding of Theano. If you only intend to use pre-designed models, the Keras toolbox provides high-level operations compatible with both Theano and TensorFlow.
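For illustration only (a hedged sketch; Keras is not used anywhere in these labs), the kind of two-layer MLP trained below could be written in a few lines of Keras:

# Illustrative sketch only: a Keras version of a two-layer MLP.
# NOTE: Keras expects data as (num_examples, num_features), i.e. the
# transpose of the convention used in this class.
from keras.models import Sequential
from keras.layers import Dense

num_features = 13989    # e.g. the vocabulary size of the Amazon corpus below

model = Sequential()
model.add(Dense(20, input_dim=num_features, activation='sigmoid'))
model.add(Dense(2, activation='softmax'))
model.compile(optimizer='sgd', loss='categorical_crossentropy')

In these labs, however, we work at a lower level in order to understand what such toolboxes do under the hood.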

Exercise 5.1 Start by loading the Amazon sentiment corpus used on day 1.


In [1]:
import sys
sys.path.append('../../../')
import numpy as np
import lxmls.readers.sentiment_reader as srs
scr = srs.SentimentCorpus("books")
train_x = scr.train_X.T
train_y = scr.train_y[:, 0]
test_x = scr.test_X.T
test_y = scr.test_y[:, 0]



Go to lxmls/deep_learning/mlp.py and complete the grads() method of the NumpyMLP class with the Backpropagation recursion that we just saw.


In [1]:
def grads(self, x, y):
    """
    Computes the gradients of the network with respect to the cross-entropy
    error cost
    """

    # Run forward and store activations for each layer
    activations = self.forward(x, all_outputs=True)

    # For each layer in reverse store the gradients for each parameter
    nabla_params = [None] * (2*self.n_layers)

    for n in np.arange(self.n_layers-1, -1, -1):

        # Get weights and bias (always in even and odd positions).
        # Note that sometimes we need the weight from the next layer
        W = self.params[2*n]
        b = self.params[2*n+1]
        if n != self.n_layers-1:
            W_next = self.params[2*(n+1)]

        ######################
        # Solving Exercise 5.1
        ######################
        if n < self.n_layers - 1:
            # Hidden layer: backpropagate the error through the next layer's
            # weights and multiply by the sigmoid derivative. Note that
            # activations[n] corresponds to layer n+1 in the guide.
            ent = np.dot(W_next.T, ent)
            ent *= activations[n] * (1 - activations[n])
        else:  # NOTE: This assumes a cross-entropy cost
            if self.actvfunc[n] == 'sigmoid':
                ent = (activations[n] - y) / y.shape[0]
            elif self.actvfunc[n] == 'softmax':
                I = index2onehot(y, W.shape[0])
                ent = (activations[n] - I) / y.shape[0]

        # Accumulate the gradients over all examples in the batch
        nabla_W = np.zeros(W.shape)
        for l in np.arange(ent.shape[1]):
            if n == 0:
                nabla_W += np.outer(ent[:, l], x[:, l])
            else:
                nabla_W += np.outer(ent[:, l], activations[n-1][:, l])
        nabla_b = np.sum(ent, 1, keepdims=True)

        ########################
        # End of the solution 5.1
        ########################

        # Store the gradients
        nabla_params[2*n] = nabla_W
        nabla_params[2*n+1] = nabla_b

    return nabla_params

Once you are done, try different network geometries by increasing the number of layers and the layer sizes, e.g.


In [11]:
# Neural network modules
import lxmls.deep_learning.mlp as dl
import lxmls.deep_learning.sgd as sgd
# Model parameters
geometry = [train_x.shape[0], 20, 2]
actvfunc = ['sigmoid', 'softmax']
# Instantiate model
mlp = dl.NumpyMLP(geometry, actvfunc)
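For example, a network with two hidden layers could be instantiated as follows (a sketch; the layer sizes are arbitrary, and note that the later exercises assume the two-layer geometry above):

# A deeper, hypothetical geometry: two hidden layers of 20 and 10 units
geometry_deep = [train_x.shape[0], 20, 10, 2]
actvfunc_deep = ['sigmoid', 'sigmoid', 'softmax']
mlp_deep = dl.NumpyMLP(geometry_deep, actvfunc_deep)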

You can test different hyperparameters:


In [12]:
# Model parameters
n_iter = 5
bsize = 5
lrate = 0.01
# Train
sgd.SGD_train(mlp, n_iter, bsize=bsize, lrate=lrate, train_set=(train_x, train_y))
acc_train = sgd.class_acc(mlp.forward(train_x), train_y)[0]
acc_test = sgd.class_acc(mlp.forward(test_x), test_y)[0]
print "MLP (%s) Amazon Sentiment Accuracy train: %f test: %f" % (geometry, acc_train,acc_test)


Batch 320/320 (100%) Epoch  1/ 5 in 1.81 seg
Batch 320/320 (100%) Epoch  2/ 5 in 1.89 seg
Batch 320/320 (100%) Epoch  3/ 5 in 1.76 seg
Batch 320/320 (100%) Epoch  4/ 5 in 1.92 seg
Batch 320/320 (100%) Epoch  5/ 5 in 1.82 seg
 
MLP ([13989, 20, 2]) Amazon Sentiment Accuracy train: 0.964375 test: 0.780000

5.4.4 Some final reflections on Backpropagation

If you are new to neural networks, this is about the most important piece of theory you should learn about deep learning. Here are some reflections that you should keep in mind.

  • Thanks to the multi-layer structure and the chain rule, Backpropagation allows us to train models that compose linear and non-linear functions to any depth (in principle).
  • The formulas are also valid for other cost functions and output layer non-linearities, with minor modifications. It is only necessary to compute the equivalent of Eq. 5.13.
  • The formulas are also valid for hidden non-linearities other than the sigmoid. Element-wise non-linear transformations still allow the simplification in Eq. 5.21, and with little effort it is also possible to deal with other cases.
  • However, there is an important limitation: unlike for log-linear models, the optimization problem is non-convex. This removes some formal guarantees; most importantly, we can get trapped in local minima during training.

5.5 Deriving gradients and GPU code with Theano

5.5.1 An Introduction to Theano

As you may have observed, the speed of SGD training for MLPs slows down considerably when we increase the number of layers. One reason for this is that the code we use here is not very optimized: it is meant for you to learn the basic principles. But even if the code were more optimized, it would still be very slow for reasonable network sizes. The cost of computing each linear layer is proportional to the product of the dimensionalities of the previous and current layers, which in most cases will be rather large; for the geometry used above, the first linear layer alone already amounts to roughly 20 x 13989 ≈ 280,000 multiply-additions per example.

For this reason most deep learning applications use Graphics Processing Units (GPUs) in their computations. This specialized hardware is normally used to accelerate computer graphics, but can also be used for general computation-intensive tasks. However, we need to deal with specific interfaces and operations in order to use a GPU. This is where Theano comes in. Theano is a Python module for symbolic expressions over multidimensional arrays, with a focus on neural networks. It will provide us with the following nice features:

  • Symbolic expressions: Express the operations of the MLP (forward pass, cost) symbolically, as mathematical operations rather than explicit code.
  • Symbolic differentiation: As a consequence of the previous feature, we can compute gradients of arbitrary mathematical functions automatically.
  • GPU integration: The code will be ready to work on a GPU, provided that you have one and it is active within Theano. It will also be faster on normal CPUs, since the symbolic operations are compiled to C code.
  • Theano is focused on deep learning, with an active community and several tutorials easily available.

However, these benefits do not come for free. There are a number of limitations:

  • Symbolic algebra is more difficult to debug, as we cannot easily step into each operation.
  • Working with CPU and GPU code implies that we have to be more careful about the types of the variables.
  • Theano tends to output long error messages. However, once you get used to them, the error messages accurately point to the source of the problem.
  • Handling recurrent neural networks is much simpler than in Numpy, but it still implies working with constructs that are complicated to debug.

Exercise 5.2 Get acquainted with Theano. Learn the difference between a symbolic representation and a compiled function. Start by implementing the first layer of our previous MLP in Numpy.


In [13]:
# Numpy code
x = test_x # Test set
W1, b1 = mlp.params[:2] # Weights and bias of first layer
z1 = np.dot(W1, x) + b1 # Linear transformation
tilde_z1 = 1/(1+np.exp(-z1)) # Non-linear transformation

Now we will implement this in Theano. We start by creating the variables over which the operations will be defined. For example, the symbolic input is defined as


In [14]:
# Theano code.
# NOTE: We use an underscore to denote symbolic equivalents of Numpy variables.
# This is not a standard Python convention, just a convention used in this lab.
import theano
import theano.tensor as T
_x = T.matrix('x')

In [15]:
_W1 = theano.shared(value=W1, name='W1', borrow=True)
_b1 = theano.shared(value=b1, name='b1', borrow=True, broadcastable=(False, True))
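Note the broadcastable=(False, True) flag on the bias; the paragraph below explains why it is needed. As a quick sketch of the difference between Numpy and Theano behaviour (the shapes here are chosen purely for illustration):

# Numpy broadcasts automatically: a (3, 1) array is replicated along the
# second axis when added to a (3, 4) array
A = np.ones((3, 4))
c = np.ones((3, 1))
print (A + c).shape        # (3, 4)

# In Theano, a shared variable is not broadcastable by default; the
# broadcastable flag declares which axes may be stretched. Without it, the
# addition below fails at run time with a dimension mismatch error.
_A = T.matrix('A')
_c = theano.shared(np.ones((3, 1), dtype=theano.config.floatX),
                   name='c', broadcastable=(False, True))
add_fn = theano.function([_A], _A + _c)
print add_fn(np.ones((3, 4), dtype=theano.config.floatX)).shape    # (3, 4)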

Important: One of the main differences between Numpy and Theano is broadcasting. In Numpy, if we sum an array with shape (N, M) and one with shape (1, M), the second array is implicitly replicated N times to form an (N, M) matrix. This is known as broadcasting. In Theano this is not automatic: you need to declare broadcastable dimensions explicitly. This matters, for example, for a bias, which has to be replicated to match the number of examples in the batch. In other cases, such as variables used in recurrent neural networks, broadcast has to be set to False. Broadcasting is one of the typical sources of errors when you start working with Theano, so keep it in mind.

Now let's describe the operations we want to perform on these variables, again only symbolically. This is done by replacing our usual operations with Theano symbolic ones where necessary, e.g. the matrix product dot() or the sigmoid. Some operations, like +, are automatically recognized by Theano (operator overloading).


In [16]:
_z1 = T.dot(_W1, _x) + _b1
_tilde_z1 = T.nnet.sigmoid(_z1)
# Keep in mind that naming variables is useful when debugging
_z1.name = 'z1'
_tilde_z1.name = 'tilde_z1'

In [17]:
# Show computation graph
print "\nThis is my symbolic perceptron\n"
theano.printing.debugprint(_tilde_z1)


This is my symbolic perceptron

sigmoid [id A] 'tilde_z1'   
 |Elemwise{add,no_inplace} [id B] 'z1'   
   |dot [id C] ''   
   | |W1 [id D]
   | |x [id E]
   |b1 [id F]

In [18]:
# Compile
layer1 = theano.function([_x], _tilde_z1)

In [19]:
# Check that Numpy and Theano match
if np.allclose(tilde_z1, layer1(x.astype(theano.config.floatX))):
    print "\nNumpy and Theano Perceptrons are equivalent"
else:
    raise ValueError("Numpy and Theano Perceptrons are different")


Numpy and Theano Perceptrons are equivalent

5.5.2 Symbolic Forward Pass

Exercise 5.3 Complete the method _forward() inside the TheanoMLP class in lxmls/deep_learning/mlp.py. Note that this is called only once, at the initialization of the class. To debug your implementation, put a breakpoint at the __init__ function call. Hint: this is very similar to NumpyMLP.forward(). You just need to keep track of the symbolic variable representing the output of the network after each layer is applied, and compile the function at the end. After you are finished, instantiate a TheanoMLP and check that the Numpy and Theano forward passes give the same result.


In [2]:
def _forward(self, x, all_outputs=False):
    """
    Symbolic forward pass

    all_outputs = True  return symbolic input and intermediate activations
    """

    # This will store activations at each layer and the input. This is
    # needed to compute backpropagation
    if all_outputs:
        activations = [x]

    # Input
    tilde_z = x

    ##########################
    # Solution to Exercise 5.3
    ##########################
    for n in range(self.n_layers):

        # Get weights and bias (always in even and odd positions)
        W = self.params[2*n]
        b = self.params[2*n+1]

        z = T.dot(W, tilde_z) + b # Linear transformation

        # see e.g. theano.printing.debugprint(tilde_z)
        z.name = 'z%d' % (n+1)

        # Non-linear transformation
        if self.actvfunc[n] == "sigmoid":
            tilde_z = T.nnet.sigmoid( z )
        elif self.actvfunc[n] == "softmax":
            tilde_z = T.nnet.softmax( z.T ).T

        tilde_z.name = 'tilde_z%d' % (n+1) # Name variable

        if all_outputs:
            activations.append(tilde_z)
    #################################
    # End of solution to Exercise 5.3
    #################################

    if all_outputs:
        tilde_z = activations

    return tilde_z

In [21]:
mlp_a = dl.NumpyMLP(geometry, actvfunc)
mlp_b = dl.TheanoMLP(geometry, actvfunc)

To help with debugging in Theano, it is sometimes useful to switch off the optimizer. This helps Theano point out which part of the Python code generated the error:


In [22]:
theano.config.optimizer='None'

In [24]:
assert np.allclose(mlp_a.forward(test_x), mlp_b.forward(test_x)),"ERROR: Numpy and Theano forward passes differ"

5.5.3 Symbolic Differentiation

Exercise 5.4 We first see an example that does not use any of the code in TheanoMLP, but rather continues from what you wrote in Ex. 5.2. In that exercise you completed a sigmoid layer with Theano. To get some values for the weights, we used the first layer of the network you trained earlier. Now we are going to use the second layer as well. This assumes that the network has only two layers, e.g. the recommended geometry (I, 20, 2). Make sure this is the case before starting this exercise.

For the sake of clarity, let's reproduce here the part of Ex. 5.2 that we had already completed:


In [25]:
# Get the values from our MLP
W1, b1 = mlp.params[:2] # Weights and bias of first layer
# First layer symbolic variables
_x = T.matrix('x')
_W1 = theano.shared(value=W1, name='W1', borrow=True)
_b1 = theano.shared(value=b1, name='b1', borrow=True, broadcastable=(False, True))
# First layer symbolic expressions
_z1 = T.dot(_W1, _x) + _b1
_tilde_z1 = T.nnet.sigmoid(_z1)

Now we just need to complete this with the second layer, using a softmax non-linearity


In [26]:
W2, b2 = mlp.params[2:] # Weights and bias of second (and last!) layer
# Second layer symbolic variables
_W2 = theano.shared(value=W2, name='W2', borrow=True)
_b2 = theano.shared(value=b2, name='b2', borrow=True, broadcastable=(False, True))
# Second layer symbolic expressions
_z2 = T.dot(_W2, _tilde_z1) + _b2
# NOTE: Theano's T.nnet.softmax has no axis argument and operates on rows, hence the transposes as a workaround
_tilde_z2 = T.nnet.softmax(_z2.T).T
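As an optional sanity check (a sketch, not required by the exercise), this symbolic output can already be compiled into a function and compared against the Numpy forward pass of the trained network:

# Optional check: compile the full symbolic forward pass and compare it
# with the Numpy forward pass of the same MLP
forward = theano.function([_x], _tilde_z2)
assert np.allclose(forward(test_x.astype(theano.config.floatX)),
                   mlp.forward(test_x)), "Forward passes differ"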

With this we could compile a function to obtain the output of the network, _tilde_z2, for a given input _x (as sketched above). In this exercise, however, we are interested in obtaining the cost given in Eq. 5.5. First we are going to need the symbolic variable for the correct output:


In [27]:
_y = T.ivector('y')

The posterior probability of the correct class given the input is obtained by selecting the k(m)-th softmax output, where k(m) is the index of the correct class for example xm. If we want to compute the average minus log-posterior over a vector y containing M different examples, we can write this as


In [29]:
_F = -T.mean(T.log(_tilde_z2[_y, T.arange(_y.shape[0])]))
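To see what this indexing is doing, here is a Numpy sketch of the same computation, using the forward pass of the trained MLP and the test labels loaded at the start of the class:

# Numpy equivalent (sketch): pick, for each example m, the probability
# assigned to its correct class k(m), then average the minus log
p = mlp.forward(test_x)            # shape (2, num_test_examples)
k = test_y.astype(int)             # correct class index for each example
F = -np.mean(np.log(p[k, np.arange(k.shape[0])]))
print F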

Now obtaining a function that computes the gradient could not be easier


In [30]:
_nabla_F = T.grad(_F, _W1)
nabla_F = theano.function([_x, _y], _nabla_F)
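For instance (a quick sanity-check sketch), evaluating this function on the test data should return an array with the same shape as W1. Note the casts to the dtypes expected by the symbolic variables:

# Gradient of the cost with respect to W1, evaluated on the test set
grad_W1 = nabla_F(test_x.astype(theano.config.floatX), test_y.astype('int32'))
print grad_W1.shape, W1.shape    # both should be (20, num_features)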

5.5.4 Symbolic mini-batch update

Exercise 5.5 Define the updates list. This is a list where each element is a tuple of a parameter and the update rule to be applied to that parameter. In this case we are defining the SGD update rule, but take into account that using more complex update rules, e.g. momentum or Adam, implies just replacing the update expression in the snippet below.
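For reference, the plain SGD updates list has the following form; the same construction appears in the full training snippet further below, so the names here (mlp, _F, lrate) stand for whatever MLP instance, symbolic cost and learning rate are in use. A hypothetical momentum variant would only change the update expression:

# Plain SGD: one (parameter, updated value) tuple per parameter
updates = [(par, par - lrate * T.grad(_F, par)) for par in mlp.params]

# Hypothetical momentum variant (sketch): keep a velocity per parameter
# velocities = [theano.shared(np.zeros_like(p.get_value())) for p in mlp.params]
# updates = []
# for par, vel in zip(mlp.params, velocities):
#     new_vel = 0.9 * vel - lrate * T.grad(_F, par)
#     updates += [(vel, new_vel), (par, par + new_vel)]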


In [31]:
W2, b2 = mlp_a.params[2:4]

# Second layer symbolic variables
_W2 = theano.shared(value=W2, name='W2', borrow=True)
_b2 = theano.shared(value=b2, name='b2', borrow=True,
                    broadcastable=(False, True))
_z2 = T.dot(_W2, _tilde_z1) + _b2
_tilde_z2 = T.nnet.softmax(_z2.T).T

# Ground truth
_y = T.ivector('y')

# Cost
_F = -T.mean(T.log(_tilde_z2[_y, T.arange(_y.shape[0])]))

# Gradient
_nabla_F = T.grad(_F, _W1)
nabla_F = theano.function([_x, _y], _nabla_F)

# Print computation graph
print "\nThis is my softmax classification cost\n"
theano.printing.debugprint(_F)


This is my softmax classification cost

Elemwise{neg,no_inplace} [id A] ''   
 |Elemwise{true_div,no_inplace} [id B] 'mean'   
   |Sum{acc_dtype=float64} [id C] ''   
   | |Elemwise{log,no_inplace} [id D] ''   
   |   |AdvancedSubtensor [id E] ''   
   |     |InplaceDimShuffle{1,0} [id F] ''   
   |     | |Softmax [id G] ''   
   |     |   |InplaceDimShuffle{1,0} [id H] ''   
   |     |     |Elemwise{add,no_inplace} [id I] ''   
   |     |       |dot [id J] ''   
   |     |       | |W2 [id K]
   |     |       | |sigmoid [id L] ''   
   |     |       |   |Elemwise{add,no_inplace} [id M] ''   
   |     |       |     |dot [id N] ''   
   |     |       |     | |W1 [id O]
   |     |       |     | |x [id P]
   |     |       |     |b1 [id Q]
   |     |       |b2 [id R]
   |     |y [id S]
   |     |ARange{dtype='int64'} [id T] ''   
   |       |TensorConstant{0} [id U]
   |       |Subtensor{int64} [id V] ''   
   |       | |Shape [id W] ''   
   |       | | |y [id S]
   |       | |Constant{0} [id X]
   |       |TensorConstant{1} [id Y]
   |Subtensor{int64} [id Z] ''   
     |Elemwise{Cast{float64}} [id BA] ''   
     | |Shape [id BB] ''   
     |   |Elemwise{log,no_inplace} [id D] ''   
     |Constant{0} [id BC]

In [35]:
import time

# Understanding the mini-batch function and givens/updates parameters

# Numpy
geometry = [train_x.shape[0], 20, 2]
actvfunc = ['sigmoid', 'softmax']
mlp_a = dl.NumpyMLP(geometry, actvfunc)
#
init_t = time.clock()
sgd.SGD_train(mlp_a, n_iter, bsize=bsize, lrate=lrate, train_set=(train_x, train_y))
print "\nNumpy version took %2.2f sec" % (time.clock() - init_t)
acc_train = sgd.class_acc(mlp_a.forward(train_x), train_y)[0]
acc_test = sgd.class_acc(mlp_a.forward(test_x), test_y)[0]
print "Amazon Sentiment Accuracy train: %f test: %f\n" % (acc_train, acc_test)

# Theano grads
mlp_b = dl.TheanoMLP(geometry, actvfunc)
init_t = time.clock()
sgd.SGD_train(mlp_b, n_iter, bsize=bsize, lrate=lrate, train_set=(train_x, train_y))
print "\nCompiled gradient version took %2.2f sec" % (time.clock() - init_t)
acc_train = sgd.class_acc(mlp_b.forward(train_x), train_y)[0]
acc_test = sgd.class_acc(mlp_b.forward(test_x), test_y)[0]
print "Amazon Sentiment Accuracy train: %f test: %f\n" % (acc_train, acc_test)

# Theano compiled batch

# Cast data into the types and shapes used in the theano graph
# IMPORTANT: This is the main source of errors when beginning with theano
train_x = train_x.astype(theano.config.floatX)
train_y = train_y.astype('int32')

# Model
mlp_c = dl.TheanoMLP(geometry, actvfunc)

# Define the symbolic variables to be used in the batch update

# Define the updates variable. This is a list of gradient descent updates,
# in the format of the theano.function updates parameter: a list of tuples,
# each pairing a parameter with its update rule
_x = T.matrix('x')
_y = T.ivector('y')
_F = mlp_c._cost(_x, _y)
updates = [(par, par - lrate*T.grad(_F, par)) for par in mlp_c.params]

#
# Define the batch update function. This will return the cost of each batch
# and update the MLP parameters at the same time using updates
batch_up = theano.function([_x, _y], _F, updates=updates)
n_batch = int(np.ceil(float(train_x.shape[1])/bsize)) 
#

init_t = time.clock()
sgd.SGD_train(mlp_c, n_iter, batch_up=batch_up, n_batch=n_batch, bsize=bsize,
              train_set=(train_x, train_y))
print "\nTheano compiled batch update version took %2.2f sec" % (time.clock() - init_t)
init_t = time.clock()

acc_train = sgd.class_acc(mlp_c.forward(train_x), train_y)[0]
acc_test = sgd.class_acc(mlp_c.forward(test_x), test_y)[0]
print "Amazon Sentiment Accuracy train: %f test: %f\n" % (acc_train, acc_test)


Batch 320/320 (100%) Epoch  1/ 5 in 1.64 seg
Batch 320/320 (100%) Epoch  2/ 5 in 1.55 seg
Batch 320/320 (100%) Epoch  3/ 5 in 1.59 seg
Batch 320/320 (100%) Epoch  4/ 5 in 1.64 seg
Batch 320/320 (100%) Epoch  5/ 5 in 1.60 seg
 

Numpy version took 8.02 sec
Amazon Sentiment Accuracy train: 0.964375 test: 0.780000

Batch 320/320 (100%) Epoch  1/ 5 in 1.80 seg
Batch 320/320 (100%) Epoch  2/ 5 in 1.86 seg
Batch 320/320 (100%) Epoch  3/ 5 in 1.88 seg
Batch 320/320 (100%) Epoch  4/ 5 in 1.89 seg
Batch 320/320 (100%) Epoch  5/ 5 in 1.89 seg
 

Compiled gradient version took 9.32 sec
Amazon Sentiment Accuracy train: 0.964375 test: 0.780000

Batch 320/320 (100%) Epoch  1/ 5 in 0.92 seg
Batch 320/320 (100%) Epoch  2/ 5 in 1.01 seg
Batch 320/320 (100%) Epoch  3/ 5 in 0.99 seg
Batch 320/320 (100%) Epoch  4/ 5 in 1.00 seg
Batch 320/320 (100%) Epoch  5/ 5 in 1.01 seg
 

Theano compiled batch update version took 4.93 sec
Amazon Sentiment Accuracy train: 0.964375 test: 0.780000