Express Deep Learning in Python: Advanced Layers

The Dense layer is only one of the core layers available in Keras. Dense is a forward layer, i.e. one that takes an input and applies some transformation to it (in this case a matrix multiplication).

Other important layers to consider are: activation layers, regularization layers, dropout layers, convolutional layers, pooling layers, recurrent layers, normalization layers, embedding layers, noise layers, etc.

For this tutorial we will focus on some layers to aid in the tuning of the network: activations, regularizers and dropout; as well as the layers needed to design convolutional neural networks: convolutional and pooling layers.

We will point out other tutorials and examples to learn about the other kinds of layers at the end of this tutorial.


In [1]:
import numpy
import keras

from keras import backend as K
from keras import losses, optimizers, regularizers
from keras.datasets import mnist
from keras.layers import Activation, ActivityRegularization, Conv2D, Dense, Dropout, Flatten, MaxPooling2D
from keras.models import Sequential
from keras.utils.np_utils import to_categorical


Using TensorFlow backend.

Activation Functions

A neural network classifier with linear activations has no more representation power than a logistic regression classifier. In order to express non-linearity with a neural network model a non-linear function is needed as activation function for each neuron.

One simple activation function to use is the sigmoid (or logistic) function, the same one used in the logistic regression algorithm, which restricts the output value to be between zero and one. This was one of the most common nonlinearities used as an activation function in early neural networks. There are, however, other possibilities (all of the following are available in Keras, and others can be adapted):

  • rectified linear unit (ReLU)
  • tanh
  • hard sigmoid
  • softsign
  • softplus
  • exponential linear unit (elu)
  • scaled exponential linear unit (selu)
  • leaky rectifier linear unit (Leaky ReLU)
  • parametric rectified linear unit (PReLU)

Activation Functions Examples

Source: https://ujjwalkarn.me/2016/08/09/quick-intro-neural-networks/

Of these, the one most used in present state-of-the-art neural network classifiers is the ReLU, defined as $f(x) = \max(0, x)$, because it typically learns much faster in networks with many layers [1].

Another important activation is the SoftMax function, generally used as the last activation layer, i.e. as the output of the network. This function, also known as the normalized exponential function, is a generalization of the logistic function that "squashes" a K-dimensional vector $\mathbf{z}$ of arbitrary real values into a K-dimensional vector $\sigma(\mathbf{z})$ of real values in the range [0, 1] that add up to 1.
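
As a quick illustration (not part of the Keras API), the softmax of a vector can be computed directly with NumPy:


In [ ]:
import numpy

def softmax(z):
    # subtract the maximum for numerical stability; the result is unchanged
    e = numpy.exp(z - numpy.max(z))
    return e / e.sum()

print(softmax(numpy.array([2.0, 1.0, 0.1])))  # approximately [0.659 0.242 0.099], sums to 1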

Activation Functions in Keras

Keras provides two ways to define an activation function. Both methods are equally valid.

Activation as a parameter of a forward layer


In [ ]:
model = Sequential()
model.add(Dense(64, input_shape=(784,), activation='relu'))
model.add(Dense(10, activation='softmax'))

Activation as a layer


In [ ]:
model = Sequential()
model.add(Dense(64, input_shape=(784,)))
model.add(Activation('tanh'))
model.add(Dense(10))
model.add(Activation('softmax'))

Activation from a TensorFlow function

In the previous examples we used some of the available functions in the Keras library.

We can also use an element-wise TensorFlow function as activation.


In [ ]:
model = Sequential()
model.add(Dense(64, input_shape=(784,),
                activation=K.sigmoid))
model.add(Dense(10, activation='softmax'))

Regularizers

Regularizers allow you to apply penalties on layer parameters or layer activity during optimization. These penalties are incorporated in the loss function that the network optimizes. The penalties are applied on a per-layer basis.

The regularizers can be applied to three parameters:

  • Weight/kernel matrix regularization: Applies the regularizer function to the weight matrix (called kernel matrix in Keras documentation).
  • Bias regularization: Applies the regularizer to the bias vector.
  • Activity regularization: Applies the regularizer to the layer output (i.e. the result of the activation function).

There are three penalties already available in Keras to apply as regularizers (but the API also permits the definition of a custom regularizer, as sketched further below) [2]: l1, l2 and elastic net.

Regularizers in Keras

As with activation functions, there are two ways to use a regularizer in Keras, although not every parameter can be regularized both ways.

Regularization as parameter of a layer

This is the most practical way and the only one which allows the individual regularization of each available parameter.

The regularizer is given as a parameter of the layers (e.g. Dense):

  • kernel_regularizer: Regularization of the weight matrix.
  • bias_regularizer: Regularization of the bias vector.
  • activity_regularizer: Regularization of the total output.

The available penalties for this case are:

  • keras.regularizers.l1: L1 norm or "sum of absolute weights".
  • keras.regularizers.l2: L2 norm or "sum of squared weights".
  • keras.regularizers.l1_l2: Linear combination of the L1 and L2 penalties, or "elastic net" regularization.

For more information on the difference between L1 and L2 see [5].


In [ ]:
model = Sequential()
model.add(Dense(64, input_shape=(784,),
                activation='relu',
                kernel_regularizer=regularizers.l2(0.01),
                activity_regularizer=regularizers.l1(0.01)))
model.add(Dense(10, activation='softmax'))
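
As mentioned above, the API also accepts a custom regularizer: any callable that takes the weight tensor and returns a scalar penalty. A minimal sketch, assuming the backend functions imported as K in the first cell (the 0.01 scaling factor is arbitrary):


In [ ]:
def custom_l1(weight_matrix):
    # sum of absolute values, scaled: this reproduces an L1 penalty
    return 0.01 * K.sum(K.abs(weight_matrix))

model = Sequential()
model.add(Dense(64, input_shape=(784,),
                activation='relu',
                kernel_regularizer=custom_l1))
model.add(Dense(10, activation='softmax'))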

Regularization as a layer

The core layer ActivityRegularization is another way to apply regularization, in this case (as the name indicates) only to the layer output (not to the weight matrix or the bias vector).


In [ ]:
model = Sequential()
model.add(Dense(64, input_shape=(784,), activation='relu'))
model.add(ActivityRegularization(l1=0.01, l2=0.1))
model.add(Dense(10, activation='softmax'))

Dropout

These are special layers, useful for regularization, which randomly drop (i.e. set to zero) units of the neural network during training. This prevents units from co-adapting too much to the input [3].

Keras has a special Dropout layer which can be added to a sequential model. It takes a rate value, between 0 and 1, and during training sets that fraction of its input units to 0.


In [ ]:
model = Sequential()
model.add(Dense(64, input_shape=(784,), activation='relu'))
model.add(Dropout(0.5))
model.add(Dense(10, activation='softmax'))

Convolutional Neural Networks

CNNs were responsible for major breakthroughs in Image Classification and are the core of most Computer Vision systems today, from Facebook's automated photo tagging to self-driving cars [6].

What is convolution?

A simple way to think about it is as a sliding window function applied to a matrix:

Source: http://deeplearning.stanford.edu/wiki/index.php/Feature_extraction_using_convolution

Imagine that the matrix on the left represents a black and white image. Each entry corresponds to one pixel, 0 for black and 1 for white (typically values are between 0 and 255 for grayscale images). The sliding window is called a kernel, filter, or feature detector. Here we use a 3×3 filter, multiply its values element-wise with the original matrix, then sum them up. To get the full convolution we do this for each element by sliding the filter over the whole matrix.
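
To make the sliding-window idea concrete, here is a deliberately naive NumPy sketch of this operation (the input matrix and filter values are made up for illustration; note that, as is usual in deep learning, the filter is not flipped, so strictly speaking this is a cross-correlation):


In [ ]:
import numpy

def convolve2d(image, kernel):
    # slide the kernel over the image, multiply element-wise and sum
    kh, kw = kernel.shape
    out_h = image.shape[0] - kh + 1
    out_w = image.shape[1] - kw + 1
    output = numpy.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            output[i, j] = (image[i:i+kh, j:j+kw] * kernel).sum()
    return output

image = numpy.array([[1, 1, 1, 0, 0],
                     [0, 1, 1, 1, 0],
                     [0, 0, 1, 1, 1],
                     [0, 0, 1, 1, 0],
                     [0, 1, 1, 0, 0]])
kernel = numpy.array([[1, 0, 1],
                      [0, 1, 0],
                      [1, 0, 1]])
print(convolve2d(image, kernel))  # 3x3 "convolved feature" map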

There are different uses for a convolution, particularly in images: averaging each pixel with its neighboring values blurs an image; taking the difference between a pixel and its neighbors detects edges; etc. For a better understanding of how a convolution works we recommend Chris Olah's post.

What are convolutional neural networks?

CNNs are basically just several layers of convolutions with nonlinear activation functions (e.g. ReLU or tanh) applied to the results.

In a traditional feedforward neural network we connect each input neuron to each output neuron in the next layer. These are fully connected layers (or Dense layers). In CNNs we don't do that. Instead, we use convolutions over the input layer to compute the output. This results in local connections, where each region of the input is connected to a neuron in the output. Each layer applies different filters, typically hundreds or thousands like the ones shown above, and combines their results.

During the training phase, a CNN automatically learns the values of its filters based on the task you want to perform. For example, in Image Classification a CNN may learn to detect edges from raw pixels in the first layer, then use the edges to detect simple shapes in the second layer, and then use these shapes to detect higher-level features, such as facial shapes, in higher layers. The last layer is then a classifier that uses these high-level features.

Source: http://www.wildml.com/2015/11/understanding-convolutional-neural-networks-for-nlp/

CNN Hyperparameters

Narrow vs. wide convolution

Applying a 3x3 filter at the center of the matrix works fine, but what about the edges? How would you apply the filter to the first element of a matrix that doesn't have any neighboring elements to the top and left? You can use zero-padding. All elements that would fall outside of the matrix are taken to be zero. By doing this you can apply the filter to every element of your input matrix, and get a larger or equally sized output. Adding zero-padding is also called wide convolution, and not using zero-padding would be a narrow convolution.

Source: http://www.wildml.com/2015/11/understanding-convolutional-neural-networks-for-nlp/

The previous example shows the difference between narrow and wide convolution in one dimension, for an input size of 7 and a filter size of 5.
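
In Keras this choice roughly corresponds to the padding argument of the convolutional layers: padding='valid' gives a narrow convolution, while padding='same' zero-pads the input so that the output keeps the same size, which is one form of wide convolution. A minimal sketch (the layer sizes are arbitrary):


In [ ]:
narrow = Sequential()
narrow.add(Conv2D(16, kernel_size=(5, 5), padding='valid', input_shape=(28, 28, 1)))
print(narrow.output_shape)  # (None, 24, 24, 16): the output shrinks

wide = Sequential()
wide.add(Conv2D(16, kernel_size=(5, 5), padding='same', input_shape=(28, 28, 1)))
print(wide.output_shape)    # (None, 28, 28, 16): zero-padding keeps the size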

Stride size

Another hyperparameter for your convolutions is the stride size, which defines by how much you want to shift your filter at each step. A larger stride size leads to fewer applications of the filter and a smaller output size. The typical stride size is 1. The following example shows the different outputs of a convolution for different stride sizes (stride size of 1 vs. stride size of 2).

Source: http://cs231n.github.io/convolutional-networks/
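
Without zero-padding, the output size of a convolution is floor((n - f) / s) + 1, where n is the input size, f the filter size and s the stride. A quick check of this formula, assuming for illustration an input of size 7 and a filter of size 3:


In [ ]:
def conv_output_size(n, f, s):
    # number of filter applications for a convolution without zero-padding
    return (n - f) // s + 1

print(conv_output_size(7, 3, 1))  # 5 (stride size of 1)
print(conv_output_size(7, 3, 2))  # 3 (stride size of 2)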

Channels

Channels are different "views" of your input data. For example, in image recognition you typically have RGB (red, green, blue) channels. You can apply convolutions across channels, either with different or equal weights.

Pooling layers

A key aspect of Convolutional Neural Networks are pooling layers, typically applied after the convolutional layers. Pooling layers subsample their input. The most common way to do pooling is to apply a max operation to the result of each filter. You don't necessarily need to pool over the complete matrix, you could also pool over a window. For example, the following shows max pooling for a 2x2 window:

Source: http://cs231n.github.io/convolutional-networks/#pool

One property of pooling is that it provides a fixed size output matrix, which typically is required for classification. For example, if you have 1,000 filters and you apply max pooling to each, you will get a 1000-dimensional output, regardless of the size of your filters or the size of your input. This allows you to use variable size inputs and variable size filters, but always get the same output dimensions to feed into a classifier. Pooling also reduces the output dimensionality but (hopefully) keeps the most salient information. You can think of each filter as detecting a specific feature.
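
A minimal NumPy sketch of max pooling over non-overlapping 2x2 windows, with a made-up 4x4 feature map:


In [ ]:
import numpy

def max_pool_2x2(feature_map):
    # take the maximum of each non-overlapping 2x2 window
    h, w = feature_map.shape
    return feature_map.reshape(h // 2, 2, w // 2, 2).max(axis=(1, 3))

feature_map = numpy.array([[1, 1, 2, 4],
                           [5, 6, 7, 8],
                           [3, 2, 1, 0],
                           [1, 2, 3, 4]])
print(max_pool_2x2(feature_map))  # [[6 8]
                                  #  [3 4]]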

CNNs in Keras

Keras has many different kinds of convolutional layers. The most commonly used for doing spatial convolution over images is keras.layers.convolutional.Conv2D. The layer takes as arguments the number of output filters in the convolution, the size of the 2D convolution window, the strides of the convolution and the padding.

Keras also ships with many different pooling layers. For spatial data the layer is keras.layers.pooling.MaxPooling2D. This layer takes the pool size and the data format. The data format corresponds to whether the channels are the first dimension (i.e. the input has shape (batch, channels, height, width)) or the last (i.e. the input has shape (batch, height, width, channels)); the latter is the default for Keras with a TensorFlow backend.

Finally, there is keras.layers.core.Flatten(), a layer which takes no parameters and serves as the connection between the convolutional layers and the dense layers: it flattens the input to one dimension (without affecting the batch size, i.e. the number of examples used for training/classifying).


In [ ]:
model = Sequential()

# input: 100x100 images with 3 channels -> (100, 100, 3) tensors.
# this applies 32 convolution filters of size 3x3 each.

model.add(Conv2D(32, kernel_size=(3, 3), activation='relu', input_shape=(100, 100, 3)))
model.add(Conv2D(32, kernel_size=(3, 3), activation='relu'))
model.add(MaxPooling2D(pool_size=(2, 2)))
model.add(Dropout(0.25))

model.add(Flatten())
model.add(Dense(256, activation='relu'))
model.add(Dropout(0.5))
model.add(Dense(10, activation='softmax'))

Compiling the model: loss functions and optimizers

When compiling a model there are two important parameters: the loss function and the optimizer algorithm. Both of them depend on the problem and can change the performance of the model.

Loss function

Also known as the objective function, the loss function is the function we want to optimize when training the model (that is, to find its minimum). Depending on the task (whether it is classification or regression), and some other parameters, the objective function can change. Two of the most popular objective functions are the mean squared error for regression and the categorical crossentropy for classification. Keras brings a number of different loss functions already available [4], but for this course we will be using only the categorical crossentropy (since we have a classification task to work with).
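
For reference, for a single sample with one-hot target vector $\mathbf{y}$ and predicted probability vector $\hat{\mathbf{y}}$ over $K$ classes, the categorical crossentropy is $-\sum_{k=1}^{K} y_k \log \hat{y}_k$, which reduces to $-\log \hat{y}_c$ where $c$ is the true class of the sample.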

Optimizer

The optimizer algorithm is the method used to find the minimum of the loss function. As with loss functions, there are many optimizers already packaged with Keras. One of the most popular algorithms is the stochastic gradient descent (SGD) optimizer, which is also one of the simplest to understand. However, in this tutorial we will be exploring other optimizers (e.g. RMSProp, Adam, Adadelta, etc.), which often give better results.

Loss function and optimizer in Keras

In Keras, it is the .compile() method of a model that takes as parameters the loss function and the optimizer. The parameters can either be instances of a loss function (e.g. keras.losses.hinge) or an optimizer (e.g. keras.optimizers.RMSprop), or a string referring to the loss function/optimizer by name.

In the case of loss functions, the advantage of using a function object is the possibility of passing a custom-defined loss function besides the ones given by Keras. E.g. you can pass a TensorFlow symbolic function that takes two arguments, the true labels and the predicted labels, and returns a scalar for each data-point.
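
A minimal sketch of such a custom loss, using the backend functions imported as K in the first cell (it simply re-implements the mean squared error, so it is only illustrative):


In [ ]:
def custom_mse(y_true, y_pred):
    # returns one scalar per data-point; Keras averages it over the batch
    return K.mean(K.square(y_pred - y_true), axis=-1)

model = Sequential()
model.add(Dense(64, input_shape=(784,), activation='relu'))
model.add(Dense(10, activation='softmax'))
model.compile(loss=custom_mse, optimizer='adam')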

For optimizers, the main difference between an instance and a string is that in the latter case the optimizer will use its default parameter values. In addition, there is a wrapper class (keras.optimizers.TFOptimizer) for native TensorFlow optimizers.

Loss function/optimizer as a string


In [ ]:
model = Sequential()
model.add(Dense(64, input_shape=(784,), activation='relu'))
model.add(Dropout(0.5))
model.add(Dense(10, activation='softmax'))
model.compile(loss='categorical_crossentropy', optimizer='adam')

Loss function/optimizer as an instance


In [ ]:
# Simple 1 layer denoising autoencoder

model = Sequential()
model.add(Dense(200, input_shape=(784,), activation='relu'))
model.add(Dropout(0.5))
model.add(Dense(784))

sgd = optimizers.SGD(lr=0.01, decay=1e-6, momentum=0.9, nesterov=True)
model.compile(loss=losses.mean_squared_error, optimizer=sgd)
Categorical format

When using a loss function for classification (e.g. the categorical crossentropy) with more than 2 classes, Keras requires the targets to be in categorical format (e.g. if you have 10 classes, the target for each sample should be a 10-dimensional vector that is all zeros except for a 1 at the index corresponding to the class of the sample). In order to convert integer targets into categorical targets, you can use the Keras utility keras.utils.np_utils.to_categorical, which transforms a vector of integers into a matrix of one-hot encoded representations.
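
A small usage example, relying on the numpy and to_categorical imports from the first cell (the labels are made up):


In [ ]:
labels = numpy.array([0, 2, 1, 2])  # integer class labels
print(to_categorical(labels, 3))    # each row is the one-hot encoding of the corresponding label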

Wrapping up

Finally, to end the tutorial, we use what we have learned so far to create a new classifier for the MNIST dataset.


In [2]:
batch_size = 128
num_classes = 10
epochs = 10
TRAIN_EXAMPLES = 20000
TEST_EXAMPLES = 5000

# image dimensions
img_rows, img_cols = 28, 28

# load the data (already shuffled and split)
(x_train, y_train), (x_test, y_test) = mnist.load_data()

# reshape the data to add the "channels" dimension
x_train = x_train.reshape(x_train.shape[0], img_rows, img_cols, 1)
x_test = x_test.reshape(x_test.shape[0], img_rows, img_cols, 1)
input_shape = (img_rows, img_cols, 1)

# to make quick runs, select a smaller set of images.
train_mask = numpy.random.choice(x_train.shape[0], TRAIN_EXAMPLES, replace=False)
x_train = x_train[train_mask, :].astype('float32')
y_train = y_train[train_mask]
test_mask = numpy.random.choice(x_test.shape[0], TEST_EXAMPLES, replace=False)
x_test = x_test[test_mask, :].astype('float32')
y_test = y_test[test_mask]

# normalize the input in the range [0, 1]
x_train /= 255
x_test /= 255

print('Train samples: %d' % x_train.shape[0])
print('Test samples: %d' % x_test.shape[0])

# convert class vectors to binary class matrices
y_train = to_categorical(y_train, num_classes)
y_test = to_categorical(y_test, num_classes)

# define the network architecture
model = Sequential()
model.add(Conv2D(filters=16,
                 kernel_size=(3, 3),
                 strides=(1,1),
                 padding='valid',
                 activation='relu',
                 input_shape=input_shape,
                 activity_regularizer='l2'))
model.add(Conv2D(32, (3, 3), activation='relu'))
model.add(MaxPooling2D(pool_size=(2, 2)))
model.add(Dropout(0.25))
model.add(Flatten())
model.add(Dense(128, activation='relu'))
model.add(Dropout(0.5))
model.add(Dense(num_classes, activation='softmax'))

# compile the model
model.compile(loss=losses.categorical_crossentropy,
              optimizer=optimizers.RMSprop(),
              metrics=['accuracy'])

# train the model
model.fit(x_train, y_train,
          batch_size=batch_size,
          epochs=epochs,
          verbose=1,
          validation_data=(x_test, y_test))

# evaluate the model
score = model.evaluate(x_test, y_test, verbose=0)

print('Test loss: %.2f' % score[0])
print('Test accuracy: %.2f' % (100. * score[1]))


Train samples: 20000
Test samples: 5000
Train on 20000 samples, validate on 5000 samples
Epoch 1/10
20000/20000 [==============================] - 11s - loss: 13.4005 - acc: 0.7447 - val_loss: 0.8993 - val_acc: 0.8636
Epoch 2/10
20000/20000 [==============================] - 11s - loss: 0.7521 - acc: 0.8427 - val_loss: 0.5580 - val_acc: 0.8974
Epoch 3/10
20000/20000 [==============================] - 11s - loss: 0.5090 - acc: 0.8902 - val_loss: 0.3500 - val_acc: 0.9392
Epoch 4/10
20000/20000 [==============================] - 11s - loss: 0.3863 - acc: 0.9179 - val_loss: 0.2513 - val_acc: 0.9524
Epoch 5/10
20000/20000 [==============================] - 11s - loss: 0.3118 - acc: 0.9325 - val_loss: 0.2782 - val_acc: 0.9454
Epoch 6/10
20000/20000 [==============================] - 11s - loss: 0.2664 - acc: 0.9420 - val_loss: 0.1942 - val_acc: 0.9560
Epoch 7/10
20000/20000 [==============================] - 11s - loss: 0.2383 - acc: 0.9466 - val_loss: 0.2401 - val_acc: 0.9348
Epoch 8/10
20000/20000 [==============================] - 11s - loss: 0.2206 - acc: 0.9492 - val_loss: 0.1691 - val_acc: 0.9648
Epoch 9/10
20000/20000 [==============================] - 11s - loss: 0.2022 - acc: 0.9539 - val_loss: 0.1709 - val_acc: 0.9684
Epoch 10/10
20000/20000 [==============================] - 11s - loss: 0.1888 - acc: 0.9555 - val_loss: 0.1318 - val_acc: 0.9748
Test loss: 0.10
Test accuracy: 97.48

References