Keras is a high-level neural networks library for Python capable of running on top of TensorFlow, Theano and other lower-level frameworks. What makes Keras special is that it is extremely user friendly: its syntax focuses on the big ideas and takes care of a lot of the detailed plumbing that can make these topics seem extremely complicated. For these reasons it is the ideal deep-learning package for beginners (although the fast experimentation enabled by its simplicity has made it popular among serious researchers too).
The keras package is broken down into multiple parts, each covering a different part of the neural network pipeline:
- keras.models: This governs the overall architecture of the neural network. In our case, data flows from a single input straight through to the output, so we use a Sequential model. Other model types, for recurrent networks etc., can also be found here.
- keras.layers: Neural networks are built up as a sequence of layers. These layers can be of many types: fully connected (Dense) layers, as we use here, and 2D convolutional (Conv2D) layers, which you will see soon, are two well-known examples.
- keras.optimizers: This is what Keras uses to learn the neural network weights etc. from the training data.
- keras.datasets: Like scikit-learn, Keras comes with several popular machine learning data sets to make it easy to benchmark and test our neural network architectures.
- keras.utils: Various utility functions to make our lives easier. Here we will use a function to draw a diagram depicting our network.
In [1]:
# Let's import the various libraries we need
from __future__ import print_function
import keras
from keras.datasets import mnist
from keras.models import Sequential
from keras.layers import Dense, Dropout
from keras.optimizers import RMSprop
from keras.utils.vis_utils import plot_model
import numpy as np
import matplotlib.pyplot as plt
from IPython.display import Image
For this exercise (and the next one) we will be using the classic MNIST dataset. Like the digits data you used earlier, these are pictures of numbers. The big difference is that the pictures are bigger (28x28 pixels rather than 8x8) and we will be looking at a much larger data set: 60,000 training images and 10,000 test images. See sample pictures by running the code below.
Data structure:
- The images are stored as rows of a matrix x_train of dimensionality 60,000 x 784 (and a 10,000 x 784 matrix x_test for the test data); each 28x28 image is flattened into a 784-dimensional vector.
- The labels are stored in y_train and y_test; each label is a number between 0 and 9 corresponding to the true digit. Thus num_classes=10.
- After the conversion performed in the code below, each label in y_train and y_test is represented by a 10-dimensional vector, with a 1 corresponding to the correct label and all the others zero.
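As a quick illustration of this one-hot encoding, here is a minimal sketch using the same keras.utils.to_categorical helper that the loading code below relies on:

# One-hot encoding illustration: the digits 0, 3 and 9 become
# 10-dimensional vectors with a single 1 in the matching position.
print(keras.utils.to_categorical([0, 3, 9], num_classes=10))
# Expected output:
# [[1. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
#  [0. 0. 0. 1. 0. 0. 0. 0. 0. 0.]
#  [0. 0. 0. 0. 0. 0. 0. 0. 0. 1.]]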
In [2]:
# Load MNIST data
# the data, split between train and test sets
num_classes = 10 # we have 10 digits to classify
(x_train, y_train), (x_test, y_test) = mnist.load_data()
x_train = x_train.reshape(60000, 784)
x_test = x_test.reshape(10000, 784)
x_train = x_train.astype('float32')
x_test = x_test.astype('float32')
x_train /= 255
x_test /= 255
print(x_train.shape[0], 'train samples')
print(x_test.shape[0], 'test samples')
# convert class vectors to binary class matrices
y_train = keras.utils.to_categorical(y_train, num_classes)
y_test = keras.utils.to_categorical(y_test, num_classes)
# Plot sample pictures
fig = plt.figure(figsize=(8, 8))
# Recover the integer class of each (one-hot encoded) training label
classNum = np.zeros((y_train.shape[0], 1))
for i in range(y_train.shape[0]):
    classNum[i] = np.argmax(y_train[i, :])
meanVecs = np.zeros((num_classes, 784))
nReps = 10
counter = 1
for num in range(num_classes):
    idx = np.where(classNum == num)[0]
    meanVecs[num, :] = np.mean(x_train[idx, :], axis=0)
    for rep in range(nReps):
        mat = x_train[idx[rep], :]
        mat = mat.reshape(28, 28)
        ax = fig.add_subplot(num_classes, nReps, counter)
        plt.imshow(mat, cmap=plt.cm.binary)
        plt.xticks([])
        plt.yticks([])
        counter = counter + 1
plt.suptitle('Sample Numbers')
plt.show()
# Plot "Average" Pictures
fig = plt.figure(figsize=(8, 8))
for num in range(num_classes):
    mat = meanVecs[num, :].reshape(28, 28)
    ax = fig.add_subplot(3, 4, num + 1)
    plt.imshow(mat, cmap=plt.cm.binary)
plt.suptitle('Average Numbers')
plt.show()
We're now ready to build our first neural network model. We'll begin with the simplest model imaginable given the data: a network where each one of the 784 pixels in the input data is connected directly to each of the 10 output classes.
In this case there is one set of inputs flowing forward to a single set of outputs with no feedback. This is called a sequential model. We can thus create the "shell" for such a model by declaring:
In [10]:
model = Sequential()
The variable model will contain all the information on the model; as we add layers and perform calculations, everything will be stored inside this variable.
In general, adding a layer to a model is achieved by the model.add(layer_definition) command, where the layer_definition is specific to the kind of layer we want to add.
Layers are defined by 4 traits:
- The layer type: Dense for a fully-connected layer, Conv2D for a 2D convolutional layer, and so on.
- The number of outputs: here num_classes=10, one output per digit class.
- The number of inputs: for the first layer this must be given explicitly with the input_shape= parameter; later layers infer it from the layer before them.
- The activation function: e.g. relu is the rectified-linear-unit transform. For a layer connecting to the output, we want to enforce a binary-ish response, and hence the last layer will often use a softmax activation.

Note: in our super-simple first case, the first layer is also the last layer, which is why we end up both providing the input_shape= and using softmax. This will not be the case for deeper networks.
So we can add our fully connected (Dense) layer with num_classes outputs, softmax activation, and 784-pixel inputs as:
In [11]:
model.add(Dense(num_classes, activation='softmax', input_shape=(784,)))
Note the input_shape=(784,) here. 784 is the size of our flattened input images. The (784,) means we can have an unspecified number of input images coming into the network.
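As a quick sanity check (a sketch, assuming the Keras 2.x API used in this notebook), we can ask the model what shapes it has inferred; the leading None is that unspecified batch dimension:

# Inspect the shapes Keras inferred for the model defined above.
print(model.input_shape)   # (None, 784): any number of 784-pixel images
print(model.output_shape)  # (None, 10): one score per digit class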
Normally, we would keep adding more layers (see examples later). But in this case we only have one layer. So we can see what our model looks like by invoking:
In [12]:
model.summary()
Whenever you build a model be sure to look at the summary. The number of trainable parameters gives you a sense of model complexity. Models with fewer parameters are likely to be easier to train, and likely to generalize better for the same performance.
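For our single Dense layer the count reported by model.summary() is easy to verify by hand: one weight for every (pixel, class) pair, plus one bias per class.

# Hand-check of the trainable parameter count: 784*10 weights + 10 biases.
print(784 * num_classes + num_classes)  # 7850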
This step describes how Keras will update the model during the training phase (it defines some of the parameters for this calculation).
There are two choices we make here:
- The loss function: categorical_crossentropy is a popular choice that penalizes high-confidence mistakes.
- The optimizer: RMSprop() is a popular choice. The parameters of the optimizer, such as the learning rate, can also be specified as arguments to the optimizer (see the sketch below), but we will use the default values for now.

We also choose to report the accuracy during the optimization process, to keep track of progress.
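Here is a minimal sketch (assuming the standalone Keras 2.x API imported above) of constructing the optimizer with an explicit learning rate instead of relying on the defaults:

# Sketch: an RMSprop optimizer with an explicit learning rate
# (0.001 happens to be RMSprop's default anyway). It could be passed to
# model.compile(...) as optimizer=opt_with_lr in place of RMSprop().
opt_with_lr = RMSprop(lr=0.001)

The cell below performs the actual compilation with the default settings: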
In [13]:
model.compile(loss='categorical_crossentropy',
              optimizer=RMSprop(),
              metrics=['accuracy'])
We're now ready to perform the training. A couple more decisions need to be made:
- The batch size: how many training examples are processed before each update of the weights (here 128).
- The number of epochs: how many complete passes are made through the training data (here 20).

With these parameters we can fit the model to our training data as follows. This will print a running update of the accuracy (% of correctly classified points in the training data):
In [14]:
batch_size = 128
epochs = 20
history = model.fit(x_train, y_train,
                    batch_size=batch_size,
                    epochs=epochs,
                    verbose=1)
The real test of a model is how well it does on data it has not seen (a severely overfit model could get 100% accuracy on the training data, but fail miserably on new data). So the real score we care about is the performance on unseen data. We get this by evaluating performance on the test data we have held out so far. The code below also makes a plot showing the accuracy curve on the training data, and a flat line corresponding to the performance on the new data.
In [15]:
score = model.evaluate(x_test, y_test, verbose=0)
print('Test loss:', score[0])
print('Test accuracy:', score[1])
fig = plt.figure()
trainCurve=plt.plot(history.history['acc'],label='Training')
testCurve=plt.axhline(y=score[1],color='k',label='Testing')
plt.title('model accuracy')
plt.ylabel('accuracy')
plt.xlabel('epoch')
plt.legend()
plt.show()
Since we used an incredibly simple network, it is easy to look under the covers and see what is going on. For each class (i.e. digits 0 to 9) this simple network essentially has a filter matrix the same size as the image, which it "multiplies" against an input image. The digit whose filter produces the highest response is the chosen digit. We can look at the filter matrices by inspecting the weights of the first layer.
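As a quick sanity check of this picture (a sketch reusing the trained model and test data already in memory), we can classify one test image "by hand" with the learned weights and compare against the model's own prediction:

# Classify one test image by hand: take the response of each class filter
# (a dot product plus bias) and pick the largest one. Softmax preserves the
# ordering, so this argmax matches the model's own choice.
W, b = model.layers[0].get_weights()   # W has shape (784, 10), b has shape (10,)
x = x_test[0]                          # one flattened 28x28 test image
scores = x.dot(W) + b                  # one response per digit class
print('hand-picked digit:', np.argmax(scores))
print('model prediction :', np.argmax(model.predict(x_test[:1]), axis=1)[0])

The filter images themselves are plotted in the next cell.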
In [20]:
denseLayer = model.layers[0]
weights = denseLayer.get_weights()
topWeights = weights[0]
fig = plt.figure(figsize=(15, 15))
meanTop = np.mean(topWeights, axis=1)
for num in range(topWeights.shape[1]):
    mat = topWeights[:, num] - meanTop
    mat = mat.reshape(28, 28)
    ax = fig.add_subplot(3, 4, num + 1)
    plt.imshow(mat, cmap=plt.cm.binary)
plt.show()
#fig=plt.figure()
#plt.imshow(meanTop.reshape(28,28),cmap=plt.cm.binary)
#plt.show()
Compare these weight images to the mean digit images at the beginning of this notebook.
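Here is one way to make that comparison directly (a sketch that reuses meanVecs, topWeights and meanTop from the cells above): the mean image of each digit in the top row, and the learned filter for that digit below it.

# Mean digit images (top row) vs. learned filters (bottom row).
fig = plt.figure(figsize=(15, 3))
for num in range(num_classes):
    ax = fig.add_subplot(2, num_classes, num + 1)
    plt.imshow(meanVecs[num, :].reshape(28, 28), cmap=plt.cm.binary)
    plt.xticks([]); plt.yticks([])
    ax = fig.add_subplot(2, num_classes, num_classes + num + 1)
    plt.imshow((topWeights[:, num] - meanTop).reshape(28, 28), cmap=plt.cm.binary)
    plt.xticks([]); plt.yticks([])
plt.suptitle('Mean digit (top) vs. learned filter (bottom)')
plt.show()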
We will now build a deeper 3-layer network, following exactly the same steps, but with two additional layers thrown in. FYI, this topology comes from one of the examples provided by Keras at https://github.com/keras-team/keras/blob/master/examples/mnist_mlp.py
In [23]:
# Model initialization unchanged
model = Sequential()
# The input layer is similar, but because it is not the final layer we are free
# to choose the number of outputs (here 512), and we use 'relu' activation instead of softmax
model.add(Dense(512, activation='relu', input_shape=(784,)))
# New intermediate layer connecting the 512 inputs from the previous layer to the 512 new outputs
model.add(Dense(512, activation='relu'))
# The 512 inputs from the previous layer connect to the final 10 class outputs, with a softmax activation
model.add(Dense(num_classes, activation='softmax'))
# Compilation as before
model.compile(loss='categorical_crossentropy',
              optimizer=RMSprop(),
              metrics=['accuracy'])
#plot_model(model, to_file='model_plot.png', show_shapes=True, show_layer_names=True)
#Image("model_plot.png")
model.summary()
Note the dramatic increase in the number of parameters in this network. When you train it, you'll see that this increase really slows training down, but also leads to an improvement in performance:
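The total reported by model.summary() can again be checked by hand; each Dense layer contributes (inputs x outputs) weights plus one bias per output:

# Hand-check of the parameter count for the 3-layer network.
layer_sizes = [(784, 512), (512, 512), (512, num_classes)]
print(sum(n_in * n_out + n_out for n_in, n_out in layer_sizes))  # 669706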
In [24]:
batch_size = 128
epochs = 20
history = model.fit(x_train, y_train,
                    batch_size=batch_size,
                    epochs=epochs,
                    verbose=1)
score = model.evaluate(x_test, y_test, verbose=0)
print('Test loss:', score[0])
print('Test accuracy:', score[1])
fig = plt.figure()
trainCurve=plt.plot(history.history['acc'],label='Training')
testCurve=plt.axhline(y=score[1],color='k',label='Testing')
plt.title('model accuracy')
plt.ylabel('accuracy')
plt.xlabel('epoch')
plt.legend()
plt.show()
We can get some clues by looking at the input layer as before, although now there are 512 filters instead of 10, so we just look at the first 10.
In [28]:
denseLayer = model.layers[0]
weights = denseLayer.get_weights()
topWeights = weights[0]
fig = plt.figure(figsize=(15, 15))
meanTop = np.mean(topWeights, axis=1)
for num in range(10):
    mat = topWeights[:, num] - meanTop
    mat = mat.reshape(28, 28)
    ax = fig.add_subplot(3, 4, num + 1)
    plt.imshow(mat, cmap=plt.cm.binary)
plt.show()
What is the qualitative difference between these plots and the corresponding plots for the single-layer network above?
As you can see, with the increase in the number of parameters there is an increase in the mismatch between training and testing accuracy, raising the concern that the model is overfitting to the training data. There are a couple of approaches to combat this.
Dropout is a way of introducing noise into the model to prevent overfitting. It is implemented by adding dropout layers, which randomly set a specified fraction of their input units to zero at each update during training. Below is the same model implemented with dropout layers added.
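To see the idea in isolation, here is a rough numpy sketch of (inverted) dropout; it is not Keras's exact implementation, but it captures what the Dropout layers below do during training:

# Randomly zero a fraction `rate` of the activations and rescale the
# survivors so the expected value is unchanged; at test time dropout
# is switched off entirely.
rate = 0.2
activations = np.ones(10)             # stand-in for a layer's outputs
mask = np.random.rand(10) >= rate     # keep ~80% of the units
print(activations * mask / (1.0 - rate))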
In [29]:
model = Sequential()
model.add(Dense(512, activation='relu', input_shape=(784,)))
# NEW dropout layer dropping 20% of inputs
model.add(Dropout(0.2))
model.add(Dense(512, activation='relu'))
# NEW dropout layer dropping 20% of inputs
model.add(Dropout(0.2))
model.add(Dense(num_classes, activation='softmax'))
# Compilation as before
model.compile(loss='categorical_crossentropy',
              optimizer=RMSprop(),
              metrics=['accuracy'])
#plot_model(model, to_file='model_plot.png', show_shapes=True, show_layer_names=True)
#Image("model_plot.png")
model.summary()
batch_size = 128
epochs = 20
history = model.fit(x_train, y_train,
                    batch_size=batch_size,
                    epochs=epochs,
                    verbose=1)
score = model.evaluate(x_test, y_test, verbose=0)
print('Test loss:', score[0])
print('Test accuracy:', score[1])
fig = plt.figure()
trainCurve=plt.plot(history.history['acc'],label='Training')
testCurve=plt.axhline(y=score[1],color='k',label='Testing')
plt.title('model accuracy')
plt.ylabel('accuracy')
plt.xlabel('epoch')
plt.legend()
plt.show()