In the next cell, we introduce Keras, a high-level library for machine learning which we will use for the rest of the class. Keras is built on top of Tensorflow, an open-source framework which implements machine learning methods, particularly deep neural networks, while optimizing the efficiency of the computation. We do not have to deal much with those details. For our purposes, Tensorflow is a very low-level library which is not necessarily accessible to the typical engineer. Keras solves this by wrapping Tensorflow, reducing the complexity of coding neural networks, and giving us a set of convenient functions which implement many reusable routines. Most importantly, Keras (via Tensorflow) efficiently implements backpropagation to train neural networks on the GPU. Effectively, you could say that Keras is to Tensorflow what Processing is to Java.
To start, we will re-implement what we did in the last section, a neural network to predict the petal width of the Iris flowers, but this time we will use Keras.
Start by importing the relevant Keras libraries that we will be using, as well as matplotlib and numpy.
In [46]:
import os
import matplotlib.pyplot as plt
import numpy as np
import random
import keras
from keras.models import Sequential
from keras.layers import Dense, Dropout
Let's load the Iris dataset again.
In [47]:
from sklearn.datasets import load_iris
iris = load_iris()
data, labels = iris.data[:,0:3], iris.data[:,3]  # first 3 columns are the inputs, petal width (column 3) is the target
In the last lesson, we manually trained a neural network to predict the petal width of the Iris flowers from the other three measurements. This time, let's use the Keras library instead. First we need to shuffle and pre-process the data. Pre-processing in this case means normalizing the data and converting it to a properly-shaped numpy array.
In [48]:
num_samples = len(labels) # size of our dataset
shuffle_order = np.random.permutation(num_samples)
data = data[shuffle_order, :]
labels = labels[shuffle_order]
# normalize data and labels to between 0 and 1 and make sure it's float32
data = data / np.amax(data, axis=0)
data = data.astype('float32')
labels = labels / np.amax(labels, axis=0)
labels = labels.astype('float32')
# print out the data
print("shape of X", data.shape)
print("first 5 rows of X\n", data[0:5, :])
print("first 5 labels\n", labels[0:5])
In our previous guides, we always evaluated the performance of the network on the same data that we trained it on. But this is wrong: our network could learn to "cheat" by effectively memorizing the training data so as to get a high score, and then fail to generalize to genuinely unseen examples.
In machine learning, this is called "overfitting", and there are several things we do to avoid it. The first is to split our dataset into a "training set", which we train on with gradient descent, and a "test set", which is hidden from the training process so that we can do a final evaluation on it and estimate the true accuracy: how well the network predicts samples it has never seen.
Let's split the data into a training set and a test set. We'll keep the first 30% of the dataset to use as a test set, and use the rest for training.
In [49]:
# let's rename the data and labels to X, y
X, y = data, labels
test_split = 0.3 # percent split
n_test = int(test_split * num_samples)
x_train, x_test = X[n_test:, :], X[:n_test, :]
y_train, y_test = y[n_test:], y[:n_test]
print('%d training samples, %d test samples' % (x_train.shape[0], x_test.shape[0]))
In Keras, to instantiate a neural network model, we use the Sequential class. Sequential simply means a model with a sequence of layers which propagate in one direction, from input to output.
In [141]:
model = Sequential()
We now have an empty neural network called model. Now let's add our first layer using Keras's Dense class. Because it comes first, this layer will also specify the dimension of the input.
The reason it is called "Dense" is that the layer is "fully-connected": every one of its neurons is connected to every neuron in the previous layer, with no missing connections. This may seem confusing at first because we have not yet seen neural network layers which are not fully-connected; we will see them in the next chapter when we introduce convolutional networks.
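To make "fully-connected" concrete, here is a minimal numpy sketch, entirely separate from the Keras model we are about to build, of what a single Dense layer computes (the names x, W and b are just for illustration):
def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

x = np.random.rand(3)       # one sample with 3 input features
W = np.random.randn(3, 8)   # one weight for every (input, neuron) pair
b = np.zeros(8)             # one bias per neuron
h = sigmoid(x.dot(W) + b)   # the layer's 8 outputs, shape (8,)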
To create a Dense layer, we need to specify two arguments: the number of neurons and the activation function (which non-linearity to apply, if any). For the first layer, we must also specify the input dimension.
In [142]:
model.add(Dense(8, activation='sigmoid', input_dim=3))
We can also get a readout of the current state of the network using model.summary():
In [143]:
model.summary()
Our network currently has one layer with 32 parameters: that's 3 neurons in the input layer, times 8 neurons in the middle layer (3x8=24), plus 8 biases (24+8=32).
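One way to verify this is to inspect the weights directly; model.get_weights() returns the layer's weight matrix and bias vector:
W, b = model.get_weights()
print(W.shape, b.shape)  # (3, 8) and (8,): 24 weights + 8 biases = 32 parameters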
Next, we will add the output layer, which will be a fully-connected (Dense) layer whose size is 1 neuron. This neuron will contain our final output.
Notice that this time, instead of having the activation be a sigmoid as before, we leave it as a "linear" activation (no non-linearity). This is common for the final output of a regression network, since we don't want to constrain the range of the prediction.
We add it, and look at the final summary.
In [144]:
model.add(Dense(1, activation='linear'))
model.summary()
So we've added 9 parameters: 8x1 weights between the hidden and output layers, plus 1 bias in the output, giving us 41 parameters in total.
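As a quick sanity check, we can also have Keras count the parameters for us:
print(model.count_params())  # 41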
Now we are finished specifying the architecture of the model. Next we need to specify our loss function and optimizer, and then compile the model. Let's discuss each of these.
First, we specify the loss. The standard for regression, as we said before, is sum-squared error (SSE) or mean-squared error (MSE). SSE and MSE are essentially interchangeable for training, since the only difference between them is a scaling factor ($\frac{1}{n}$) which doesn't depend on the weights. Keras happens to use MSE rather than SSE for evaluation, so we will use that.
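Written out for $n$ samples with true values $y_i$ and predictions $\hat{y}_i$:

$$SSE = \sum_{i=1}^{n} (y_i - \hat{y}_i)^2, \qquad MSE = \frac{1}{n} \sum_{i=1}^{n} (y_i - \hat{y}_i)^2$$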
The optimizer is the flavor of gradient descent we want. The most basic optimizer is "stochastic gradient descent", or SGD, which is the learning algorithm we have used so far. Strictly speaking, we have mostly used batch gradient descent, which means we compute our gradient over the entire dataset. For reasons which will become clearer when we cover learning algorithms in more detail, this is usually not favored, and we instead calculate the gradient over small random subsets of the training data, called mini-batches.
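To make the idea concrete, here is a toy numpy sketch of one epoch of mini-batch SGD on a simple linear model. This is not how Keras implements it internally, and the learning rate lr is just an illustrative value:
# one epoch of mini-batch SGD on a toy linear model with MSE loss;
# Keras performs the analogous updates for the full network
lr, batch_size = 0.1, 4                         # illustrative hyperparameters
w = np.zeros(x_train.shape[1])                  # weights of the toy model
shuffled = np.random.permutation(len(x_train))  # visit samples in random order
for start in range(0, len(x_train), batch_size):
    batch = shuffled[start:start + batch_size]
    X_b, y_b = x_train[batch], y_train[batch]
    grad = 2.0 * X_b.T.dot(X_b.dot(w) - y_b) / len(batch)  # gradient of MSE on this mini-batch
    w -= lr * grad                              # one small step downhill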
Once we've specified our loss function and optimizer, the model is compiled. Compiling means that Keras (actually Tensorflow, internally) allocates memory for a "computational graph" whose architecture matches your model definition. This is done for efficiency; a full understanding of how it works is beyond the scope of this course.
In [145]:
model.compile(loss='mean_squared_error', optimizer='sgd')
We are finally ready to train. In the next cell, we run the fit command which will begin the process of training. There are several important arguments to fit. The first is the training data and labels (x_train and y_train), as well as the validation set (x_test and y_test).
Additionally, we must specify the batch_size, which refers to the number of training samples to calculate the gradient over (using SGD), as well as the number of epochs, which refers to the number of times we cycle through the training set. In general, more epochs are usually better, although in practice the accuracy of the network may stop improving early, which makes it unnecessary to train for too many epochs.
Because we have a very small dataset (just 105 training samples), we should use a low batch size and can afford to train over many epochs (let's set it to 200). With a batch size of 4, each epoch makes 27 gradient updates (105/4, rounded up), so 200 epochs gives us around 5,400 updates in total.
In [146]:
history = model.fit(x_train, y_train,
                    batch_size=4,
                    epochs=200,
                    verbose=1,
                    validation_data=(x_test, y_test))
As you can see above, we train our network down to a validation MSE below 0.01. Notice that both the training loss ("loss") and validation loss ("val_loss") are reported. It's normal for the training loss to be lower than the validation loss, since the network's objective is to predict the training data well. But if the training loss is much lower than the validation loss, it means we are overfitting and should not expect very good results on unseen data.
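Since fit returns a history object, we can also plot the two loss curves over the epochs and see the gap between them directly (using the matplotlib we imported at the top):
plt.plot(history.history['loss'], label='training loss')
plt.plot(history.history['val_loss'], label='validation loss')
plt.xlabel('epoch')
plt.ylabel('MSE')
plt.legend()
plt.show()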
We can evaluate the test set one last time at the end using evaluate:
In [147]:
score = model.evaluate(x_test, y_test)
print('Test loss:', score)
To get the raw predictions:
In [150]:
y_pred = model.predict(x_test)
for yp, ya in list(zip(y_pred, y_test))[0:10]:
    print("predicted %0.2f, actual %0.2f" % (yp[0], ya))
We can manually calculate MSE as a sanity check:
In [152]:
def MSE(y_pred, y_test):
    return (1.0/len(y_test)) * np.sum([((y1[0]-y2)**2) for y1, y2 in zip(y_pred, y_test)])
print("MSE is %0.4f" % MSE(y_pred, y_test))
We can also predict the value of a single unknown example, or a set of them, in the following way:
In [154]:
x_sample = x_test[0].reshape(1, 3) # shape must be (num_samples, 3), even if num_samples = 1
y_sample_pred = model.predict(x_sample)
print("predicted %0.3f, actual %0.3f" % (y_sample_pred[0][0], y_test[0]))
We've now finished introducing Keras for regression. Note that it is a far more powerful way of training neural networks than the one we implemented ourselves. Keras's strengths will become even more apparent in the next lesson, when we introduce classification, and later when we introduce convolutional networks and the various other optimization tricks it enables for us.