The dataset is taken from Andrew Ng's introductory neural networks course on Coursera.
The dataset ("data.h5") contains:
Let's get more familiar with the dataset.
In [2]:
import numpy as np
import h5py
# Loading the data (cat/non-cat)
train_dataset = h5py.File('../datasets/train_catvnoncat.h5', "r")
train_set_x_orig = np.array(train_dataset["train_set_x"][:]) # train set features
train_set_y_orig = np.array(train_dataset["train_set_y"][:]) # train set labels
test_dataset = h5py.File('../datasets/test_catvnoncat.h5', "r")
test_set_x_orig = np.array(test_dataset["test_set_x"][:]) # test set features
test_set_y_orig = np.array(test_dataset["test_set_y"][:]) # test set labels
classes = np.array(test_dataset["list_classes"][:]) # the list of classes
train_set_y = train_set_y_orig.reshape((1, train_set_y_orig.shape[0]))
test_set_y = test_set_y_orig.reshape((1, test_set_y_orig.shape[0]))
Each entry of train_set_x_orig and test_set_x_orig is an array representing an image. You can visualize an example by running the following code. Feel free to change the index value and re-run to see other images.
In [3]:
import matplotlib.pyplot as plt
%matplotlib inline
plt.rcParams['figure.figsize'] = (5.0, 4.0) # set default size of plots
plt.rcParams['image.interpolation'] = 'nearest'
plt.rcParams['image.cmap'] = 'gray'
np.random.seed(1)
In [4]:
# Example of a picture
index = 50
plt.imshow(train_set_x_orig[index])
print ("y = " + str(train_set_y[:, index]) + ", it's a '" + classes[np.squeeze(train_set_y[:, index])].decode("utf-8") + "' picture.")
Dataset pre-processing:
Common steps for pre-processing a new dataset are: figure out the dimensions and shapes of the data (m_train, m_test, num_px, ...), reshape the datasets so that each example becomes a flattened vector, and standardise the data.
Many software bugs in deep learning come from having matrix/vector dimensions that don't fit. If you can keep your matrix/vector dimensions straight you will go a long way toward eliminating many bugs.
In [5]:
m_train = train_set_x_orig.shape[0]
m_test = test_set_x_orig.shape[0]
num_px = train_set_x_orig.shape[1]
print ("Dataset dimensions:")
print ("Number of training examples: m_train = " + str(m_train))
print ("Number of testing examples: m_test = " + str(m_test))
print ("Height/Width of each image: num_px = " + str(num_px))
print ("Each image is of size: (" + str(num_px) + ", " + str(num_px) + ", 3)")
print ("train_set_x shape: " + str(train_set_x_orig.shape))
print ("train_set_y shape: " + str(train_set_y.shape))
print ("test_set_x shape: " + str(test_set_x_orig.shape))
print ("test_set_y shape: " + str(test_set_y.shape))
For convenience, we now reshape the images of shape (num_px, num_px, 3) into a numpy array of shape (num_px $*$ num_px $*$ 3, 1). After this, our training (and test) dataset is a numpy array where each column represents a flattened image. There should be m_train (respectively m_test) columns.
A trick when you want to flatten a matrix X of shape (a,b,c,d) to a matrix X_flatten of shape (b$*$c$*$d, a) is to use:
X_flatten = X.reshape(X.shape[0], -1).T # X.T is the transpose of X
In [6]:
# Reshape the training and test examples
train_set_x_flatten = train_set_x_orig.reshape(train_set_x_orig.shape[0], -1).T
test_set_x_flatten = test_set_x_orig.reshape(test_set_x_orig.shape[0], -1).T
print ("train_set_x_flatten shape: " + str(train_set_x_flatten.shape))
print ("train_set_y shape: " + str(train_set_y.shape))
print ("test_set_x_flatten shape: " + str(test_set_x_flatten.shape))
print ("test_set_y shape: " + str(test_set_y.shape))
print ("sanity check after reshaping: " + str(train_set_x_flatten[0:5,0]))
To represent color images, the red, green and blue channels (RGB) must be specified for each pixel, and so the pixel value is actually a vector of three numbers ranging from 0 to 255.
One common preprocessing step in machine learning is to center and standardize your dataset, meaning that you subtract the mean of the whole numpy array from each example, and then divide each example by the standard deviation of the whole numpy array. But for picture datasets, it is simpler and more convenient, and works almost as well, to just divide every row of the dataset by 255 (the maximum value of a pixel channel).
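For reference, here is a minimal sketch of the general centre-and-standardise step described above (using the flattened arrays from the previous cell; we do not actually use this for the picture dataset):
mu = train_set_x_flatten.mean()      # mean over the whole training array
sigma = train_set_x_flatten.std()    # standard deviation over the whole training array
train_set_x_std = (train_set_x_flatten - mu) / sigma   # centre and scale the training set
test_set_x_std = (test_set_x_flatten - mu) / sigma     # reuse the training statistics on the test set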
Let's standardize our dataset.
In [7]:
train_set_x = train_set_x_flatten/255.
test_set_x = test_set_x_flatten/255.
It's time to design a simple algorithm to distinguish cat images from non-cat images.
The input is the image transformed into a vector as above, flattened and normalised. The activation function is the sigmoid. If the output is greater than 0.5 the image is classified as a cat, otherwise not.
Mathematical expression of the algorithm:
For one example $x^{(i)}$: $$z^{(i)} = w^T x^{(i)} + b$$ $$\hat{y}^{(i)} = a^{(i)} = \sigma(z^{(i)})$$ $$\mathcal{L}(a^{(i)}, y^{(i)}) = - y^{(i)} \log(a^{(i)}) - (1-y^{(i)}) \log(1-a^{(i)})$$
The cost is then computed by summing over all training examples: $$ J = \frac{1}{m} \sum_{i=1}^m \mathcal{L}(a^{(i)}, y^{(i)})$$
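As a quick illustration of these formulas, here is a minimal numpy sketch for a single flattened example; the weights w and bias b below are just illustrative placeholders, not learned values:
x = train_set_x_flatten[:, 0:1] / 255.    # one flattened, normalised image, shape (12288, 1)
y = train_set_y[0, 0]                     # its true label (0 or 1)
w = np.zeros((x.shape[0], 1))             # placeholder weights
b = 0.0                                   # placeholder bias
z = np.dot(w.T, x) + b                    # z = w^T x + b
a = 1 / (1 + np.exp(-z))                  # a = sigmoid(z)
loss = -(y * np.log(a) + (1 - y) * np.log(1 - a))   # cross-entropy loss for this example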
We will use the LogisticRegression class from sklearn.
Key steps: create the classifier, fit it on the training set, predict the labels of the test set, and evaluate the accuracy.
In [8]:
from sklearn.linear_model import LogisticRegression
In [9]:
lr = LogisticRegression(C=1000.0, random_state=0)
In [10]:
lr.fit(train_set_x.T, train_set_y.T.ravel())
Out[10]:
These are the weights that the algorithm fitted to the model:
In [11]:
lr.coef_.shape
Out[11]:
In [12]:
lr.coef_
Out[12]:
In [13]:
lr.intercept_
Out[13]:
The model fitted is therefore: $$-0.001 + 0.056*x_{1} -0.114*x_{2} + ... + 0.154*x_{12288}$$
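We can apply this fitted linear model manually as a sanity check (a small sketch using the objects defined above); passing the test set through the coefficients, the intercept and the sigmoid should reproduce the classifier's decisions:
z = test_set_x.T @ lr.coef_.T + lr.intercept_       # linear part, shape (m_test, 1)
p_cat = 1 / (1 + np.exp(-z))                        # sigmoid gives the probability of "cat"
manual_prediction = (p_cat > 0.5).astype(int).ravel()
# manual_prediction should match lr.predict(test_set_x.T)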
In [14]:
Y_prediction = lr.predict(test_set_x.T)
Y_prediction.shape
Out[14]:
In [15]:
print("test accuracy: {} %".format(100 - np.mean(np.abs(Y_prediction - test_set_y)) * 100))
72% of the test photos have been predicted correctly.
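The same figure can be obtained with sklearn's built-in scorer, as a quick cross-check:
from sklearn.metrics import accuracy_score
print("test accuracy: {} %".format(100 * accuracy_score(test_set_y.ravel(), Y_prediction)))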
We can also see how many and which images have been misclassified:
In [16]:
def print_mislabeled_images(classes, X, y, p):
"""
Plots images where predictions and truth were different.
classes -- list of class names
X -- dataset
y -- true labels
p -- predictions
"""
a = p + y
mislabeled_indices = np.asarray(np.where(a == 1))
plt.rcParams['figure.figsize'] = (40.0, 40.0) # set default size of plots
num_images = len(mislabeled_indices[0])
for i in range(num_images):
index = mislabeled_indices[1][i]
plt.subplot(2, num_images, i + 1)
plt.imshow(X[:,index].reshape(64,64,3), interpolation='nearest')
plt.axis('off')
t1 = classes[int(p[index])].decode("utf-8")
t2 = classes[y[0,index]].decode("utf-8")
plt.title("Prediction: " + t1 + " \n Class: " + t2)
return num_images
In [17]:
misses = print_mislabeled_images(classes, test_set_x, test_set_y, Y_prediction)
In [18]:
print("Number of misses: {0}".format(misses))
You can use your own image and see the output of the model. To do that, place your image file in the images directory, set my_image_filename to its name in the code below, and run the cells.
I will try with a cat image modified to have bunny ears!
In [19]:
import scipy
from scipy import ndimage
from scipy import misc   # needed below for scipy.misc.imresize
#from PIL import Image
# Note: scipy.ndimage.imread and scipy.misc.imresize require an older SciPy release;
# they were later removed (PIL or imageio can be used instead).
In [21]:
my_image_filename = "Bunny+cat.png" # change this to the name of your image file
# We preprocess the image to fit the LR algorithm.
fname = "../images/" + my_image_filename
image = np.array(scipy.ndimage.imread(fname, flatten=False))
my_image = scipy.misc.imresize(image, size=(num_px,num_px)).reshape((1, num_px*num_px*3)).T
In [22]:
my_predicted_image = lr.predict(my_image.T)
plt.imshow(image)
print("y = " + str(np.squeeze(my_predicted_image)) + ", the algorithm predicts a \"" + classes[int(np.squeeze(my_predicted_image)),].decode("utf-8") + "\" picture.")
Hmmm ... OK, the model seems easy to trick ...
In [23]:
# Example of a picture
index = 10
plt.imshow(train_set_x_orig[index])
print ("y = " + str(train_set_y[0,index]) + ". It's a " + classes[train_set_y[0,index]].decode("utf-8") + " picture.")
The main steps for building a neural network are: define the model structure (such as the number of layers and of units), initialise the parameters, and then loop over forward propagation, cost computation, backward propagation and parameter update.
Each layer has a different number of units.
Defining the architecture of a neural network depends on the training data and the type of classification to perform, and it is an art in itself. Defining and refining the number of layers, the units and all the other parameters is called hyperparameter tuning, and we will look at it in detail in a later notebook.
For the moment we hard-code an initial architecture.
Detailed architecture: the input is a flattened image vector of size num_px * num_px * 3 = 12288, followed by three hidden layers of 20, 7 and 5 units, and an output layer with a single sigmoid unit.
In [26]:
### CONSTANTS DEFINING THE MODEL ####
n_x = train_set_x_flatten.shape[0] # size of input layer
n_y = 1 # size of output layer (a single unit whose output will be 0 or 1)
# We define a neural network with five layers in total: the input layer,
# three hidden layers with 20, 7 and 5 units, and the output layer.
nn_layers = [n_x, 20, 7, 5, n_y] # length is 5 (layers)
In [27]:
nn_layers
Out[27]:
There are two types of parameters to initialise in a neural network: the weight matrices $W^{[l]}$ and the bias vectors $b^{[l]}$.
The weight matrices are initialised with random values, while the bias vectors are initialised as vectors of zeros.
In general, initializing all the weights to zero results in the network failing to break symmetry. This means that every neuron in each layer will learn the same thing and the network is no more powerful than a linear classifier such as logistic regression.
To break symmetry, we initialise the weights randomly. Following random initialisation, each neuron can then proceed to learn a different function of its inputs.
Of course, different initializations lead to different results and poor initialisation can slow down the optimisation algorithm.
One good practice is not to initialise the weights to values that are too large; what brings good results instead are the so-called Xavier initialisation or the He initialisation (for ReLU activations).
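As a small illustration with made-up layer sizes, Xavier scales the random weights by $\sqrt{1/n^{[l-1]}}$ while He scales them by $\sqrt{2/n^{[l-1]}}$; the initialise_parameters function below uses the Xavier scaling:
n_prev, n_l = 5, 20                                             # illustrative layer sizes
W_xavier = np.random.randn(n_l, n_prev) * np.sqrt(1. / n_prev)  # Xavier initialisation
W_he = np.random.randn(n_l, n_prev) * np.sqrt(2. / n_prev)      # He initialisation (for ReLU)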
The layer sizes are passed in layer_dims. For example, a model with two inputs, one hidden layer with 4 hidden units and an output layer with 1 output unit would have layer_dims equal to [2,4,1]. This means W1's shape is (4,2), b1 is (4,1), W2 is (1,4) and b2 is (1,1).
In [28]:
# FUNCTION: initialize_parameters
def initialise_parameters(layer_dims):
"""
Arguments:
layer_dims -- python array (list) containing the dimensions of each layer in our network
Returns:
parameters -- python dictionary containing your parameters "W1", "b1", ..., "WL", "bL":
Wl -- weight matrix of shape (layer_dims[l], layer_dims[l-1])
bl -- bias vector of shape (layer_dims[l], 1)
"""
parameters = {}
L = len(layer_dims) # number of layers in the network, including the input layer
for l in range(1, L):
parameters['W' + str(l)] = np.random.randn(layer_dims[l], layer_dims[l-1]) / np.sqrt(layer_dims[l-1])
parameters['b' + str(l)] = np.zeros((layer_dims[l], 1))
# unit tests
assert(parameters['W' + str(l)].shape == (layer_dims[l], layer_dims[l-1]))
assert(parameters['b' + str(l)].shape == (layer_dims[l], 1))
return parameters
How would that example network with three layers (of 2, 4 and 1 units) look once initialised?
In [29]:
np.random.seed(3)
parameters = initialise_parameters([2,4,1])
print("W1 = " + str(parameters["W1"]))
print("b1 = " + str(parameters["b1"]))
print("W2 = " + str(parameters["W2"]))
print("b2 = " + str(parameters["b2"]))
Now that we have initialized our parameters, we will do the forward propagation module.
We will implement some helper functions and then put all together:
In [30]:
# FUNCTION: linear_forward
def linear_forward(A, W, b):
"""
Implement the linear part of a layer's forward propagation.
Arguments:
A -- activations from previous layer (or input data): (size of previous layer, number of examples)
W -- weights matrix: numpy array of shape (size of current layer, size of previous layer)
b -- bias vector, numpy array of shape (size of the current layer, 1)
Returns:
Z -- the input of the activation function, also called pre-activation parameter
cache -- a python dictionary containing "A", "W" and "b" ; stored for computing the backward pass efficiently
"""
Z = np.dot(W, A) + b
assert(Z.shape == (W.shape[0], A.shape[1]))
cache = (A, W, b)
return Z, cache
A quick check:
In [31]:
def linear_forward_test_case():
np.random.seed(1)
A = np.random.randn(3,2)
W = np.random.randn(1,3)
b = np.random.randn(1,1)
Z, linear_cache = linear_forward(A, W, b)
assert(round(Z[0][0], 5) == 3.26295)
assert(round(Z[0][1], 5) == -1.23430)
return "OK"
In [32]:
print(linear_forward_test_case())
In this notebook, we will use two activation functions:
Sigmoid: $\sigma(Z) = \sigma(W A + b) = \frac{1}{ 1 + e^{-(W A + b)}}$.
This function returns two items: the activation value "A" and a "cache" that contains "Z" (which is what we will feed to the corresponding backward function). To use it you can simply call:
A, activation_cache = sigmoid(Z)
ReLU: The mathematical formula for ReLU is $A = RELU(Z) = \max(0, Z)$.
This function returns two items: the activation value "A" and a "cache" that contains "Z" (which is what we will feed to the corresponding backward function). To use it you can simply call:
A, activation_cache = relu(Z)
In [33]:
# FUNCTION: sigmoid
def sigmoid(Z):
"""
Implements the sigmoid activation in numpy
Arguments:
Z -- numpy array of any shape
Returns:
A -- output of sigmoid(z), same shape as Z
cache -- returns Z as well, useful during backpropagation
"""
A = 1/(1+np.exp(-Z))
cache = Z
return A, cache
A quick black-box test for the sigmoid:
In [34]:
def sigmoid_test_case():
result, cache = sigmoid(np.array([0,2]))
assert(round(result[0], 5) == 0.5)
assert(round(result[1], 5) == 0.8808)
return "OK"
In [35]:
sigmoid_test_case()
Out[35]:
In [36]:
def relu(Z):
"""
Implement the RELU function.
Arguments:
Z -- Output of the linear layer, of any shape
Returns:
A -- Post-activation parameter, of the same shape as Z
cache -- returns Z as well, useful during backpropagation
"""
A = np.maximum(0,Z)
assert(A.shape == Z.shape)
cache = Z
return A, cache
Next: Implement the forward propagation of the LINEAR->ACTIVATION layer.
Mathematical relation is: $A^{[l]} = g(Z^{[l]}) = g(W^{[l]}A^{[l-1]} +b^{[l]})$ where the activation "g" can be sigmoid() or relu(). Use linear_forward() and the correct activation function.
In [37]:
# FUNCTION: linear_activation_forward
def linear_activation_forward(A_prev, W, b, activation):
"""
Implement the forward propagation for the LINEAR->ACTIVATION layer
Arguments:
A_prev -- activations from previous layer (or input data): (size of previous layer, number of examples)
W -- weights matrix: numpy array of shape (size of current layer, size of previous layer)
b -- bias vector, numpy array of shape (size of the current layer, 1)
activation -- the activation to be used in this layer, stored as a text string: "sigmoid" or "relu"
Returns:
A -- the output of the activation function, also called the post-activation value
cache -- a python dictionary containing "linear_cache" and "activation_cache";
stored for computing the backward pass efficiently
"""
Z, linear_cache = linear_forward(A_prev, W, b)
if activation == "sigmoid":
A, activation_cache = sigmoid(Z)
elif activation == "relu":
A, activation_cache = relu(Z)
assert (A.shape == (W.shape[0], A_prev.shape[1]))
cache = (linear_cache, activation_cache)
return A, cache
As usual, a quick test:
In [38]:
def linear_activation_forward_test_case():
np.random.seed(2)
A_prev = np.random.randn(3,2)
W = np.random.randn(1,3)
b = np.random.randn(1,1)
A, linear_activation_cache = linear_activation_forward(A_prev, W, b, activation = "sigmoid")
assert(round(A[0][0], 5) == 0.9689)
assert(round(A[0][1], 5) == 0.11013)
A, linear_activation_cache = linear_activation_forward(A_prev, W, b, activation = "relu")
assert(round(A[0][0], 5) == 3.43896)
assert(round(A[0][1], 5) == 0.0)
return "OK"
In [39]:
linear_activation_forward_test_case()
Out[39]:
Note: In deep learning, the "[LINEAR->ACTIVATION]" computation is counted as a single layer in the neural network, not two layers.
In [40]:
# FUNCTION: L_model_forward
def L_model_forward(X, parameters):
"""
Implement forward propagation for the [LINEAR->RELU]*(L-1)->LINEAR->SIGMOID computation
Arguments:
X -- data, numpy array of shape (input size, number of examples)
parameters -- output of initialise_parameters()
Returns:
AL -- last post-activation value
caches -- list of caches containing:
every cache of linear_activation_forward() with "relu" (there are L-1 of them, indexed from 0 to L-2)
the cache of linear_activation_forward() with "sigmoid" (there is one, indexed L-1)
"""
caches = []
A = X
L = len(parameters) // 2 # number of layers in the neural network
# Implement [LINEAR -> RELU]*(L-1). Add "cache" to the "caches" list.
for l in range(1, L):
A_prev = A
w_l = parameters['W' + str(l)]
b_l = parameters['b' + str(l)]
A, cache = linear_activation_forward(A_prev, w_l, b_l, activation = "relu")
caches.append(cache)
# Implement LINEAR -> SIGMOID. Add "cache" to the "caches" list.
w_L = parameters['W' + str(L)]
b_L = parameters['b' + str(L)]
Yhat, cache = linear_activation_forward(A, w_L, b_L, activation = "sigmoid")
caches.append(cache)
assert(Yhat.shape == (1,X.shape[1]))
return Yhat, caches
In [41]:
def L_model_forward_test_case():
np.random.seed(6)
X = np.random.randn(5,4)
W1 = np.random.randn(4,5)
b1 = np.random.randn(4,1)
W2 = np.random.randn(3,4)
b2 = np.random.randn(3,1)
W3 = np.random.randn(1,3)
b3 = np.random.randn(1,1)
parameters = {"W1": W1,
"b1": b1,
"W2": W2,
"b2": b2,
"W3": W3,
"b3": b3}
AL, caches = L_model_forward(X, parameters)
np.testing.assert_array_almost_equal(AL, [[0.03922, 0.704989, 0.19734, 0.04728]], decimal=5)
assert(len(caches) == 3)
return "OK"
In [42]:
L_model_forward_test_case()
Out[42]:
Great! Now we have a full forward propagation that takes the input X and outputs a row vector $A^{[L]}$ containing our predictions. It also records all intermediate values in "caches". Using $A^{[L]}$, we can compute the cost of our predictions.
Now we need to compute the cost, because we want to check if our model is actually learning.
Next: Compute the cross-entropy cost $J$, using the following formula: $$-\frac{1}{m} \sum\limits_{i = 1}^{m} (y^{(i)}\log\left(a^{[L] (i)}\right) + (1-y^{(i)})\log\left(1- a^{[L](i)}\right)) \tag{7}$$
In [43]:
# FUNCTION: compute_cost
def compute_cost(Yhat, Y):
"""
Implement the cross-entropy cost function
Arguments:
Yhat -- probability vector corresponding to the label predictions, shape (1, number of examples)
Y -- true "label" vector (for example: containing 0 if non-cat, 1 if cat), shape (1, number of examples)
Returns:
cost -- cross-entropy cost
"""
m = Y.shape[1]
# Compute loss from AL and Y.
logprobs = np.dot(Y, np.log(Yhat).T) + np.dot((1-Y), np.log(1-Yhat).T)
cost = (-1./m) * logprobs
cost = np.squeeze(cost) # To make sure your cost's shape is what we expect (e.g. this turns [[17]] into 17).
assert(cost.shape == ())
return cost
In [44]:
def compute_cost_test_case():
y = np.asarray([[1, 1, 1]])
al = np.array([[.8,.9,0.4]])
cost = compute_cost(al, y)
#assert(cost == 0.41493)
np.testing.assert_approx_equal(cost, 0.41493, significant=5)
return "OK"
In [45]:
compute_cost_test_case()
Out[45]:
Now we will implement the backward function for the whole network.
Just like with forward propagation, we will implement helper functions for backpropagation. Remember that backpropagation is used to calculate the gradient of the loss function with respect to the parameters. You can read more about backpropagation in my previous post.
Now, similar to forward propagation, we are going to build the backward propagation in three steps: the linear backward step, the LINEAR->ACTIVATION backward step (computing the derivative of the ReLU or sigmoid activation), and finally the backward pass for the whole [LINEAR->RELU]*(L-1) -> LINEAR -> SIGMOID model.
Reminder:
For layer $l$, the linear part is: $Z^{[l]} = W^{[l]} A^{[l-1]} + b^{[l]}$ (followed by an activation).
Now we need to compute the three derivatives $(dW^{[l]}, db^{[l]}, dA^{[l-1]})$, using as input the known derivative $dZ^{[l]} = \frac{\partial \mathcal{L} }{\partial Z^{[l]}}$:
$$ dW^{[l]} = \frac{\partial \mathcal{L} }{\partial W^{[l]}} = \frac{1}{m} dZ^{[l]} A^{[l-1] T} $$
$$ db^{[l]} = \frac{\partial \mathcal{L} }{\partial b^{[l]}} = \frac{1}{m} \sum_{i = 1}^{m} dZ^{[l](i)} $$
$$ dA^{[l-1]} = \frac{\partial \mathcal{L} }{\partial A^{[l-1]}} = W^{[l] T} dZ^{[l]} $$
In [46]:
# FUNCTION: linear_backward
def linear_backward(dZ, cache):
"""
Implement the linear portion of backward propagation for a single layer (layer l)
Arguments:
dZ -- Gradient of the cost with respect to the linear output (of current layer l)
cache -- tuple of values (A_prev, W, b) coming from the forward propagation in the current layer
Returns:
dA_prev -- Gradient of the cost with respect to the activation (of the previous layer l-1), same shape as A_prev
dW -- Gradient of the cost with respect to W (current layer l), same shape as W
db -- Gradient of the cost with respect to b (current layer l), same shape as b
"""
A_prev, W, b = cache
m = A_prev.shape[1]
dW = (1./m) * np.dot(dZ, A_prev.T)
db = (1./m) * np.sum(dZ, axis=1, keepdims=True)
dA_prev = np.dot(W.T, dZ)
assert (dA_prev.shape == A_prev.shape)
assert (dW.shape == W.shape)
assert (db.shape == b.shape)
return dA_prev, dW, db
In [47]:
def linear_backward_test_case():
np.random.seed(1)
dZ = np.random.randn(1,2)
A = np.random.randn(3,2)
W = np.random.randn(1,3)
b = np.random.randn(1,1)
linear_cache = (A, W, b)
dA_prev, dW, db = linear_backward(dZ, linear_cache)
np.testing.assert_array_almost_equal(dA_prev, [[ 0.51823, -0.19517], [-0.40506, 0.15255], [ 2.37496, -0.89445]], decimal=5)
np.testing.assert_array_almost_equal(dW, [[-0.10077, 1.40685, 1.64992]], decimal=5)
np.testing.assert_approx_equal(db, 0.50629, significant=5)
return "OK"
In [48]:
linear_backward_test_case()
Out[48]:
Next, we will create a function that merges the two helper functions: linear_backward and the backward step for the activation. The activation backward step comes in two flavours: sigmoid_backward, which implements the backward propagation for a SIGMOID unit, and relu_backward, which implements the backward propagation for a RELU unit. If $g(.)$ is the activation function, sigmoid_backward and relu_backward compute $$dZ^{[l]} = dA^{[l]} * g'(Z^{[l]}) \tag{11}$$
In [49]:
def sigmoid_backward(dA, cache):
"""
Implement the backward propagation for a single SIGMOID unit.
Arguments:
dA -- post-activation gradient, of any shape
cache -- 'Z', stored during the forward pass, used to compute the backward propagation efficiently
Returns:
dZ -- Gradient of the cost with respect to Z
"""
Z = cache
s = 1/(1+np.exp(-Z))
dZ = dA * s * (1-s)
assert (dZ.shape == Z.shape)
return dZ
In [50]:
def relu_backward(dA, cache):
"""
Implement the backward propagation for a single RELU unit.
Arguments:
dA -- post-activation gradient, of any shape
cache -- 'Z', stored during the forward pass, used to compute the backward propagation efficiently
Returns:
dZ -- Gradient of the cost with respect to Z
"""
Z = cache
dZ = np.array(dA, copy=True) # just converting dz to a correct object.
# When z <= 0, you should set dz to 0 as well.
dZ[Z <= 0] = 0
assert (dZ.shape == Z.shape)
return dZ
In [51]:
# FUNCTION: linear_activation_backward
def linear_activation_backward(dA, cache, activation):
"""
Implement the backward propagation for the LINEAR->ACTIVATION layer.
Arguments:
dA -- post-activation gradient for current layer l
cache -- tuple of values (linear_cache, activation_cache) we store for computing backward propagation efficiently
activation -- the activation to be used in this layer, stored as a text string: "sigmoid" or "relu"
Returns:
dA_prev -- Gradient of the cost with respect to the activation (of the previous layer l-1), same shape as A_prev
dW -- Gradient of the cost with respect to W (current layer l), same shape as W
db -- Gradient of the cost with respect to b (current layer l), same shape as b
"""
linear_cache, activation_cache = cache
if activation == "relu":
dZ = relu_backward(dA, activation_cache)
elif activation == "sigmoid":
dZ = sigmoid_backward(dA, activation_cache)
dA_prev, dW, db = linear_backward(dZ, linear_cache)
return dA_prev, dW, db
In [52]:
def linear_activation_backward_test_case():
np.random.seed(2)
dA = np.random.randn(1,2)
A = np.random.randn(3,2)
W = np.random.randn(1,3)
b = np.random.randn(1,1)
Z = np.random.randn(1,2)
linear_cache = (A, W, b)
activation_cache = Z
linear_activation_cache = (linear_cache, activation_cache)
dA_prev, dW, db = linear_activation_backward(dA, linear_activation_cache, activation = "sigmoid")
np.testing.assert_array_almost_equal(dA_prev, [[0.11018, 0.01105], [ 0.09467, 0.00949], [-0.05743, -0.00576]], decimal=5)
np.testing.assert_approx_equal(db, -0.057296, significant=5)
np.testing.assert_array_almost_equal(dW, [[ 0.10267, 0.09778, -0.01968]] , decimal=5)
dA_prev, dW, db = linear_activation_backward(dA, linear_activation_cache, activation = "relu")
np.testing.assert_array_almost_equal(dA_prev, [[ 0.4409, 0.], [ 0.37884, 0. ], [-0.22982, 0. ]] , decimal=5)
np.testing.assert_array_almost_equal(dW, [[ 0.44514, 0.37371, -0.10479]] , decimal=5)
np.testing.assert_approx_equal(db, -0.20838, significant=5)
return "OK"
In [53]:
linear_activation_backward_test_case()
Out[53]:
Recall that when we implemented the L_model_forward function, at each iteration we stored a cache containing (X, W, b and Z). In the back-propagation module, we will use those variables to compute the gradients. Therefore, in the L_model_backward function, we will iterate through all the hidden layers backward, starting from layer $L$. At each step, we will use the cached values for layer $l$ to backpropagate through layer $l$.
In [54]:
# FUNCTION: L_model_backward
def L_model_backward(Yhat, Y, caches):
"""
Implement the backward propagation for the [LINEAR->RELU] * (L-1) -> LINEAR -> SIGMOID group
Arguments:
Yhat -- probability vector, output of the forward propagation (L_model_forward())
Y -- true "label" vector (containing 0 if non-cat, 1 if cat)
caches -- list of caches containing:
every cache of linear_activation_forward() with "relu" (it's caches[l], for l in range(L-1) i.e l = 0...L-2)
the cache of linear_activation_forward() with "sigmoid" (it's caches[L-1])
Returns:
grads -- A dictionary with the gradients
grads["dA" + str(l)] = ...
grads["dW" + str(l)] = ...
grads["db" + str(l)] = ...
"""
grads = {}
L = len(caches) # the number of layers
m = Yhat.shape[1]
Y = Y.reshape(Yhat.shape) # after this line, Y is the same shape as AL
# Initializing the backpropagation
dAL = - (np.divide(Y, Yhat) - np.divide(1 - Y, 1 - Yhat)) # derivative of cost with respect to AL
# Lth layer (SIGMOID -> LINEAR) gradients. Inputs: "AL, Y, caches". Outputs: "grads["dAL"], grads["dWL"], grads["dbL"]
current_cache = caches[L-1]
grads["dA" + str(L)], grads["dW" + str(L)], grads["db" + str(L)] = linear_activation_backward(dAL, current_cache, activation = "sigmoid")
for l in reversed(range(L-1)):
# lth layer: (RELU -> LINEAR) gradients.
# Inputs: "grads["dA" + str(l + 2)], caches". Outputs: "grads["dA" + str(l + 1)] , grads["dW" + str(l + 1)] , grads["db" + str(l + 1)]
current_cache = caches[l]
dA_prev_temp, dW_temp, db_temp = linear_activation_backward(grads["dA"+str(l+2)], current_cache, activation = "relu")
grads["dA" + str(l + 1)] = dA_prev_temp
grads["dW" + str(l + 1)] = dW_temp
grads["db" + str(l + 1)] = db_temp
return grads
In [55]:
def L_model_backward_test_case():
np.random.seed(3)
AL = np.random.randn(1, 2)
Y = np.array([[1, 0]])
A1 = np.random.randn(4,2)
W1 = np.random.randn(3,4)
b1 = np.random.randn(3,1)
Z1 = np.random.randn(3,2)
linear_cache_activation_1 = ((A1, W1, b1), Z1)
A2 = np.random.randn(3,2)
W2 = np.random.randn(1,3)
b2 = np.random.randn(1,1)
Z2 = np.random.randn(1,2)
linear_cache_activation_2 = ((A2, W2, b2), Z2)
caches = (linear_cache_activation_1, linear_cache_activation_2)
grads = L_model_backward(AL, Y, caches)
np.testing.assert_array_almost_equal(grads["dW1"], [[ 0.4101, 0.078072, 0.13798, 0.10502], [ 0., 0., 0., 0. ], [0.05284, 0.01006, 0.01778, 0.01353]] , decimal=5)
np.testing.assert_array_almost_equal(grads["db1"], [[-0.22007], [ 0.], [-0.02835]], decimal=5)
np.testing.assert_array_almost_equal(grads["dA2"], [[ 0.12913, -0.44014], [-0.14176,0.48317], [0.01664, -0.05671]], decimal=5)
return "OK"
In [56]:
L_model_backward_test_case()
Out[56]:
In this section we will update the parameters of the model, using gradient descent:
$$ W^{[l]} = W^{[l]} - \alpha \text{ } dW^{[l]} \tag{16}$$
$$ b^{[l]} = b^{[l]} - \alpha \text{ } db^{[l]} \tag{17}$$
where $\alpha$ is the learning rate. After computing the updated parameters, we store them in the parameters dictionary.
In [57]:
# FUNCTION: update_parameters
def update_parameters(parameters, grads, learning_rate):
"""
Update parameters using gradient descent
Arguments:
parameters -- python dictionary containing your parameters
grads -- python dictionary containing your gradients, output of L_model_backward
learning_rate -- the learning rate, a positive scalar
Returns:
parameters -- python dictionary containing your updated parameters
parameters["W" + str(l)] = ...
parameters["b" + str(l)] = ...
"""
L = len(parameters) // 2 # number of layers in the neural network
# Update rule for each parameter. Use a for loop.
for l in range(L):
parameters["W"+str(l+1)] = parameters["W"+str(l+1)] - learning_rate * grads["dW" + str(l+1)]
parameters["b"+str(l+1)] = parameters["b"+str(l+1)] - learning_rate * grads["db" + str(l+1)]
return parameters
In [58]:
def update_parameters_test_case():
np.random.seed(2)
W1 = np.random.randn(3,4)
b1 = np.random.randn(3,1)
W2 = np.random.randn(1,3)
b2 = np.random.randn(1,1)
parameters = {"W1": W1,
"b1": b1,
"W2": W2,
"b2": b2}
np.random.seed(3)
dW1 = np.random.randn(3,4)
db1 = np.random.randn(3,1)
dW2 = np.random.randn(1,3)
db2 = np.random.randn(1,1)
grads = {"dW1": dW1,
"db1": db1,
"dW2": dW2,
"db2": db2}
parameters = update_parameters(parameters, grads, 0.1)
np.testing.assert_array_almost_equal(parameters["W1"], [[-0.59562, -0.09992, -2.14585, 1.82662], [-1.7657, -0.80627, 0.51116, -1.18259,], [-1.05357, -0.86129, 0.68284, 2.20375]] , decimal=5)
np.testing.assert_array_almost_equal(parameters["b1"], [[-0.04659], [-1.28888], [ 0.53405]], decimal=5)
np.testing.assert_array_almost_equal(parameters["W2"], [[-0.55569, 0.0354, 1.32964]], decimal=5)
np.testing.assert_array_almost_equal(parameters["b2"], [[-0.8461]], decimal=5)
return "OK"
In [59]:
update_parameters_test_case()
Out[59]:
In [60]:
nn_layers
Out[60]:
In [61]:
# FUNCTION: L_layer_model
def L_layer_model(X, Y, layers_dims, learning_rate = 0.0075, num_iterations = 3000, print_cost=False):
"""
Implements a L-layer neural network: [LINEAR->RELU]*(L-1)->LINEAR->SIGMOID.
Arguments:
X -- data, numpy array of shape (num_px * num_px * 3, number of examples)
Y -- true "label" vector (containing 0 if non-cat, 1 if cat), of shape (1, number of examples)
layers_dims -- list containing the input size and each layer size, of length (number of layers + 1).
learning_rate -- learning rate of the gradient descent update rule
num_iterations -- number of iterations of the optimization loop
print_cost -- if True, it prints the cost every 100 steps
Returns:
parameters -- parameters learnt by the model. They can then be used to predict.
"""
costs = [] # keep track of cost
# Parameters initialization.
parameters = initialise_parameters(layers_dims)
# Loop (gradient descent)
for i in range(0, num_iterations):
# Forward propagation: [LINEAR -> RELU]*(L-1) -> LINEAR -> SIGMOID.
AL, caches = L_model_forward(X, parameters)
# Compute cost.
cost = compute_cost(AL, Y)
# Backward propagation.
grads = L_model_backward(AL, Y, caches)
# Update parameters.
parameters = update_parameters(parameters, grads, learning_rate)
# Print and record the cost every 100 iterations
if print_cost and i % 100 == 0:
print ("Cost after iteration %i: %f" %(i, cost))
if print_cost and i % 100 == 0:
costs.append(cost)
# plot the cost
plt.plot(np.squeeze(costs))
plt.ylabel('cost')
plt.xlabel('iterations (per hundreds)')
plt.title("Learning rate =" + str(learning_rate))
plt.show()
return parameters
You will now train the model as a 5-layer neural network. Run the cell below to train your model. The cost should decrease on every iteration. It may take up to 5 minutes to run 2500 iterations. If the cost is not decreasing, then you can click on the square (⬛) on the upper bar of the notebook to stop the cell and try to find your error.
In [62]:
np.random.seed(1)
In [63]:
fit_params = L_layer_model(train_set_x, train_set_y, nn_layers, num_iterations = 2500, print_cost = True)
In [64]:
def predict(X, y, parameters):
"""
This function is used to predict the results of a L-layer neural network.
Arguments:
X -- data set of examples you would like to label
y -- true "label" vector, used only to print the accuracy
parameters -- parameters of the trained model
Returns:
p -- predictions for the given dataset X
"""
m = X.shape[1]
n = len(parameters) // 2 # number of layers in the neural network
p = np.zeros((1,m))
# Forward propagation
probas, caches = L_model_forward(X, parameters)
# convert probs to 0/1 predictions
for i in range(0, probas.shape[1]):
if probas[0,i] > 0.5:
p[0,i] = 1
else:
p[0,i] = 0
# print results
print("Accuracy: " + str(np.sum((p == y)/m)))
return p
In [65]:
pred_train = predict(train_set_x, train_set_y, fit_params)
In [66]:
pred_test = predict(test_set_x, test_set_y, fit_params)
Congrats! It seems that this 5-layer neural network has better performance (80%) than the logistic regression model (72%) on the same test set.
This is good performance for this task.
Even higher accuracy could be obtained by systematically searching for better hyperparameters (learning_rate, layers_dims, num_iterations) and by using other techniques such as regularisation, which we will see in the next notebook.
With logistic regression we had 14 misses in the test set, out of 50.
Let's check how many we have now and which ones they are.
In [67]:
misses = print_mislabeled_images(classes, test_set_x, test_set_y, pred_test[0])
In [68]:
print("Number of misses: {0}".format(misses))
In [71]:
fname
Out[71]:
In [72]:
# Remember that my_image contains the processed image, previously prepared
my_predicted_image = predict(my_image, [0], fit_params)
plt.imshow(image)
print("y = " + str(np.squeeze(my_predicted_image)) + ", the algorithm predicts a \"" + classes[int(np.squeeze(my_predicted_image)),].decode("utf-8") + "\" picture.")
Yes, now it classifies it as a cat ! (which I think is correct: it's just a cat with long ears ...)