by Alejandro Correa Bahnsen and Jesus Solano
version 1.4, May 2019
This notebook is licensed under a Creative Commons Attribution-ShareAlike 3.0 Unported License. Special thanks go to Valerio Maggio, Fondazione Bruno Kessler.
Deep learning allows computational models that are composed of multiple processing layers to learn representations of data with multiple levels of abstraction.
These methods have dramatically improved the state-of-the-art in speech recognition, visual object recognition, object detection and many other domains such as drug discovery and genomics.
Google DeepMind: Ground-breaking AlphaGo masters the game of Go
Google Assistant: An artificial intelligence-powered virtual assistant
Research at NVIDIA: The First Interactive AI Rendered Virtual World
In machine learning and cognitive science, an artificial neural network (ANN) is a network inspired by biological neural networks. It is used to estimate or approximate functions that can depend on a large number of inputs, which are generally unknown.
An ANN is built from nodes (neurons) stacked in layers between the feature vector and the target vector.
A node in a neural network is built from weights and an activation function.
An early version of an ANN built from a single node was called the Perceptron.
The Perceptron is an algorithm for supervised learning of binary classifiers: functions that can decide whether an input (represented by a vector of numbers) belongs to one class or another.
Much like in logistic regression, the weights in a neural net are multiplied by the input vector, summed up, and fed into the activation function.
A Perceptron network can be designed to have multiple layers, leading to the Multi-Layer Perceptron (aka MLP).
(Source: Python Machine Learning, S. Raschka)
In order to find the optimal weights of the model, we optimized an objective function, e.g. the Sum of Squared Errors (SSE) cost function $J(w)$.
In other words, we computed the gradient based on the whole training set and updated the weights of the model by taking a step in the opposite direction of the gradient $\nabla J(w)$.
Furthermore, we multiply the gradient by a factor, the learning rate $\eta$ , which we choose carefully to balance the speed of learning against the risk of overshooting the global minimum of the cost function.
In gradient descent optimization, we update all the weights simultaneously after each epoch, and we define the partial derivative for each weight $w_j$ in the weight vector $w$ as follows:
$$ \frac{\partial}{\partial w_j} J(w) = \sum_{i} ( y^{(i)} - a^{(i)} ) x^{(i)}_j $$
Note: The superscript $(i)$ refers to the $i$-th sample. The subscript $j$ refers to the $j$-th dimension/feature.
Here $y^{(i)}$ is the target class label of a particular sample $x^{(i)}$, and $a^{(i)}$ is the activation of the neuron (which is a linear function in the special case of the Perceptron).
We define the activation function $\phi(\cdot)$ as follows:
$$ \phi(z) = z = a = \sum_{j} w_j x_j = \mathbf{w}^T \mathbf{x} $$
While we used the activation $\phi(z)$ to compute the gradient update, we may use a threshold function (Heaviside function) to squash the continuous-valued output into binary class labels for prediction:
$$ \hat{y} = \begin{cases} 1 & \text{if } \phi(z) \geq 0 \\ 0 & \text{otherwise} \end{cases} $$
We will build the neural networks from first principles. We will create a very simple model and understand how it works. We will also implement the backpropagation algorithm.
Please note that this code is not optimized and not to be used in production.
This is for instructive purposes, to help us understand how an ANN works.
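To make the update rule above concrete, here is a minimal NumPy sketch (not part of the notebook's implementation below) of batch gradient descent for a single linear-activation neuron; the toy data and the -1/+1 label coding are hypothetical choices so that the zero-threshold prediction rule applies directly.
import numpy as np

rng = np.random.RandomState(0)
X = rng.randn(200, 2)                           # 200 toy samples, 2 features (hypothetical data)
y = np.where(X[:, 0] + X[:, 1] > 0, 1.0, -1.0)  # toy labels, coded as -1/+1

w = np.zeros(2)    # weights
b = 0.0            # bias
eta = 0.001        # learning rate

for epoch in range(100):
    a = X.dot(w) + b           # linear activation phi(z) = z for every sample
    error = y - a              # (y^(i) - a^(i))
    w += eta * X.T.dot(error)  # delta w_j = eta * sum_i (y^(i) - a^(i)) x_j^(i)
    b += eta * error.sum()     # same update for the bias term

y_hat = np.where(X.dot(w) + b >= 0, 1, -1)       # Heaviside threshold for prediction
print("training accuracy:", (y_hat == y).mean())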
Libraries like theano have highly optimized code.
If you want a sneak peek at an alternate (production-ready) implementation of the Perceptron, for instance, try:
from sklearn.linear_model import Perceptron
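As a quick illustration (not from the original notebook; the blob data below is synthetic and made up purely for this example), the scikit-learn Perceptron could be used like this:
from sklearn.linear_model import Perceptron
from sklearn.datasets import make_blobs  # synthetic toy data, only for illustration

X_toy, y_toy = make_blobs(n_samples=200, centers=2, random_state=0)
clf = Perceptron(max_iter=1000, tol=1e-3, random_state=0)
clf.fit(X_toy, y_toy)
print("mean accuracy on the toy data:", clf.score(X_toy, y_toy))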
(Source: Python Machine Learning, S. Raschka)
Now we will see how to connect multiple single neurons to a multi-layer feedforward neural network; this special type of network is also called a multi-layer perceptron (MLP).
The figure shows the concept of an MLP consisting of three layers: one input layer, one hidden layer, and one output layer.
The units in the hidden layer are fully connected to the input layer, and the output layer is fully connected to the hidden layer.
If such a network has more than one hidden layer, we also call it a deep artificial neural network.
We denote the $i$th activation unit in the $l$th layer as $a_i^{(l)}$, and the activation units $a_0^{(1)}$ and $a_0^{(2)}$ are the bias units, which we set equal to $1$.
The activation of the units in the input layer is just their input plus the bias unit:
Note: $x_j^{(i)}$ refers to the jth feature/dimension of the ith sample
The terminology around the indices (subscripts and superscripts) may look a little bit confusing at first.
You may wonder why we wrote $w_{j,k}^{(l)}$ and not $w_{k,j}^{(l)}$ to refer to
the weight coefficient that connects the kth unit in layer $l$ to the jth unit in layer $l+1$.
What may seem a little bit quirky at first will make much more sense later when we vectorize the neural network representation.
For example, we will summarize the weights that connect the input and hidden layer by a matrix $$ W^{(1)} \in \mathbb{R}^{h×[m+1]}$$
where $h$ is the number of hidden units and $m + 1$ is the number of input units plus the bias unit.
(Source: Python Machine Learning, S. Raschka)
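As a quick sanity check of this convention (the sizes $m$ and $h$ below are hypothetical), the shapes work out as follows; note that the implementation later in this notebook keeps a separate bias vector instead of absorbing the bias column into $W^{(1)}$.
import numpy as np

m, h = 4, 3                                   # hypothetical: 4 input features, 3 hidden units
W1 = np.random.randn(h, m + 1) * 0.01         # W^(1) in R^{h x (m+1)}; row j holds the weights into hidden unit j
x_with_bias = np.r_[1.0, np.random.randn(m)]  # prepend the bias unit a_0^(1) = 1 to one input vector
z_hidden = W1.dot(x_with_bias)                # net input of the hidden layer
print(W1.shape, z_hidden.shape)               # (3, 5) (3,)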
Starting at the input layer, we forward propagate the patterns of the training data through the network to generate an output.
Based on the network's output, we calculate the error that we want to minimize using a cost function that we will describe later.
We backpropagate the error, find its derivative with respect to each weight in the network, and update the model.
(Source: Python Machine Learning, S. Raschka)
The weights of each neuron are learned by gradient descent, where the error is differentiated with respect to each neuron's weights.
(Source: Python Machine Learning, S. Raschka)
Optimization is done for each layer with respect to the previous layer, in a technique known as backpropagation.
The weights are learned by gradient descent; the rule to update a parameter is $ \theta = \theta - \alpha \frac{\partial J }{ \partial \theta }$, where $\alpha$ is the learning rate and $\theta$ represents a parameter.
The gradient descent algorithm with a good learning rate (converging) and a bad learning rate (diverging). Images courtesy of Adam Harley.
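Since the images are not reproduced here, the following tiny sketch (purely illustrative numbers) shows the same effect on the one-dimensional cost $J(\theta) = \theta^2$:
def run_gd(alpha, steps=15, theta=5.0):
    """Plain gradient descent on J(theta) = theta**2, whose gradient is 2*theta."""
    for _ in range(steps):
        theta = theta - alpha * 2 * theta
    return theta

print("good learning rate (0.1):", run_gd(0.1))  # shrinks toward the minimum at theta = 0
print("bad learning rate  (1.1):", run_gd(1.1))  # overshoots and diverges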
(The following code is inspired by these terrific notebooks.)
In [1]:
# Import the required packages
import numpy as np
import pandas as pd
import matplotlib
import matplotlib.pyplot as plt
import scipy
In [2]:
# Package imports
import numpy as np
import matplotlib.pyplot as plt
import sklearn
import sklearn.datasets
import sklearn.linear_model
You are going to train a Neural Network with a single hidden layer.
Here is our model:
Mathematically:
For one example $x^{(i)}$: $$z^{[1] (i)} = W^{[1]} x^{(i)} + b^{[1]}\tag{1}$$ $$a^{[1] (i)} = \tanh(z^{[1] (i)})\tag{2}$$ $$z^{[2] (i)} = W^{[2]} a^{[1] (i)} + b^{[2]}\tag{3}$$ $$\hat{y}^{(i)} = a^{[2] (i)} = \sigma(z^{ [2] (i)})\tag{4}$$ $$y^{(i)}_{prediction} = \begin{cases} 1 & \mbox{if } a^{[2](i)} > 0.5 \\ 0 & \mbox{otherwise } \end{cases}\tag{5}$$
Given the predictions on all the examples, you can also compute the cost $J$ as follows: $$J = - \frac{1}{m} \sum\limits_{i = 0}^{m} \large\left(\small y^{(i)}\log\left(a^{[2] (i)}\right) + (1-y^{(i)})\log\left(1- a^{[2] (i)}\right) \large \right) \small \tag{6}$$
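To see equation (6) in action before implementing it, here is a tiny sketch on made-up activations and labels (the numbers are hypothetical; this assumes the import cells below have been run).
A2_toy = np.array([[0.9, 0.2, 0.7, 0.4]])  # hypothetical outputs a^[2](i) for m = 4 examples
Y_toy  = np.array([[1,   0,   1,   0  ]])  # hypothetical true labels
m = Y_toy.shape[1]

logprobs = Y_toy * np.log(A2_toy) + (1 - Y_toy) * np.log(1 - A2_toy)
cost = -np.sum(logprobs) / m
print(cost)  # roughly 0.30: small, because every prediction is on the right side of 0.5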
Reminder: The general methodology to build a Neural Network is to:
1. Define the neural network structure ( # of input units, # of hidden units, etc).
2. Initialize the model's parameters
3. Loop:
- Implement forward propagation
- Compute loss
- Implement backward propagation to get the gradients
- Update parameters (gradient descent)
You often build helper functions to compute steps 1-3 and then merge them into one function we call nn_model(). Once you've built nn_model() and learnt the right parameters, you can make predictions on new data.
In [3]:
def layer_sizes(X, Y, hidden_neurons):
"""
Arguments:
X -- input dataset of shape (input size, number of examples)
    Y -- labels of shape (output size, number of examples)
    hidden_neurons -- number of neurons in the hidden layer
Returns:
n_x -- the size of the input layer
n_h -- the size of the hidden layer
n_y -- the size of the output layer
"""
n_x = X.shape[0] # size of input layer
n_h = hidden_neurons
n_y = Y.shape[0] # size of output layer
return (n_x, n_h, n_y)
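A quick usage check with made-up shapes (this assumes the import cells and the cell above have been run):
# Hypothetical arrays: 2 input features, 1 output, 300 examples, 5 hidden neurons
X_check = np.zeros((2, 300))
Y_check = np.zeros((1, 300))
print(layer_sizes(X_check, Y_check, hidden_neurons=5))  # (2, 5, 1)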
Exercise: Implement the function initialize_parameters().
Instructions:
- Use np.random.randn(a,b) * 0.01 to randomly initialize a matrix of shape (a,b).
- Use np.zeros((a,b)) to initialize a matrix of shape (a,b) with zeros.
In [4]:
# Solved Exercise: initialize_parameters
def initialize_parameters(n_x, n_h, n_y):
"""
Argument:
n_x -- size of the input layer
n_h -- size of the hidden layer
n_y -- size of the output layer
Returns:
params -- python dictionary containing your parameters:
W1 -- weight matrix of shape (n_h, n_x)
b1 -- bias vector of shape (n_h, 1)
W2 -- weight matrix of shape (n_y, n_h)
b2 -- bias vector of shape (n_y, 1)
"""
np.random.seed(2) # we set up a seed so that your output matches ours although the initialization is random.
W1 = np.random.randn(n_h,n_x) * 0.01
b1 = np.zeros(shape=(n_h,1))
W2 = np.random.randn(n_y,n_h) * 0.01
b2 = np.zeros(shape=(n_y,1))
assert (W1.shape == (n_h, n_x))
assert (b1.shape == (n_h, 1))
assert (W2.shape == (n_y, n_h))
assert (b2.shape == (n_y, 1))
parameters = {"W1": W1,
"b1": b1,
"W2": W2,
"b2": b2}
return parameters
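A quick shape check with hypothetical layer sizes (assuming the cell above has been run):
params_check = initialize_parameters(n_x=2, n_h=4, n_y=1)
for name, value in params_check.items():
    print(name, value.shape)  # W1 (4, 2), b1 (4, 1), W2 (1, 4), b2 (1, 1)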
Question: Implement forward_propagation().
Instructions:
- You can use the function sigmoid(); it is defined in the notebook (see the cell below).
- You can use np.tanh(); it is part of the numpy library.
- Retrieve each parameter from the dictionary "parameters" (the output of initialize_parameters()) by using parameters[".."].
- The values needed in backpropagation are stored in "cache"; the cache will be given as an input to the backpropagation function.
In [5]:
def forward_propagation(X, parameters):
"""
Argument:
X -- input data of size (n_x, m)
parameters -- python dictionary containing your parameters (output of initialization function)
Returns:
A2 -- The sigmoid output of the second activation
cache -- a dictionary containing "Z1", "A1", "Z2" and "A2"
"""
# Retrieve each parameter from the dictionary "parameters"
W1 = parameters["W1"]
b1 = parameters["b1"]
W2 = parameters["W2"]
b2 = parameters["b2"]
# Implement Forward Propagation to calculate A2 (probabilities)
Z1 = np.dot(W1,X)+b1
A1 = np.tanh(Z1)
Z2 = np.dot(W2,A1)+b2
A2 = sigmoid(Z2)
assert(A2.shape == (1, X.shape[1]))
cache = {"Z1": Z1,
"A1": A1,
"Z2": Z2,
"A2": A2}
return A2, cache
def sigmoid(x):
"""
Compute the sigmoid of x
Arguments:
x -- A scalar or numpy array of any size.
Return:
s -- sigmoid(x)
"""
s = 1/(1+np.exp(-x))
return s
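A quick check on hypothetical inputs (assuming the cells above have been run):
X_check = np.random.randn(2, 3)           # 2 features, 3 made-up examples
params_check = initialize_parameters(2, 4, 1)
A2_check, cache_check = forward_propagation(X_check, params_check)
print(A2_check.shape)                     # (1, 3): one probability per example
print(sorted(cache_check.keys()))         # ['A1', 'A2', 'Z1', 'Z2']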
Now that you have computed $A^{[2]}$ (in the Python variable "A2"), which contains $a^{[2](i)}$ for every example, you can compute the cost function as follows.
Cost function: Implement compute_cost() to compute the value of the cost $J$.
Instructions: there are many ways to implement the cross-entropy cost. For example, the sum over $y^{(i)}\log(a^{[2](i)})$ can be computed without a for loop:
logprobs = np.multiply(np.log(A2), Y)
cost = - np.sum(logprobs)  # no need to use a for loop!
(You can use either np.multiply() and then np.sum(), or directly np.dot().)
In [6]:
def compute_cost(A2, Y, parameters):
"""
    Computes the cross-entropy cost given in equation (6)
Arguments:
A2 -- The sigmoid output of the second activation, of shape (1, number of examples)
Y -- "true" labels vector of shape (1, number of examples)
parameters -- python dictionary containing your parameters W1, b1, W2 and b2
Returns:
    cost -- cross-entropy cost given by equation (6)
"""
m = Y.shape[1] # number of example
# Compute the cross-entropy cost
logprobs = np.multiply(Y,np.log(A2)) + np.multiply(1-Y,np.log(1-A2))
cost = -1/m * np.sum(logprobs)
cost = np.squeeze(cost) # makes sure cost is the dimension we expect.
# E.g., turns [[17]] into 17
assert(isinstance(cost, float))
return cost
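A quick check with made-up values (assuming the cell above has been run); note that the parameters argument is not actually used inside the function:
A2_check = np.array([[0.9, 0.2, 0.7, 0.4]])  # hypothetical activations
Y_check  = np.array([[1,   0,   1,   0  ]])  # hypothetical labels
print(compute_cost(A2_check, Y_check, parameters=None))  # roughly 0.30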
Using the cache computed during forward propagation, you can now implement backward propagation.
Backward Propagation: Implement the function backward_propagation()
.
Instructions: Backpropagation is usually the hardest (most mathematical) part of deep learning, and you are building a vectorized implementation. A useful shortcut: since the hidden layer uses the tanh activation and $\frac{d}{dz}\tanh(z) = 1 - \tanh^2(z)$, the derivative term for the hidden layer can be computed as (1 - np.power(A1, 2)).
In [7]:
def backward_propagation(parameters, cache, X, Y):
"""
Implement the backward propagation using the instructions above.
Arguments:
parameters -- python dictionary containing our parameters
cache -- a dictionary containing "Z1", "A1", "Z2" and "A2".
X -- input data of shape (2, number of examples)
Y -- "true" labels vector of shape (1, number of examples)
Returns:
grads -- python dictionary containing your gradients with respect to different parameters
"""
m = X.shape[1]
# First, retrieve W1 and W2 from the dictionary "parameters".
W1 = parameters["W1"]
W2 = parameters["W2"]
# Retrieve also A1 and A2 from dictionary "cache".
A1 = cache["A1"]
A2 = cache["A2"]
# Backward propagation: calculate dW1, db1, dW2, db2.
dZ2 = A2 - Y
dW2 = 1/m * np.dot(dZ2,A1.T)
db2 = 1/m*np.sum(dZ2,axis=1,keepdims=True)
dZ1 = np.dot(W2.T,dZ2) * (1 - np.power(A1,2))
dW1 = 1/m* np.dot(dZ1,X.T)
db1 = 1/m*np.sum(dZ1,axis=1,keepdims=True)
grads = {"dW1": dW1,
"db1": db1,
"dW2": dW2,
"db2": db2}
return grads
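A common sanity check, not part of the original exercise, is to compare one analytic gradient with a centered finite-difference estimate of the cost; here is a short sketch on hypothetical data (assuming the cells above have been run).
# Compare dW2[0, 0] from backward_propagation with a numerical estimate.
np.random.seed(1)
X_check = np.random.randn(2, 5)
Y_check = (np.random.rand(1, 5) > 0.5).astype(float)
params_check = initialize_parameters(2, 4, 1)

A2_check, cache_check = forward_propagation(X_check, params_check)
grads_check = backward_propagation(params_check, cache_check, X_check, Y_check)

eps = 1e-6
params_plus  = {k: v.copy() for k, v in params_check.items()}
params_minus = {k: v.copy() for k, v in params_check.items()}
params_plus["W2"][0, 0]  += eps
params_minus["W2"][0, 0] -= eps
cost_plus  = compute_cost(forward_propagation(X_check, params_plus)[0],  Y_check, params_plus)
cost_minus = compute_cost(forward_propagation(X_check, params_minus)[0], Y_check, params_minus)

numeric_grad = (cost_plus - cost_minus) / (2 * eps)
print(numeric_grad, grads_check["dW2"][0, 0])  # the two numbers should agree closely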
Implement the update rule. Use gradient descent. You have to use (dW1, db1, dW2, db2) in order to update (W1, b1, W2, b2).
General gradient descent rule: $ \theta = \theta - \alpha \frac{\partial J }{ \partial \theta }$ where $\alpha$ is the learning rate and $\theta$ represents a parameter.
Illustration: The gradient descent algorithm with a good learning rate (converging) and a bad learning rate (diverging). Images courtesy of Adam Harley.
In [8]:
def update_parameters(parameters, grads, learning_rate = 1.2):
"""
Updates parameters using the gradient descent update rule given above
Arguments:
parameters -- python dictionary containing your parameters
    grads -- python dictionary containing your gradients
    learning_rate -- learning rate used in the gradient descent update
Returns:
parameters -- python dictionary containing your updated parameters
"""
# Retrieve each parameter from the dictionary "parameters"
W1 = parameters["W1"]
b1 = parameters["b1"]
W2 = parameters["W2"]
b2 = parameters["b2"]
# Retrieve each gradient from the dictionary "grads"
dW1 = grads["dW1"]
db1 = grads["db1"]
dW2 = grads["dW2"]
db2 = grads["db2"]
# Update rule for each parameter
W1 = W1 - learning_rate * dW1
b1 = b1 - learning_rate * db1
W2 = W2 - learning_rate * dW2
b2 = b2 - learning_rate * db2
parameters = {"W1": W1,
"b1": b1,
"W2": W2,
"b2": b2}
return parameters
In [9]:
def nn_model(X, Y, n_h, num_iterations = 10000, print_cost=False):
"""
Arguments:
X -- dataset of shape (2, number of examples)
Y -- labels of shape (1, number of examples)
n_h -- size of the hidden layer
num_iterations -- Number of iterations in gradient descent loop
print_cost -- if True, print the cost every 1000 iterations
Returns:
parameters -- parameters learnt by the model. They can then be used to predict.
"""
np.random.seed(3)
n_x = layer_sizes(X, Y,n_h)[0]
n_y = layer_sizes(X, Y,n_h)[2]
# Initialize parameters, then retrieve W1, b1, W2, b2. Inputs: "n_x, n_h, n_y". Outputs = "W1, b1, W2, b2, parameters".
parameters = initialize_parameters(n_x,n_h,n_y)
W1 = parameters["W1"]
b1 = parameters["b1"]
W2 = parameters["W2"]
b2 = parameters["b2"]
# Loop (gradient descent)
for i in range(0, num_iterations):
# Forward propagation. Inputs: "X, parameters". Outputs: "A2, cache".
A2, cache = forward_propagation(X, parameters)
# Cost function. Inputs: "A2, Y, parameters". Outputs: "cost".
cost = compute_cost(A2,Y,parameters)
# Backpropagation. Inputs: "parameters, cache, X, Y". Outputs: "grads".
grads = backward_propagation(parameters,cache,X,Y)
# Gradient descent parameter update. Inputs: "parameters, grads". Outputs: "parameters".
parameters = update_parameters(parameters,grads)
# Print the cost every 1000 iterations
if print_cost and i % 1000 == 0:
print ("Cost after iteration %i: %f" %(i, cost))
return parameters
Use your model to predict by building predict(). Use forward propagation to predict results.
Reminder: predictions = $y_{prediction} = \mathbb 1 \text{{activation > 0.5}} = \begin{cases} 1 & \text{if}\ activation > 0.5 \\ 0 & \text{otherwise} \end{cases}$
As an example, if you would like to set the entries of a matrix X to 0 and 1 based on a threshold you would do: X_new = (X > threshold)
In [10]:
def predict(parameters, X):
"""
Using the learned parameters, predicts a class for each example in X
Arguments:
parameters -- python dictionary containing your parameters
X -- input data of size (n_x, m)
Returns
predictions -- vector of predictions of our model (red: 0 / blue: 1)
"""
# Computes probabilities using forward propagation, and classifies to 0/1 using 0.5 as the threshold.
A2, cache = forward_propagation(X,parameters)
predictions = A2 > 0.5
return predictions
It is time to run the model and see how it performs on a planar dataset. Run the following code to test your model with a single hidden layer of $n_h$ hidden units.
In [11]:
from sklearn.datasets import make_circles
x_train, y_train = make_circles(n_samples=1000, noise= 0.05, random_state=3)
plt.figure(figsize=(15, 10))
plt.scatter(x_train[:, 0], x_train[:,1], c=y_train, s=40, cmap=plt.cm.Spectral);
In [12]:
def plot_decision_boundary(model, X, y):
plt.figure(figsize=(15,10))
# Set min and max values and give it some padding
x_min, x_max = X[0, :].min() - 1, X[0, :].max() + 1
y_min, y_max = X[1, :].min() - 1, X[1, :].max() + 1
h = 0.01
# Generate a grid of points with distance h between them
xx, yy = np.meshgrid(np.arange(x_min, x_max, h), np.arange(y_min, y_max, h))
# Predict the function value for the whole grid
Z = model(np.c_[xx.ravel(), yy.ravel()])
Z = Z.reshape(xx.shape)
# Plot the contour and training examples
plt.contourf(xx, yy, Z, cmap=plt.cm.Spectral)
plt.ylabel('x2')
plt.xlabel('x1')
plt.scatter(X[0, :], X[1, :], c=y.ravel(), s=80, cmap=plt.cm.Spectral)
In [13]:
# Build a model with a n_h-dimensional hidden layer
parameters = nn_model(x_train.T, np.array([y_train.T]), n_h = 5, num_iterations = 10000, print_cost=True)
# Plot the decision boundary
plot_decision_boundary(lambda x: predict(parameters, x.T), x_train.T, np.array([y_train.T]))
plt.title("Decision Boundary for hidden layer size " + str(4))
Out[13]:
Expected Output:
**Cost after iteration 9000**: 0.218607
In [14]:
# Print accuracy
predictions = predict(parameters, x_train.T)
print ('Accuracy: %d' % float((np.dot( np.array([y_train.T]),predictions.T) + np.dot(1- np.array([y_train.T]),1-predictions.T))/float( np.array([y_train.T]).size)*100) + '%')
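The same number can be obtained more directly by comparing the boolean predictions with the labels; this shorter formulation is equivalent:
# predictions is a boolean array of shape (1, n_samples)
accuracy = np.mean(predictions == np.array([y_train]))
print('Accuracy: %.1f%%' % (accuracy * 100))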