The backpropagation algorithm was originally introduced in the 1970s, but its importance wasn't fully appreciated until a famous 1986 paper by David Rumelhart, Geoffrey Hinton, and Ronald Williams. (Michael Nielsen in "Neural Networks and Deep Learning", http://neuralnetworksanddeeplearning.com/chap2.html).
Backpropagation is the key algorithm that makes training deep models computationally tractable. For modern neural networks, it can make training with gradient descent as much as ten million times faster, relative to a naive implementation. That’s the difference between a model taking a week to train and taking 200,000 years. (Christopher Olah, 2016)
We have seen that in order to optimize our models we need to compute the derivative of the loss function with respect to all model parameters.
The computation of derivatives in computer models is addressed by four main methods:
- manually working out the derivatives and coding them;
- numerical differentiation, i.e. finite-difference approximations (see the sketch below);
- symbolic differentiation, as performed by computer algebra systems;
- automatic differentiation (AD), the approach discussed here.
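To see why the numerical approach alone is usually not enough, here is a minimal sketch (the test function $\sin$ and the step sizes are arbitrary illustrative choices) of a central finite-difference approximation: its error is limited first by truncation and then, for small steps, by floating-point round-off, whereas AD evaluates derivatives exactly up to machine precision.
In [ ]:
import numpy as np

def central_difference(f, x, h):
    # O(h^2) central finite-difference approximation of f'(x)
    return (f(x + h) - f(x - h)) / (2 * h)

x0 = 1.0
exact = np.cos(x0)   # d/dx sin(x) = cos(x)
for h in [1e-2, 1e-5, 1e-8, 1e-11]:
    approx = central_difference(np.sin, x0, h)
    # the error first shrinks (truncation) and then grows again (round-off)
    print("h = {:.0e}   error = {:.2e}".format(h, abs(approx - exact)))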
Automatic differentiation (AD) works by systematically applying the chain rule of differential calculus at the elementary operator level.
Let $ y = f(g(x)) $ be our target function. In its basic form, the chain rule states:
$$ \frac{\partial y}{\partial x} = \frac{\partial y}{\partial g} \frac{\partial g}{\partial x} $$or, if there is more than one intermediate variable $g_i$ between $y$ and $x$ (e.g. if $f$ is a two-dimensional function such as $f(g_1(x), g_2(x))$), then:
$$ \frac{\partial y}{\partial x} = \sum_i \frac{\partial y}{\partial g_i} \frac{\partial g_i}{\partial x} $$See https://www.math.hmc.edu/calculus/tutorials/multichainrule/
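For instance, taking $f(g_1, g_2) = g_1 g_2$ with $g_1(x) = x^2$ and $g_2(x) = \sin x$ (an arbitrary illustrative choice), the multivariable chain rule gives:
$$\frac{\partial y}{\partial x} = \frac{\partial y}{\partial g_1}\frac{\partial g_1}{\partial x} + \frac{\partial y}{\partial g_2}\frac{\partial g_2}{\partial x} = g_2 \cdot 2x + g_1 \cdot \cos x = 2x \sin x + x^2 \cos x,$$which is indeed the derivative of $y = x^2 \sin x$.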
AD allows the accurate evaluation of derivatives at machine precision, with only a small constant factor of overhead.
In its most basic description, AD relies on the fact that all numerical computations are ultimately compositions of a finite set of elementary operations for which derivatives are known.
For example, let's consider this function:
$$y = f(x_1, x_2) = \ln(x_1) + x_1 x_2 - \sin(x_2)$$The evaluation of this expression can be described by introducing intermediate variables $v_i$, one for each elementary operation.
Let's write the forward evaluation trace, based on elementary operations, at $(x_1, x_2) = (2,5)$:
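A minimal Python sketch of this trace is given below (the numbering $v_1, \dots, v_5$ of the intermediate variables is just one illustrative choice; AD tools build this decomposition automatically):
In [ ]:
import numpy as np

x1, x2 = 2.0, 5.0        # evaluation point

v1 = np.log(x1)          # ln(x1)
v2 = x1 * x2             # x1 * x2
v3 = np.sin(x2)          # sin(x2)
v4 = v1 + v2
v5 = v4 - v3             # y = f(x1, x2)

print(v1, v2, v3, v4, v5)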
It is interesting to note that this trace can be represented by a graph that can be automatically derived from $f(x_1, x_2)$.
Given a function made up of several nested function calls, there are several ways to compute its derivative.
For example, given $L(x) = f(g(h(x)))$, the chain rule says that its gradient is:
$$ \frac{\partial L}{\partial x} = \frac{\partial f}{\partial g} \times \frac{\partial g}{\partial h} \times \frac{\partial h}{\partial x}$$If we evaluate this product from right-to-left: $\frac{\partial f}{\partial g} \times (\frac{\partial g}{\partial h} \times \frac{\partial h}{\partial x})$, the same order in which the computations themselves were performed, this is called forward-mode differentiation.
For computing the derivative of $f$ with respect to $x_1$ we start by associating with each intermediate variable $v_i$ a tangent $\dot{v}_i = \frac{\partial v_i}{\partial x_1}$.
Then we apply the chain rule to each elementary operation:
At the end we have the derivative of $f$ with respect to $x_1$ at $(2,5)$.
It is important to note that this computation can be performed locally at each node $v_i$ of the graph if we keep, for each of its parent nodes, both the value and the tangent, and if we know the derivative of the elementary operation performed at $v_i$.
For example, at node $v_2$, which computes the product $x_1 x_2$, the product rule gives $\dot{v}_2 = \dot{x}_1 \, x_2 + x_1 \, \dot{x}_2$; with the seed $\dot{x}_1 = 1$, $\dot{x}_2 = 0$ this evaluates to $x_2 = 5$ at $(2,5)$.
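Putting all the elementary steps together, here is a minimal sketch of the full forward-mode sweep (illustrative variable names, not a library API): each value $v_i$ is carried together with its tangent $\dot{v}_i$.
In [ ]:
import numpy as np

x1, x2 = 2.0, 5.0
dx1, dx2 = 1.0, 0.0            # seed: differentiate with respect to x1

v1, dv1 = np.log(x1), dx1 / x1                 # d ln(u) = du / u
v2, dv2 = x1 * x2,    dx1 * x2 + x1 * dx2      # product rule
v3, dv3 = np.sin(x2), np.cos(x2) * dx2         # chain rule for sin
v4, dv4 = v1 + v2,    dv1 + dv2
v5, dv5 = v4 - v3,    dv4 - dv3

print("f(2,5) =", v5, " df/dx1 =", dv5)        # df/dx1 = 1/x1 + x2 = 5.5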
We have seen forward accumulation AD. Forward accumulation is efficient for functions $f : \mathbb{R}^n \rightarrow \mathbb{R}^m$ with $n \ll m$ (only $O(n)$ sweeps are necessary). For cases where $n \gg m$ a different technique is needed.
Luckily, we can also propagate derivatives backward from a given output. This is reverse accumulation AD. If we evaluate this product from left-to-right: $(\frac{\partial f}{\partial g} \times \frac{\partial g}{\partial h}) \times \frac{\partial h}{\partial x}$, this is called reverse-mode differentiation.
The reverse pass starts at the end (i.e. $\frac{\partial y}{\partial y} = 1$) and propagates derivatives backward to everything $y$ depends on. In our case, $y$ will correspond to the value of the loss function.
For our example function, the reverse sweep proceeds as in the sketch below.
This is a two-stage process. In the first stage the original function code is run forward, populating the intermediate variables $v_i$. In the second stage, derivatives are calculated by propagating adjoints in reverse, from the outputs to the inputs.
The most important property of reverse accumulation AD is that it is cheaper than forward accumulation AD for functions with a large number of input variables. In our case, $f : \mathbb{R}^n \rightarrow \mathbb{R}$, only one application of the reverse mode is sufficient to compute the full gradient of the function $\nabla f = \big( \frac{\partial y}{\partial x_1}, \dots ,\frac{\partial y}{\partial x_n} \big)$.
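For the running example, a minimal sketch of this reverse sweep (illustrative variable names; the adjoint $\bar{v}_i = \frac{\partial y}{\partial v_i}$ of a variable $v_i$ is written v_bar in the code) looks like this:
In [ ]:
import numpy as np

x1, x2 = 2.0, 5.0

# Forward pass: compute and store the intermediate values
v1 = np.log(x1); v2 = x1 * x2; v3 = np.sin(x2)
v4 = v1 + v2;    y  = v4 - v3

# Reverse pass: propagate adjoints from the output back to the inputs
y_bar  = 1.0
v4_bar = y_bar                                # y = v4 - v3
v3_bar = -y_bar
v1_bar = v4_bar                               # v4 = v1 + v2
v2_bar = v4_bar
x2_bar = v3_bar * np.cos(x2) + v2_bar * x1    # v3 = sin(x2), v2 = x1*x2
x1_bar = v1_bar / x1 + v2_bar * x2            # v1 = ln(x1),  v2 = x1*x2

# Both partial derivatives are obtained in a single reverse sweep
print("f(2,5) =", y, " df/dx1 =", x1_bar, " df/dx2 =", x2_bar)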
Autograd is a Python package that implements automatic differentiation; its main entry point is a single function, grad.
In [ ]:
!pip install autograd
Autograd can automatically differentiate Python and Numpy code:
In [ ]:
import autograd.numpy as np
from autograd import grad
x = np.array([2, 5], dtype=float)
def test(x):
    return np.log(x[0]) + x[0]*x[1] - np.sin(x[1])

grad_test = grad(test)
g = grad_test(x)
print("({:.2f},{:.2f})".format(g[0], g[1]))
Then, fitting a logistic regression model
$$ f(x) = \frac{1}{1 + e^{-w^\top x}} $$can be implemented in this way:
In [ ]:
import autograd.numpy as np
from autograd import grad

def sigmoid(x):
    return 1 / (1 + np.exp(-x))

def logistic_predictions(weights, inputs):
    return sigmoid(np.dot(inputs, weights))

def training_loss(weights, inputs, targets):
    preds = logistic_predictions(weights, inputs)
    label_probabilities = preds * targets + (1 - preds) * (1 - targets)
    return -np.sum(np.log(label_probabilities))

def optimize(inputs, targets, training_loss):
    # Optimize weights using gradient descent.
    gradient_loss = grad(training_loss)
    weights = np.zeros(inputs.shape[1])
    print("Initial loss:", training_loss(weights, inputs, targets))
    for i in range(100):
        weights -= gradient_loss(weights, inputs, targets) * 0.01
    print("Final loss:", training_loss(weights, inputs, targets))
    return weights

# Build a toy dataset.
inputs = np.array([[0.52, 1.12, 0.77],
                   [0.88, -1.08, 0.15],
                   [0.52, 0.06, -1.30],
                   [0.74, -2.49, 1.39]])
targets = np.array([True, True, False, True])

weights = optimize(inputs, targets, training_loss)
print("Weights:", weights)
Any complex function that can be decomposed into a set of elementary functions can be differentiated automatically, at machine precision, by this algorithm!
We no longer need to code complex derivatives to apply SGD!
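As a quick sanity check, we can compare Autograd's gradient of the logistic loss with the gradient derived by hand, $\nabla_w L = X^\top(\sigma(Xw) - t)$. This is only a sketch: the weight vector w_test below is an arbitrary choice, and the dataset is the toy dataset from above with 0/1 targets.
In [ ]:
import autograd.numpy as np
from autograd import grad

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

def training_loss(weights, inputs, targets):
    preds = sigmoid(np.dot(inputs, weights))
    label_probabilities = preds * targets + (1 - preds) * (1 - targets)
    return -np.sum(np.log(label_probabilities))

# Same toy dataset as above, with 0/1 targets
inputs = np.array([[0.52, 1.12, 0.77],
                   [0.88, -1.08, 0.15],
                   [0.52, 0.06, -1.30],
                   [0.74, -2.49, 1.39]])
targets = np.array([1.0, 1.0, 0.0, 1.0])

w_test = np.array([0.1, -0.2, 0.3])   # an arbitrary weight vector

auto_grad = grad(training_loss)(w_test, inputs, targets)
# Hand-derived gradient of the same loss: X^T (sigmoid(Xw) - t)
hand_grad = np.dot(inputs.T, sigmoid(np.dot(inputs, w_test)) - targets)

print(auto_grad)
print(hand_grad)   # the two gradients agree to machine precision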
In [ ]:
%reset
In [ ]:
import numpy as np
#Example dataset
N_samples_per_class = 100
d_dimensions = 2
x = np.vstack((np.random.randn(N_samples_per_class, d_dimensions),
               np.random.randn(N_samples_per_class, d_dimensions)
               + np.array([5, 5])))
y = np.concatenate([-1.0*np.ones(N_samples_per_class),
                    1.*np.ones(N_samples_per_class)])
%matplotlib inline
import matplotlib.pyplot as plt
fig, ax = plt.subplots(1, 1)
fig.set_facecolor('#EAEAF2')
idx = y==1
plt.scatter(x[idx.ravel(),0],x[idx.ravel(),1],alpha=0.25)
idx = y==-1
plt.scatter(x[idx.ravel(),0],x[idx.ravel(),1],alpha=0.5,color='pink')
In [ ]:
import autograd.numpy as np
from autograd import grad

def SVM_predictions(w, inputs):
    return np.dot(w[0,:-1], inputs.T) + w[0,-1]

def SVM_training_loss(weights, inputs, targets):
    pred = SVM_predictions(weights, inputs)
    return np.sum(np.maximum(0, 1 - targets*pred)) / inputs.shape[0]

def optimize(inputs, targets, training_loss):
    gradient_loss = grad(training_loss)
    weights = np.zeros((1, inputs.shape[1]+1))
    print("Initial loss:", training_loss(weights, inputs, targets))
    for i in range(10000):
        weights -= gradient_loss(weights, inputs, targets) * 0.05
        if i % 1000 == 0:
            print("  Loss:", training_loss(weights, inputs, targets))
    print("Final loss:", training_loss(weights, inputs, targets))
    return weights

weights = optimize(x, y, SVM_training_loss)
print("Weights", weights)
In [ ]:
delta = 0.1
xx = np.arange(-4.0, 10.0, delta)
yy = np.arange(-4.0, 10.0, delta)
XX, YY = np.meshgrid(xx, yy)
Xf = XX.flatten()
Yf = YY.flatten()
sz=XX.shape
test_data = np.concatenate([Xf[:,np.newaxis],Yf[:,np.newaxis]],axis=1)
Z = SVM_predictions(weights,test_data)
fig, ax = plt.subplots(1, 1)
fig.set_facecolor('#EAEAF2')
Z = np.reshape(Z,(xx.shape[0],xx.shape[0]))
plt.contour(XX,YY,Z,[0])
idx = y==1
plt.scatter(x[idx.ravel(),0],x[idx.ravel(),1],alpha=0.25)
idx = y==-1
plt.scatter(x[idx.ravel(),0],x[idx.ravel(),1],alpha=0.5,color='pink')
In [ ]:
%reset
In [ ]:
import numpy as np
#Example dataset
N_samples_per_class = 100
d_dimensions = 2
x = np.vstack((np.random.randn(N_samples_per_class, d_dimensions),
               np.random.randn(N_samples_per_class, d_dimensions)
               + np.array([2, 2])))
y = np.concatenate([-1.0*np.ones(N_samples_per_class),
                    1.*np.ones(N_samples_per_class)])
%matplotlib inline
import matplotlib.pyplot as plt
fig, ax = plt.subplots(1, 1)
fig.set_facecolor('#EAEAF2')
idx = y==1
plt.scatter(x[idx.ravel(),0],x[idx.ravel(),1],alpha=0.25)
idx = y==-1
plt.scatter(x[idx.ravel(),0],x[idx.ravel(),1],alpha=0.5,color='pink')
In [ ]:
import autograd.numpy as np
from autograd import grad

def SVM_predictions(w, inputs):
    return np.dot(w[0,:-1], inputs.T) + w[0,-1]

def SVM_training_loss(weights, inputs, targets):
    pred = SVM_predictions(weights, inputs)
    return np.sum(np.maximum(0, 1 - targets*pred)) / inputs.shape[0]

def optimize(inputs, targets, training_loss):
    gradient_loss = grad(training_loss)
    weights = np.zeros((1, inputs.shape[1]+1))
    print("Initial loss:", training_loss(weights, inputs, targets))
    for i in range(10000):
        weights -= gradient_loss(weights, inputs, targets) * 0.05
        if i % 1000 == 0:
            print("  Loss:", training_loss(weights, inputs, targets))
    print("Final loss:", training_loss(weights, inputs, targets))
    return weights

weights = optimize(x, y, SVM_training_loss)
print("Weights", weights)
In [ ]:
delta = 0.1
xx = np.arange(-4.0, 6.0, delta)
yy = np.arange(-4.0, 6.0, delta)
XX, YY = np.meshgrid(xx, yy)
Xf = XX.flatten()
Yf = YY.flatten()
sz=XX.shape
test_data = np.concatenate([Xf[:,np.newaxis],Yf[:,np.newaxis]],axis=1)
Z = SVM_predictions(weights,test_data)
fig, ax = plt.subplots(1, 1)
fig.set_facecolor('#EAEAF2')
Z = np.reshape(Z,(xx.shape[0],xx.shape[0]))
plt.contour(XX,YY,Z,[0])
idx = y==1
plt.scatter(x[idx.ravel(),0],x[idx.ravel(),1],alpha=0.25)
idx = y==-1
plt.scatter(x[idx.ravel(),0],x[idx.ravel(),1],alpha=0.5,color='pink')
In [ ]:
# your code here
In [ ]:
%reset
In [ ]:
import matplotlib.pyplot as plt
import sklearn
import sklearn.datasets
import sklearn.linear_model
import matplotlib
import autograd.numpy as np
from autograd import grad
from autograd.util import flatten   # in recent Autograd releases: from autograd.misc.flatten import flatten
# Display plots inline and change default figure size
%matplotlib inline
matplotlib.rcParams['figure.figsize'] = (6.0, 4.0)
# Generate a dataset and plot it
np.random.seed(0)
X, y = sklearn.datasets.make_moons(200, noise=0.20)
plt.scatter(X[:,0], X[:,1], s=40, c=y, alpha=0.45)
Let's now build a 3-layer neural network with one input layer, one hidden layer, and one output layer. The number of nodes in the input layer is determined by the dimensionality of our data, 2. Similarly, the number of nodes in the output layer is determined by the number of classes we have, also 2.
Our network makes predictions using forward propagation, which is just a bunch of matrix multiplications and the application of the activation function(s). If $x$ is the 2-dimensional input to our network then we calculate our prediction $\hat{y}$ (also two-dimensional) as follows:
$$ z_1 = x W_1 + b_1 $$$$ a_1 = \mbox{tanh}(z_1) $$$$ z_2 = a_1 W_2 + b_2$$$$ a_2 = \mbox{softmax}({z_2})$$$W_1, b_1, W_2, b_2$ are parameters of our network, which we need to learn from our training data. You can think of the $W$'s as matrices transforming data between layers of the network and of the $b$'s as bias vectors. Looking at the matrix multiplications above we can figure out the dimensionality of these matrices. If we use 500 nodes for our hidden layer then $W_1 \in \mathbb{R}^{2\times500}$, $b_1 \in \mathbb{R}^{500}$, $W_2 \in \mathbb{R}^{500\times2}$, $b_2 \in \mathbb{R}^{2}$.
A common choice with the softmax output is the cross-entropy loss. If we have $N$ training examples and $C$ classes then the loss for our prediction $\hat{y}$ with respect to the true labels $y$ is given by:
$$ \begin{aligned} L(y,\hat{y}) = - \frac{1}{N} \sum_{n \in N} \sum_{i \in C} y_{n,i} \log\hat{y}_{n,i} \end{aligned} $$
In [ ]:
num_examples = len(X)  # training set size
nn_input_dim = 2       # input layer dimensionality
nn_output_dim = 2      # output layer dimensionality

# Gradient descent parameters
epsilon = 0.01      # learning rate for gradient descent
reg_lambda = 0.01   # regularization strength

def calculate_loss(model):
    W1, b1, W2, b2 = model['W1'], model['b1'], model['W2'], model['b2']
    # Forward propagation to calculate our predictions
    z1 = np.dot(X, W1) + b1
    a1 = np.tanh(z1)
    z2 = np.dot(a1, W2) + b2
    exp_scores = np.exp(z2)
    probs = exp_scores / np.sum(exp_scores, axis=1, keepdims=True)
    # Calculating the loss
    correct_logprobs = -np.log(probs[range(num_examples), y])
    data_loss = np.sum(correct_logprobs)
    # Add regularization term to loss (optional)
    data_loss += reg_lambda/2 * (np.sum(np.square(W1)) + np.sum(np.square(W2)))
    return 1./num_examples * data_loss

# Helper function to predict an output (0 or 1)
def predict(model, x):
    W1, b1, W2, b2 = model['W1'], model['b1'], model['W2'], model['b2']
    # Forward propagation
    z1 = np.dot(x, W1) + b1
    a1 = np.tanh(z1)
    z2 = np.dot(a1, W2) + b2
    exp_scores = np.exp(z2)
    probs = exp_scores / np.sum(exp_scores, axis=1, keepdims=True)
    return np.argmax(probs, axis=1)
This is a version that solves the optimization problem by using the backpropagation algorithm (hand-coded derivatives):
In [ ]:
# This function learns parameters for the neural network and returns the model.
# - nn_hdim: Number of nodes in the hidden layer
# - num_passes: Number of passes through the training data for gradient descent
# - print_loss: If True, print the loss every 1000 iterations
def build_model(nn_hdim, num_passes=20000, print_loss=False):

    # Initialize the parameters to random values. We need to learn these.
    np.random.seed(0)
    W1 = np.random.randn(nn_input_dim, nn_hdim) / np.sqrt(nn_input_dim)
    b1 = np.zeros((1, nn_hdim))
    W2 = np.random.randn(nn_hdim, nn_output_dim) / np.sqrt(nn_hdim)
    b2 = np.zeros((1, nn_output_dim))

    # This is what we return at the end
    model = {}

    # Gradient descent. For each batch...
    for i in range(0, num_passes):

        # Forward propagation
        z1 = np.dot(X, W1) + b1
        a1 = np.tanh(z1)
        z2 = np.dot(a1, W2) + b2
        exp_scores = np.exp(z2)
        probs = exp_scores / np.sum(exp_scores, axis=1, keepdims=True)

        # Backpropagation
        delta3 = probs
        delta3[range(num_examples), y] -= 1
        dW2 = (a1.T).dot(delta3)
        db2 = np.sum(delta3, axis=0, keepdims=True)
        delta2 = delta3.dot(W2.T) * (1 - np.power(a1, 2))
        dW1 = np.dot(X.T, delta2)
        db1 = np.sum(delta2, axis=0)

        # Add regularization terms (b1 and b2 don't have regularization terms)
        dW2 += reg_lambda * W2
        dW1 += reg_lambda * W1

        # Gradient descent parameter update
        W1 += -epsilon * dW1
        b1 += -epsilon * db1
        W2 += -epsilon * dW2
        b2 += -epsilon * db2

        # Assign new parameters to the model
        model = {'W1': W1, 'b1': b1, 'W2': W2, 'b2': b2}

        # Optionally print the loss.
        # This is expensive because it uses the whole dataset, so we don't want to do it too often.
        if print_loss and i % 1000 == 0:
            print("Loss after iteration %i: %f" % (i, calculate_loss(model)))

    return model

def plot_decision_boundary(pred_func):
    # Set min and max values and give it some padding
    x_min, x_max = X[:, 0].min() - .5, X[:, 0].max() + .5
    y_min, y_max = X[:, 1].min() - .5, X[:, 1].max() + .5
    h = 0.01
    # Generate a grid of points with distance h between them
    xx, yy = np.meshgrid(np.arange(x_min, x_max, h), np.arange(y_min, y_max, h))
    # Predict the function value for the whole grid
    Z = pred_func(np.c_[xx.ravel(), yy.ravel()])
    Z = Z.reshape(xx.shape)
    # Plot the contour and training examples
    plt.contourf(xx, yy, Z, alpha=0.45)
    plt.scatter(X[:, 0], X[:, 1], c=y, alpha=0.45)

# Build a model with a 3-dimensional hidden layer
model = build_model(3, print_loss=True)

# Plot the decision boundary
plot_decision_boundary(lambda x: predict(model, x))
plt.title("Decision Boundary for hidden layer size 3")
The next version solves the optimization problem by using AD:
In [ ]:
# This function learns parameters for the neural network and returns the model.
# - nn_hdim: Number of nodes in the hidden layer
# - num_passes: Number of passes through the training data for gradient descent
# - print_loss: If True, print the loss every 1000 iterations
def build_model(nn_hdim, num_passes=20000, print_loss=False):

    # Initialize the parameters to random values. We need to learn these.
    np.random.seed(0)
    W1 = np.random.randn(nn_input_dim, nn_hdim) / np.sqrt(nn_input_dim)
    b1 = np.zeros((1, nn_hdim))
    W2 = np.random.randn(nn_hdim, nn_output_dim) / np.sqrt(nn_hdim)
    b2 = np.zeros((1, nn_output_dim))

    # This is what we return at the end
    model = {'W1': W1, 'b1': b1, 'W2': W2, 'b2': b2}

    # The forward pass is performed inside calculate_loss, so here we only need
    # its gradient, which Autograd builds for us once, before the loop.
    gradient_loss = grad(calculate_loss)

    # Gradient descent. For each batch...
    for i in range(0, num_passes):

        # Flatten parameters and gradient into vectors, take a gradient descent
        # step, and restore the dictionary structure.
        model_flat, unflatten_m = flatten(model)
        grad_flat, unflatten_g = flatten(gradient_loss(model))
        model_flat -= grad_flat * 0.05
        model = unflatten_m(model_flat)

        # Optionally print the loss.
        # This is expensive because it uses the whole dataset, so we don't want to do it too often.
        if print_loss and i % 1000 == 0:
            print("Loss after iteration %i: %f" % (i, calculate_loss(model)))

    return model
# plot_decision_boundary is reused from the cell above
# Build a model with a 3-dimensional hidden layer
model = build_model(3, print_loss=True)
# Plot the decision boundary
plot_decision_boundary(lambda x: predict(model, x))
plt.title("Decision Boundary for hidden layer size 3")
Let's now get a sense of how varying the hidden layer size affects the result.
In [ ]:
plt.figure(figsize=(16, 32))
hidden_layer_dimensions = [1, 2, 3, 4, 5, 20, 50]
for i, nn_hdim in enumerate(hidden_layer_dimensions):
    plt.subplot(5, 2, i+1)
    plt.title('Hidden Layer size %d' % nn_hdim)
    model = build_model(nn_hdim)
    plot_decision_boundary(lambda x: predict(model, x))
plt.show()
A naïve description of the basic setup in deep learning would be: a many-layers-deep network of linear layers, linear convolutions, and occasionally recurrent structures, with nonlinear activation functions, trained by computing the gradient of the training error through reverse-mode AD.
AD is a critical component when developing deep models because it makes the use of SGD much easier and more robust (for example, hand-coding derivatives is no longer a source of bugs). In spite of this, optimizing deep models is still not an easy task: gradient-based optimization still suffers from several problems, such as ill-conditioning, saddle points and local minima, and vanishing or exploding gradients.
To address these issues, the deep learning community has developed a number of tricks. First, gradient tricks, namely methods to make the gradient either easier to calculate or to give it more desirable properties. And second, optimization tricks, namely new methods related to stochastic optimization.
The deep learning community has been developing many methods to make gradient descent work with deep architectures. These methods (related to model architectures and gradient calculation) are numerous; here we discuss only a small sample of them.