The backpropagation algorithm was originally introduced in the 1970s, but its importance wasn't fully appreciated until a famous 1986 paper by David Rumelhart, Geoffrey Hinton, and Ronald Williams. (Michael Nielsen in "Neural Networks and Deep Learning", http://neuralnetworksanddeeplearning.com/chap2.html).
Backpropagation is the key algorithm that makes training deep models computationally tractable. For modern neural networks, it can make training with gradient descent as much as ten million times faster, relative to a naive implementation. That’s the difference between a model taking a week to train and taking 200,000 years. (Christopher Olah, 2016)
We have seen that, in order to optimize our models, we need to compute the derivative of the loss function with respect to all model parameters.
The computation of derivatives in computer programs is addressed by four main methods: manually working out the derivatives and coding them; numerical differentiation (finite-difference approximations); symbolic differentiation (as performed by computer algebra systems); and automatic differentiation (AD).
When training large and deep neural networks, AD is the only practical alternative.
Automatic differentiation (AD) works by systematically applying the chain rule of differential calculus at the elementary operator level.
Let $y = f(g(x))$ be our target function. In its basic form, the chain rule states:
$$ \frac{\partial f}{\partial x} = \frac{\partial f}{\partial g} \frac{\partial g}{\partial x} $$or, if there is more than one variable $g_i$ between $y$ and $x$ (e.g. if $f$ is a two-dimensional function such as $f(g_1(x), g_2(x))$), then:
$$ \frac{\partial f}{\partial x} = \sum_i \frac{\partial f}{\partial g_i} \frac{\partial g_i}{\partial x} $$For example, let's consider the derivative of this function:
$$ f(x) = \frac{1}{1 + e^{- ({w}^T \cdot x + b)}} $$Now, let's write how to evaluate $f(x)$ via a sequence of primitive operations:
x = ?
f1 = w * x
f2 = f1 + b
f3 = -f2
f4 = 2.718281828459 ** f3
f5 = 1.0 + f4
f = 1.0/f5
The question mark indicates that $x$ is a value that must be provided. This program can compute the value of $f(x)$ and, along the way, populate the intermediate variables $f_1, \dots, f_5$.
We can evaluate $\frac{\partial f}{\partial x}$ at some $x$ by using the chain rule. This is called forward-mode differentiation. In our case:
def dfdx_forward(x, w, b):
    f1 = w * x
    df1 = w                            # = df1/dx
    f2 = f1 + b
    df2 = df1 * 1.0                    # = df1 * df2/df1
    f3 = -f2
    df3 = df2 * -1.0                   # = df2 * df3/df2
    f4 = 2.718281828459 ** f3
    df4 = df3 * 2.718281828459 ** f3   # = df3 * df4/df3
    f5 = 1.0 + f4
    df5 = df4 * 1.0                    # = df4 * df5/df4
    df6 = df5 * -1.0 / f5 ** 2.0       # = df5 * df6/df5
    return df6
It is interesting to note that this program can be readily executed if we have access to subroutines implementing the derivatives of primitive functions (such as $\exp{(x)}$ or $1/x$) and all intermediate variables are computed in the right order. It is also interesting to note that AD allows the accurate evaluation of derivatives at machine precision, with only a small constant factor of overhead.
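To make the machine-precision claim concrete, here is a minimal sketch (assuming the `dfdx_forward` routine above is available; the point $x = 0.5$, $w = 2$, $b = -1$ is arbitrary) that compares the forward-mode derivative with the closed-form derivative $f(x)(1-f(x))\,w$ and with a central finite-difference approximation, which is the one limited by truncation and round-off error:
import math

def f(x, w, b):
    # same function as the primitive-operation program above
    return 1.0 / (1.0 + math.exp(-(w * x + b)))

x, w, b = 0.5, 2.0, -1.0
analytic = f(x, w, b) * (1.0 - f(x, w, b)) * w                # closed-form derivative
finite_diff = (f(x + 1e-5, w, b) - f(x - 1e-5, w, b)) / 2e-5  # central finite difference
print(dfdx_forward(x, w, b), analytic, finite_diff)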
Forward differentiation is efficient for functions $f : \mathbb{R}^n \rightarrow \mathbb{R}^m$ with $n \ll m$ (only $O(n)$ sweeps are necessary). For cases with $n \gg m$ a different technique is needed. To this end, we will rewrite the chain rule as:
$$ \frac{\partial f}{\partial x} = \frac{\partial g}{\partial x} \frac{\partial f}{\partial g} $$to propagate derivatives backward from a given output. This is called reverse-mode differentiation. The reverse pass starts at the end (i.e. with $\frac{\partial f}{\partial f} = 1$) and propagates derivatives backward to all dependencies.
def dfdx_backward(x, w, b):
    # forward stage: evaluate the function, storing the intermediate values
    f1 = w * x
    f2 = f1 + b
    f3 = -f2
    f4 = 2.718281828459 ** f3
    f5 = 1.0 + f4
    f6 = 1.0 / f5
    # reverse stage: propagate derivatives from the output back to the input
    df6 = 1.0                          # = df/df6
    df5 = -1.0 / (f5 * f5) * df6       # = df6 * df6/df5
    df4 = df5 * 1.0                    # = df5 * df5/df4
    df3 = df4 * 2.718281828459 ** f3   # = df4 * df4/df3  (d e**f3 / df3 = e**f3)
    df2 = df3 * -1.0                   # = df3 * df3/df2
    df1 = df2 * 1.0                    # = df2 * df2/df1
    df = df1 * w                       # = df1 * df1/dx
    return df
In practice, reverse-mode differentiation is a two-stage process. In the first stage the original function code is run forward, populating $f_i$ variables. In the second stage, derivatives are calculated by propagating in reverse, from the outputs to the inputs.
The most important property of reverse-mode differentiation is that it is cheaper than forward-mode differentiation for functions with a high number of input variables. In our case, $f : \mathbb{R}^n \rightarrow \mathbb{R}$, only one application of the reverse mode is sufficient to compute the full gradient of the function $\nabla f = \big( \frac{\partial y}{\partial x_1}, \dots ,\frac{\partial y}{\partial x_n} \big)$. This is the case of deep learning, where the number of input variables is very high.
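As an illustration, here is a sketch based on the `dfdx_backward` code above (the names `grads_backward`, `dx`, `dw`, `db` are ours, not part of the original program): a single reverse pass already yields the derivative of $f$ with respect to every input.
def grads_backward(x, w, b):
    # forward stage
    f1 = w * x
    f2 = f1 + b
    f3 = -f2
    f4 = 2.718281828459 ** f3
    f5 = 1.0 + f4
    f6 = 1.0 / f5
    # reverse stage: one sweep, all partial derivatives
    df6 = 1.0
    df5 = -1.0 / (f5 * f5) * df6
    df4 = df5 * 1.0
    df3 = df4 * f4                     # d(e**f3)/df3 = e**f3 = f4
    df2 = df3 * -1.0
    df1 = df2 * 1.0
    dx = df1 * w                       # df/dx
    dw = df1 * x                       # df/dw
    db = df2 * 1.0                     # df/db
    return dx, dw, db

print(grads_backward(0.5, 2.0, -1.0))  # dx agrees with dfdx_forward(0.5, 2.0, -1.0)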
As we have seen, AD relies on the fact that all numerical computations are ultimately compositions of a finite set of elementary operations for which derivatives are known. For this reason, given a library of derivatives of all the elementary functions used in a deep neural network, we are able to compute the derivatives of the network with respect to all parameters at machine precision and to apply stochastic gradient methods to its training. Without this automation, the design and debugging of optimization procedures for complex neural networks with millions of parameters would be impossible.
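To give a flavour of how such a library can be organised, here is a minimal sketch (the names `Var`, `v_exp` and `v_reciprocal` are hypothetical, and this is not how Autograd or any real AD package is implemented internally): each elementary operation returns its value together with the local derivatives needed for the reverse pass.
import math

class Var:
    """A value plus the bookkeeping needed for a reverse pass."""
    def __init__(self, value, parents=()):
        self.value = value
        self.parents = parents   # pairs (parent Var, local derivative)
        self.grad = 0.0

    def __add__(self, other):
        return Var(self.value + other.value, [(self, 1.0), (other, 1.0)])

    def __mul__(self, other):
        return Var(self.value * other.value,
                   [(self, other.value), (other, self.value)])

    def backward(self, seed=1.0):
        # chain rule: accumulate d(output)/d(self) and push it to the parents
        self.grad += seed
        for parent, local in self.parents:
            parent.backward(seed * local)

def v_exp(a):
    e = math.exp(a.value)
    return Var(e, [(a, e)])                                 # d exp(a)/da = exp(a)

def v_reciprocal(a):
    return Var(1.0 / a.value, [(a, -1.0 / a.value ** 2)])   # d(1/a)/da = -1/a**2

# f(x) = 1 / (1 + exp(-(w*x + b))) built only from elementary operations
x, w, b = Var(2.0), Var(3.0), Var(-1.0)
f = v_reciprocal(Var(1.0) + v_exp((w * x + b) * Var(-1.0)))
f.backward()
print(x.grad, w.grad, b.grad)   # df/dx, df/dw, df/db from one reverse pass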
Autograd is a Python module that implements automatic differentiation; its core functionality is exposed through a single function, grad.
In [17]:
!pip install autograd
Autograd can automatically differentiate Python and Numpy code:
In [18]:
import autograd.numpy as np
from autograd import grad

x = np.array([2, 5], dtype=float)

def test(x):
    return np.log(x[0]) + x[0]*x[1] - np.sin(x[1])

grad_test = grad(test)
print("({:.2f},{:.2f})".format(grad_test(x)[0], grad_test(x)[1]))
Then, fitting a logistic regression model
$$ f(x) = \frac{1}{1 + e^{-(w_0 + w_1 x)}} $$can be implemented in this way:
In [19]:
import autograd.numpy as np
from autograd import grad

def sigmoid(x):
    return 1 / (1 + np.exp(-x))

def logistic_predictions(weights, inputs):
    return sigmoid(np.dot(inputs, weights))

def training_loss(weights, inputs, targets):
    preds = logistic_predictions(weights, inputs)
    label_probabilities = preds * targets + (1 - preds) * (1 - targets)
    return -np.sum(np.log(label_probabilities))

def optimize(inputs, targets, training_loss):
    # Optimize weights using gradient descent.
    gradient_loss = grad(training_loss)
    weights = np.zeros(inputs.shape[1])
    print("Initial loss:", training_loss(weights, inputs, targets))
    for i in range(100):
        weights -= gradient_loss(weights, inputs, targets) * 0.01
    print("Final loss:", training_loss(weights, inputs, targets))
    return weights

# Build a toy dataset.
inputs = np.array([[0.52, 1.12, 0.77],
                   [0.88, -1.08, 0.15],
                   [0.52, 0.06, -1.30],
                   [0.74, -2.49, 1.39]])
targets = np.array([True, True, False, True])

weights = optimize(inputs, targets, training_loss)
print("Weights:", weights)
Any complex function that can be decomposed into a set of elementary functions can be differentiated automatically, at machine precision, by this algorithm!
We no longer need to code complex derivatives to apply SGD!
In [20]:
%reset
In [21]:
import numpy as np
#Example dataset
N_samples_per_class = 100
d_dimensions = 2
x = np.vstack((np.random.randn(N_samples_per_class, d_dimensions),
np.random.randn(N_samples_per_class, d_dimensions)
+np.array([5,5])))
y = np.concatenate([-1.0*np.ones(N_samples_per_class),
1.*np.ones(N_samples_per_class)])
%matplotlib inline
import matplotlib.pyplot as plt
fig, ax = plt.subplots(1, 1)
fig.set_facecolor('#EAEAF2')
idx = y==1
plt.scatter(x[idx.ravel(),0],x[idx.ravel(),1],alpha=0.25)
idx = y==-1
plt.scatter(x[idx.ravel(),0],x[idx.ravel(),1],alpha=0.5,color='pink')
Out[21]:
In [ ]:
import autograd.numpy as np
from autograd import grad

def SVM_predictions(w, inputs):
    return np.dot(w[0,:-1], inputs.T) + w[0,-1]

# your code in this function
def SVM_training_loss(weights, inputs, targets):
    pred = SVM_predictions(weights, inputs)
    pass

def optimize(inputs, targets, training_loss):
    gradient_loss = grad(training_loss)
    weights = np.zeros((1, inputs.shape[1]+1))
    print("Initial loss:", training_loss(weights, inputs, targets))
    for i in range(10000):
        weights -= gradient_loss(weights, inputs, targets) * 0.05
        if i % 1000 == 0:
            print("  Loss:", training_loss(weights, inputs, targets))
    print("Final loss:", training_loss(weights, inputs, targets))
    return weights

weights = optimize(x, y, SVM_training_loss)
print("Weights", weights)
In [ ]:
delta = 0.1
xx = np.arange(-4.0, 10.0, delta)
yy = np.arange(-4.0, 10.0, delta)
XX, YY = np.meshgrid(xx, yy)
Xf = XX.flatten()
Yf = YY.flatten()
sz=XX.shape
test_data = np.concatenate([Xf[:,np.newaxis],Yf[:,np.newaxis]],axis=1)
Z = SVM_predictions(weights,test_data)
fig, ax = plt.subplots(1, 1)
fig.set_facecolor('#EAEAF2')
Z = np.reshape(Z,(xx.shape[0],xx.shape[0]))
plt.contour(XX,YY,Z,[0])
idx = y==1
plt.scatter(x[idx.ravel(),0],x[idx.ravel(),1],alpha=0.25)
idx = y==-1
plt.scatter(x[idx.ravel(),0],x[idx.ravel(),1],alpha=0.5,color='pink')
In [24]:
%reset
In [25]:
import matplotlib.pyplot as plt
import sklearn
import sklearn.datasets
import sklearn.linear_model
import matplotlib
import autograd.numpy as np
from autograd import grad
from autograd.misc import flatten
# Display plots inline and change default figure size
%matplotlib inline
matplotlib.rcParams['figure.figsize'] = (6.0, 4.0)
# Generate a dataset and plot it
np.random.seed(0)
X, y = sklearn.datasets.make_moons(200, noise=0.20)
plt.scatter(X[:,0], X[:,1], s=40, c=y, alpha=0.45)
Out[25]:
Let's now build a 3-layer neural network with one input layer, one hidden layer, and one output layer. The number of nodes in the input layer is determined by the dimensionality of our data, 2. Similarly, the number of nodes in the output layer is determined by the number of classes we have, also 2.
Our network makes predictions using forward propagation, which is just a bunch of matrix multiplications and the application of the activation function(s). If $x$ is the 2-dimensional input to our network then we calculate our prediction $\hat{y}$ (also two-dimensional) as follows:
$$ z_1 = x W_1 + b_1 $$
$$ a_1 = \mbox{tanh}(z_1) $$
$$ z_2 = a_1 W_2 + b_2 $$
$$ a_2 = \mbox{softmax}(z_2) $$
$W_1, b_1, W_2, b_2$ are parameters of our network, which we need to learn from our training data. You can think of them as matrices transforming data between layers of the network. Looking at the matrix multiplications above we can figure out the dimensionality of these matrices. If we use 500 nodes for our hidden layer then $W_1 \in \mathbb{R}^{2\times 500}$, $b_1 \in \mathbb{R}^{500}$, $W_2 \in \mathbb{R}^{500\times 2}$, $b_2 \in \mathbb{R}^{2}$.
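To make these dimensions concrete, here is a quick shape check (only a sketch, using the 500-unit hidden layer mentioned above; the `_demo` names are ours):
In [ ]:
# forward pass shape check: x (1x2) -> z1, a1 (1x500) -> z2, a2 (1x2)
x_demo  = np.random.randn(1, 2)
W1_demo = np.random.randn(2, 500);  b1_demo = np.zeros((1, 500))
W2_demo = np.random.randn(500, 2);  b2_demo = np.zeros((1, 2))
a1_demo = np.tanh(np.dot(x_demo, W1_demo) + b1_demo)
z2_demo = np.dot(a1_demo, W2_demo) + b2_demo
a2_demo = np.exp(z2_demo) / np.sum(np.exp(z2_demo), axis=1, keepdims=True)  # softmax
print(a1_demo.shape, a2_demo.shape)   # (1, 500) (1, 2)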
A common choice with the softmax output is the cross-entropy loss. If we have $N$ training examples and $C$ classes then the loss for our prediction $\hat{y}$ with respect to the true labels $y$ is given by:
$$ \begin{aligned} L(y,\hat{y}) = - \frac{1}{N} \sum_{n \in N} \sum_{i \in C} y_{n,i} \log\hat{y}_{n,i} \end{aligned} $$
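As a small worked example of this loss (a sketch with made-up numbers): for two examples and two classes, with one-hot labels and softmax outputs,
In [ ]:
y_true = np.array([[1, 0], [0, 1]])           # one-hot labels, N = 2
y_hat  = np.array([[0.9, 0.1], [0.2, 0.8]])   # predicted probabilities
L = -np.mean(np.sum(y_true * np.log(y_hat), axis=1))
print(L)   # -(log(0.9) + log(0.8)) / 2, roughly 0.164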
In [26]:
num_examples = len(X) # training set size
nn_input_dim = 2 # input layer dimensionality
nn_output_dim = 2 # output layer dimensionality
# Gradient descent parameters
epsilon = 0.01 # learning rate for gradient descent
reg_lambda = 0.01 # regularization strength
def calculate_loss(model):
    W1, b1, W2, b2 = model['W1'], model['b1'], model['W2'], model['b2']
    # Forward propagation to calculate our predictions
    z1 = np.dot(X, W1) + b1
    a1 = np.tanh(z1)
    z2 = np.dot(a1, W2) + b2
    exp_scores = np.exp(z2)
    probs = exp_scores / np.sum(exp_scores, axis=1, keepdims=True)
    # Calculating the loss
    correct_logprobs = -np.log(probs[range(num_examples), y])
    data_loss = np.sum(correct_logprobs)
    # Add regularization term to loss (optional)
    data_loss += reg_lambda/2 * (np.sum(np.square(W1))
                                 + np.sum(np.square(W2)))
    return 1./num_examples * data_loss

# Helper function to predict an output class (0 or 1)
def predict(model, x):
    W1, b1, W2, b2 = model['W1'], model['b1'], model['W2'], model['b2']
    # Forward propagation
    z1 = np.dot(x, W1) + b1
    a1 = np.tanh(z1)
    z2 = np.dot(a1, W2) + b2
    exp_scores = np.exp(z2)
    probs = exp_scores / np.sum(exp_scores, axis=1, keepdims=True)
    return np.argmax(probs, axis=1)
This is a version that solves the optimization problem by using the backpropagation algorithm (hand-coded derivatives):
In [27]:
# This function learns parameters for the neural network and returns the model.
# - nn_hdim: Number of nodes in the hidden layer
# - num_passes: Number of passes through the training data for gradient descent
# - print_loss: If True, print the loss every 1000 iterations
def build_model(nn_hdim, num_passes=20000, print_loss=False):
    # Initialize the parameters to random values. We need to learn these.
    np.random.seed(0)
    W1 = np.random.randn(nn_input_dim, nn_hdim) / np.sqrt(nn_input_dim)
    b1 = np.zeros((1, nn_hdim))
    W2 = np.random.randn(nn_hdim, nn_output_dim) / np.sqrt(nn_hdim)
    b2 = np.zeros((1, nn_output_dim))
    # This is what we return at the end
    model = {}
    # Gradient descent. For each batch...
    for i in range(0, num_passes):
        # Forward propagation
        z1 = np.dot(X, W1) + b1
        a1 = np.tanh(z1)
        z2 = np.dot(a1, W2) + b2
        exp_scores = np.exp(z2)
        probs = exp_scores / np.sum(exp_scores, axis=1, keepdims=True)
        # Backpropagation (hand-coded derivatives)
        delta3 = probs
        delta3[range(num_examples), y] -= 1
        dW2 = (a1.T).dot(delta3)
        db2 = np.sum(delta3, axis=0, keepdims=True)
        delta2 = delta3.dot(W2.T) * (1 - np.power(a1, 2))
        dW1 = np.dot(X.T, delta2)
        db1 = np.sum(delta2, axis=0)
        # Add regularization terms (b1 and b2 don't have regularization terms)
        dW2 += reg_lambda * W2
        dW1 += reg_lambda * W1
        # Gradient descent parameter update
        W1 += -epsilon * dW1
        b1 += -epsilon * db1
        W2 += -epsilon * dW2
        b2 += -epsilon * db2
        # Assign new parameters to the model
        model = {'W1': W1, 'b1': b1, 'W2': W2, 'b2': b2}
        # Optionally print the loss.
        # This is expensive because it uses the whole dataset, so we don't want to do it too often.
        if print_loss and i % 1000 == 0:
            print("Loss after iteration %i: %f" % (i, calculate_loss(model)))
    return model
def plot_decision_boundary(pred_func):
    # Set min and max values and give it some padding
    x_min, x_max = X[:, 0].min() - .5, X[:, 0].max() + .5
    y_min, y_max = X[:, 1].min() - .5, X[:, 1].max() + .5
    h = 0.01
    # Generate a grid of points with distance h between them
    xx, yy = np.meshgrid(np.arange(x_min, x_max, h), np.arange(y_min, y_max, h))
    # Predict the function value for the whole grid
    Z = pred_func(np.c_[xx.ravel(), yy.ravel()])
    Z = Z.reshape(xx.shape)
    # Plot the contour and training examples
    plt.contourf(xx, yy, Z, alpha=0.45)
    plt.scatter(X[:, 0], X[:, 1], c=y, alpha=0.45)

# Build a model with a 3-dimensional hidden layer
model = build_model(3, print_loss=True)

# Plot the decision boundary
plot_decision_boundary(lambda x: predict(model, x))
plt.title("Decision Boundary for hidden layer size 3")
Out[27]:
The next version solves the optimization problem by using AD:
In [28]:
# This function learns parameters for the neural network and returns the model.
# - nn_hdim: Number of nodes in the hidden layer
# - num_passes: Number of passes through the training data for gradient descent
# - print_loss: If True, print the loss every 1000 iterations
def build_model(nn_hdim, num_passes=20000, print_loss=False):
    # Initialize the parameters to random values. We need to learn these.
    np.random.seed(0)
    W1 = np.random.randn(nn_input_dim, nn_hdim) / np.sqrt(nn_input_dim)
    b1 = np.zeros((1, nn_hdim))
    W2 = np.random.randn(nn_hdim, nn_output_dim) / np.sqrt(nn_hdim)
    b2 = np.zeros((1, nn_output_dim))
    # This is what we return at the end
    model = {'W1': W1, 'b1': b1, 'W2': W2, 'b2': b2}
    # Gradient descent. For each batch...
    for i in range(0, num_passes):
        # Forward propagation
        z1 = np.dot(X, model['W1']) + model['b1']
        a1 = np.tanh(z1)
        z2 = np.dot(a1, model['W2']) + model['b2']
        exp_scores = np.exp(z2)
        probs = exp_scores / np.sum(exp_scores, axis=1, keepdims=True)
        # Let autograd compute the gradient of the loss with respect to all parameters
        gradient_loss = grad(calculate_loss)
        model_flat, unflatten_m = flatten(model)
        grad_flat, unflatten_g = flatten(gradient_loss(model))
        model_flat -= grad_flat * 0.05
        model = unflatten_m(model_flat)
        # Optionally print the loss.
        # This is expensive because it uses the whole dataset, so we don't want to do it too often.
        if print_loss and i % 1000 == 0:
            print("Loss after iteration %i: %f" % (i, calculate_loss(model)))
    return model
def plot_decision_boundary(pred_func):
    # Set min and max values and give it some padding
    x_min, x_max = X[:, 0].min() - .5, X[:, 0].max() + .5
    y_min, y_max = X[:, 1].min() - .5, X[:, 1].max() + .5
    h = 0.01
    # Generate a grid of points with distance h between them
    xx, yy = np.meshgrid(np.arange(x_min, x_max, h), np.arange(y_min, y_max, h))
    # Predict the function value for the whole grid
    Z = pred_func(np.c_[xx.ravel(), yy.ravel()])
    Z = Z.reshape(xx.shape)
    # Plot the contour and training examples
    plt.contourf(xx, yy, Z, alpha=0.45)
    plt.scatter(X[:, 0], X[:, 1], c=y, alpha=0.45)

# Build a model with a 3-dimensional hidden layer
model = build_model(3, print_loss=True)

# Plot the decision boundary
plot_decision_boundary(lambda x: predict(model, x))
plt.title("Decision Boundary for hidden layer size 3")
Out[28]:
Let's now get a sense of how varying the hidden layer size affects the result.
In [29]:
plt.figure(figsize=(16, 32))
hidden_layer_dimensions = [1, 2, 3, 4, 5, 20, 50]
for i, nn_hdim in enumerate(hidden_layer_dimensions):
    plt.subplot(5, 2, i+1)
    plt.title('Hidden Layer size %d' % nn_hdim)
    model = build_model(nn_hdim)
    plot_decision_boundary(lambda x: predict(model, x))
plt.show()
A naïve description of the basic setup in deep learning would be: a many-layers-deep network of linear layers, linear convolutions and occasionally recurrent structures, with nonlinear activation functions, that is trained by computing the gradient of the training error through reverse-mode AD.
AD is a critical component when developing deep models because it makes the use of SGD much easier and more robust (e.g. derivative computation is free of bugs!), but in spite of this, the optimization of deep models is still not an easy task. Gradient-based optimization still suffers from several problems, such as ill-conditioning of the loss surface, saddle points and plateaus, and vanishing or exploding gradients.
To address these issues, the deep learning community has developed a number of tricks. First, gradient tricks, namely methods to make the gradient either easier to calculate or to give it more desirable properties. And second, optimization tricks, namely new methods related to stochastic optimization.
The deep learning community has developed many methods to make gradient descent work with deep architectures. These methods (related to model architectures and to gradient computation) are numerous; here we discuss only a small sample of them.