Notes taken to help with the first project for the Deep Learning Foundations Nanodegree course delivered by Udacity.
My GitHub repo for this project can be found here: adriantorrie/udacity_dlfnd_project_1
In [1]:
%run ../../../code/version_check.py
Date Created: 2017-02-06
Date of Change Change Notes
-------------- ----------------------------------------------------------------
2017-02-06 Initial draft
2017-03-23 Formatting changes for online publishing
In [2]:
%matplotlib inline
import matplotlib
import matplotlib.pyplot as plt
import numpy as np
import tensorflow as tf
plt.style.use('bmh')
matplotlib.rcParams['figure.figsize'] = (15, 4)
<img src="../../../../images/simple-nn.png",width=450,height=200>
<img src="../../../../images/and-or-perceptron.png",width=450,height=200>
The NOT operation only cares about one input; the other inputs to the perceptron are ignored.
An XOR perceptron is a logic gate that outputs 0 if the inputs are the same and 1 if the inputs are different. <img src="../../../../images/xor-perceptron.png" width=450 height=200>
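A single perceptron can't represent XOR on its own, but it can be built by combining the AND, OR, and NOT perceptrons above. A minimal sketch (the weight and bias values below are illustrative choices for these notes, not from the course material):

import numpy as np

def perceptron(inputs, weights, bias):
    # step-function perceptron: fires 1 if the weighted sum plus bias is non-negative
    return int(np.dot(weights, inputs) + bias >= 0)

def and_gate(a, b):
    return perceptron([a, b], [1, 1], -1.5)

def or_gate(a, b):
    return perceptron([a, b], [1, 1], -0.5)

def not_gate(a):
    return perceptron([a], [-1], 0.5)

def xor_gate(a, b):
    # XOR = (a OR b) AND NOT (a AND b)
    return and_gate(or_gate(a, b), not_gate(and_gate(a, b)))

for a, b in [(0, 0), (0, 1), (1, 0), (1, 1)]:
    print(a, b, xor_gate(a, b))   # last column prints 0, 1, 1, 0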
Activation functions can be for binary or multi-class classification.

Binary activation functions include the sigmoid and tanh functions (covered below).

Multi-class activation functions include the softmax function (covered below).
Relevant sections from the Deep Learning Book - Chapter 6: Deep Feedforward Networks:

- 6.2.2 Output Units
- 6.3 Hidden Units
- 6.3.1 Rectified Linear Units and Their Generalizations
- 6.3.2 Logistic Sigmoid and Hyperbolic Tangent
- 6.6 Historical Notes
The perceptron's output is its activation function applied to a linear combination of the inputs:

\begin{equation*} h = \sum_i w_i x_i + b \end{equation*}

where:

- $x_i$ is input $i$
- $w_i$ is the weight applied to input $i$
- $b$ is the bias term
<img src="../../../../images/artificial-neural-network.png", width=450, height=200>
A sigmoid function is a mathematical function having an "S" shaped curve (sigmoid curve). Often, sigmoid function refers to the special case of the logistic function.
The sigmoid function is bounded between 0 and 1, and its output can be interpreted as a probability of success.
\begin{equation*} \sigma(x) = \frac{1}{1 + e^{-x}} \end{equation*}

where:

- $x$ is the input, here the weighted sum of the inputs plus the bias
In [3]:
def sigmoid(x):
s = 1 / (1 + np.exp(-x))
return s
In [4]:
inputs = np.array([2.1, 1.5,])
weights = np.array([0.2, 0.5,])
bias = -0.2
output = sigmoid(np.dot(weights, inputs) + bias)
print(output)
In [5]:
x = np.linspace(start=-10, stop=11, num=100)
y = sigmoid(x)
upper_bound = np.repeat([1.0,], len(x))
success_threshold = np.repeat([0.5,], len(x))
lower_bound = np.repeat([0.0,], len(x))
plt.plot(
# upper bound
x, upper_bound, 'w--',
# success threshold
x, success_threshold, 'w--',
# lower bound
x, lower_bound, 'w--',
# sigmoid
x, y
)
plt.grid(False)
plt.xlabel(r'$x$')
plt.ylabel(r'Probability of success')
plt.title('Sigmoid Function Example')
plt.show()
Just as the points (cos t, sin t) form a circle with a unit radius, the points (cosh t, sinh t) form the right half of the equilateral hyperbola.
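A quick numerical check of that relationship (this snippet is just an illustration written for these notes):

import numpy as np

# points (cosh t, sinh t) satisfy x^2 - y^2 = 1 (the unit hyperbola),
# just as (cos t, sin t) satisfy x^2 + y^2 = 1 (the unit circle)
t = np.linspace(-2, 2, 5)
print(np.cosh(t) ** 2 - np.sinh(t) ** 2)                  # all ones, up to float error
print(np.allclose(np.tanh(t), np.sinh(t) / np.cosh(t)))   # True: tanh = sinh / cosh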
The tanh function is bounded between -1 and 1, and its output can be interpreted as a probability of success, where an output value of -1 corresponds to 0%, 0 corresponds to 50%, and 1 corresponds to 100%.
The tanh function creates stronger gradients around zero, and therefore the derivatives are higher than for the sigmoid function. Why this is important can apparently be found in Efficient BackProp by LeCun et al. (1998). Also see this answer on Cross Validated for a representation of the derivative values.
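To see those stronger gradients numerically, compare the two derivatives at zero, where each function is steepest (a small check written for these notes; sigmoid here is the function defined earlier):

def sigmoid_prime(x):
    return sigmoid(x) * (1 - sigmoid(x))

def tanh_prime(x):
    return 1 - np.tanh(x) ** 2

# both derivatives peak at x = 0: 0.25 for the sigmoid versus 1.0 for tanh,
# so tanh passes roughly a four-times-larger gradient back near zero
print(sigmoid_prime(0.0))   # 0.25
print(tanh_prime(0.0))      # 1.0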
\begin{equation*} \tanh(x) = \frac{\sinh(x)}{\cosh(x)} = \frac{e^x - e^{-x}}{e^x + e^{-x}} \end{equation*}

where:

- $x$ is the input, here the weighted sum of the inputs plus the bias
In [6]:
inputs = np.array([2.1, 1.5,])
weights = np.array([0.2, 0.5,])
bias = -0.2
output = np.tanh(np.dot(weights, inputs) + bias)
print(output)
In [7]:
x = np.linspace(start=-10, stop=11, num=100)
y = np.tanh(x)
upper_bound = np.repeat([1.0,], len(x))
success_threshold = np.repeat([0.0,], len(x))
lower_bound = np.repeat([-1.0,], len(x))
plt.plot(
# upper bound
x, upper_bound, 'w--',
# success threshold
x, success_threshold, 'w--',
# lower bound
x, lower_bound, 'w--',
# tanh
x, y
)
plt.grid(False)
plt.xlabel(r'$x$')
plt.ylabel(r'Probability of success (0.00 = 50%)')
plt.title('Tanh Function Example')
plt.show()
In [8]:
def modified_tanh(x):
    # LeCun et al. (1998) recommend f(x) = 1.7159 * tanh((2 / 3) * x)
    return 1.7159 * np.tanh((2 / 3) * x)
x = np.linspace(start=-10, stop=11, num=100)
y = modified_tanh(x)
upper_bound = np.repeat([1.7159,], len(x))   # the asymptote of 1.7159 * tanh
success_threshold = np.repeat([0.0,], len(x))
lower_bound = np.repeat([-1.7159,], len(x))
plt.plot(
# upper bound
x, upper_bound, 'w--',
# success threshold
x, success_threshold, 'w--',
# lower bound
x, lower_bound, 'w--',
# modified tanh
x, y
)
plt.grid(False)
plt.xlabel(r'$x$')
plt.ylabel(r'Probability of success (0.00 = 50%)')
plt.title('Alternative Tanh Function Example')
plt.show()
Softmax regression is interested in multi-class classification (as opposed to only binary classification when using the sigmoid and tanh functions), and so the label $y$ can take on $K$ different values, rather than only two.
It is often used as the output layer in multilayer perceptrons to allow non-linear relationships to be learnt for multi-class problems.
From Deep Learning Book - Chapter 4: Numerical Computation
\begin{equation*} \text{softmax}(x)_i = \frac{\exp(x_i)} {\sum_{j=1}^n \exp(x_j)} \end{equation*}

Link for a good discussion on SO regarding the Python implementation of this function, from which the code below was taken.
In [9]:
def softmax(X):
    assert len(X.shape) == 2
    s = np.max(X, axis=1)
    s = s[:, np.newaxis]  # necessary step to do broadcasting
    e_x = np.exp(X - s)
    div = np.sum(e_x, axis=1)
    div = div[:, np.newaxis]  # ditto
    return e_x / div
X = np.array([[1, 2, 3, 6],
[2, 4, 5, 6],
[3, 8, 7, 6]])
y = softmax(X)
y
Out[9]:
In [10]:
# compared to tensorflow implementation
batch = np.asarray([[1,2,3,6], [2,4,5,6], [3, 8, 7, 6]])
x = tf.placeholder(tf.float32, shape=[None, 4])
y = tf.nn.softmax(x)
# no tf.Variables in this graph, so no initializer needs to be run
sess = tf.Session()
sess.run(y, feed_dict={x: batch})
Out[10]:
What if you want to perform an operation, such as predicting college admission, but don't know the correct weights? You'll need to learn the weights from example data, then use those weights to make the predictions.
We need a metric of how wrong the predictions are: the error.
The sum of squared errors over all data records $\mu$ and output units $j$:

\begin{equation*} E = \frac{1}{2} \sum_\mu \sum_j \left[ y_j^\mu - \hat y_j^\mu \right]^2 \end{equation*}

where the neural network prediction is:

\begin{equation*} \hat y_j^\mu = f \left( \sum_i w_{ij} x_i^\mu \right) \end{equation*}

therefore:

\begin{equation*} E = \frac{1}{2} \sum_\mu \sum_j \left[ y_j^\mu - f \left( \sum_i w_{ij} x_i^\mu \right) \right]^2 \end{equation*}

Find the weights $w_{ij}$ that minimize the squared error $E$.
How? Gradient descent.
remembering $h_j$ is the input to output unit $j$: \begin{equation*} h_j = \sum_i w_{ij} x_i \end{equation*}
where:

- $w_{ij}$ is the weight between input $i$ and output unit $j$
- $x_i$ is input $i$
The errors can be rewritten as: \begin{equation*} \delta_j = (y_j - \hat y_j) f^\prime (h_j) \end{equation*}
Giving the gradient step as: \begin{equation*} \Delta w_{ij} = \eta \delta_j x_i \end{equation*}
where:

- $\eta$ is the learning rate
- $\delta_j$ is the error term for output unit $j$
- $x_i$ is input $i$
In [11]:
# Defining the sigmoid function for activations
def sigmoid(x):
return 1 / ( 1 + np.exp(-x))
# Derivative of the sigmoid function
def sigmoid_prime(x):
return sigmoid(x) * (1 - sigmoid(x))
x = np.array([0.1, 0.3])
y = 0.2
weights = np.array([-0.8, 0.5])
# we would probably use a vector named "w" instead of a name like this,
# to make the code look more like the algebra
# The learning rate, eta in the weight step equation
learnrate = 0.5
# The neural network output
nn_output = sigmoid(x[0] * weights[0] + x[1] * weights[1])
# or nn_output = sigmoid(np.dot(weights, x))
# output error
error = y - nn_output
# error gradient
error_gradient = error * sigmoid_prime(np.dot(x, weights))
# sigmoid_prime(np.dot(x, weights)) is equal to nn_output * (1 - nn_output)
# Gradient descent step
del_w = [ learnrate * error_gradient * x[0],
learnrate * error_gradient * x[1]]
# or del_w = learnrate * error_gradient * x
Gradient descent is reliant on the beginning weight values. If these are poorly chosen, it could result in convergence to a local minimum, not the global minimum. Random weights can be used.
The momentum term increases for dimensions whose gradients point in the same directions and reduces updates for dimensions whose gradients change directions. As a result, we gain faster convergence and reduced oscillation.
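A minimal sketch of a momentum update, following the standard formulation $v \leftarrow \gamma v + \eta \nabla E$, $w \leftarrow w - v$ (the gradient and hyperparameter values here are illustrative, not from the course):

def momentum_step(w, grad, velocity, learnrate=0.5, gamma=0.9):
    # the velocity accumulates past gradients, damping oscillation
    # and speeding up movement along consistent directions
    velocity = gamma * velocity + learnrate * grad
    return w - velocity, velocity

w = np.array([-0.8, 0.5])
velocity = np.zeros_like(w)
grad = np.array([0.1, -0.2])   # illustrative gradient
for _ in range(3):
    w, velocity = momentum_step(w, grad, velocity)
print(w, velocity)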
NumPy arrays are row vectors by default, and the input_features.T
(transpose) transform still leaves it as a row vector. Instead we have to use (this one makes more sense):
input_features = input_features[:, None]
Alternatively you can create an array with two dimensions then transpose it:
input_features = np.array(input_features, ndmin=2).T
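A quick demonstration of both approaches:

import numpy as np

input_features = np.array([0.1, 0.3, 0.5])
print(input_features.shape)                   # (3,) -- 1-D, so .T is a no-op

col = input_features[:, None]                 # add a new axis -> column vector
print(col.shape)                              # (3, 1)

col2 = np.array(input_features, ndmin=2).T    # force 2-D, then transpose
print(col2.shape)                             # (3, 1)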
In [12]:
# network size is a 4x3x2 network
n_input = 4
n_hidden = 3
n_output = 2
# make some fake data
np.random.seed(42)
x = np.random.randn(4)
weights_in_hidden = np.random.normal(0, scale=0.1, size=(n_input, n_hidden))
weights_hidden_out = np.random.normal(0, scale=0.1, size=(n_hidden, n_output))
print('x shape\t\t\t= {}'.format(x.shape))
print('weights_in_hidden shape\t= {}'.format(weights_in_hidden.shape))
print('weights_hidden_out shape\t= {}'.format(weights_hidden_out.shape))
When doing the feed-forward pass we take the inputs and use the weights (which are randomly assigned) to produce an output. In backprop you can view this as the errors (the difference between the prediction and the actual expected value) being passed back through the network using the same weights.
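A minimal sketch of that forward pass, reusing x, the weight matrices, and the sigmoid function defined in the cells above (using sigmoid activations here is an assumption for illustration):

# forward pass through the 4x3x2 network defined above
hidden_layer_in = np.dot(x, weights_in_hidden)                   # shape (3,)
hidden_layer_out = sigmoid(hidden_layer_in)
output_layer_in = np.dot(hidden_layer_out, weights_hidden_out)   # shape (2,)
output_layer_out = sigmoid(output_layer_in)
print('hidden output\t= {}'.format(hidden_layer_out))
print('network output\t= {}'.format(output_layer_out))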
<img src="../../../../images/backprop-network.png", width=150>
From this example, you can see one of the effects of using the sigmoid function for the activations. The maximum derivative of the sigmoid function is 0.25, so the errors in the output layer get scaled down by at least a factor of four, and errors in the hidden layer are scaled down by at least a factor of sixteen. You can see that if you have a lot of layers, using a sigmoid activation function will quickly reduce the weight steps to tiny values in layers near the input.
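A quick illustration of that compounding effect (illustrative numbers written for these notes):

# each additional sigmoid layer multiplies the error signal by at most 0.25
error_signal = 1.0
for layer in range(1, 6):
    error_signal *= 0.25
    print('after layer {}: at most {:.6f}'.format(layer, error_signal))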