Part of iPyMacLern project.

Copyright (C) 2015 by Eka A. Kurniawan

eka.a.kurniawan(ta)gmail(tod)com

This program is free software: you can redistribute it and/or modify it under the terms of the GNU General Public License as published by the Free Software Foundation, either version 3 of the License, or (at your option) any later version.

This program is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for more details.

You should have received a copy of the GNU General Public License along with this program. If not, see http://www.gnu.org/licenses/.

Tested On


In [1]:
import sys
print("Python %d.%d.%d" % (sys.version_info.major, \
                           sys.version_info.minor, \
                           sys.version_info.micro))


Python 3.3.5

In [2]:
import numpy as np
print("NumPy %s" % np.__version__)


NumPy 1.8.0

In [3]:
import scipy
import scipy.io as sio
from scipy.special import expit
from scipy.optimize import minimize
print("SciPy %s" % scipy.__version__)


SciPy 0.13.3

In [4]:
import matplotlib
import matplotlib.pyplot as plt
print("matplotlib %s" % matplotlib.__version__)


matplotlib 1.3.1

Import Modules


In [5]:
import time
import math
from random import randint
import matplotlib.cm as cm
import pylab

Display Settings


In [6]:
# Display graph inline
%matplotlib inline

# Display graphs in 'retina' format for Macs with Retina displays. Otherwise, use the PNG or SVG format.
%config InlineBackend.figure_format = 'retina'
#%config InlineBackend.figure_format = 'PNG'
#%config InlineBackend.figure_format = 'SVG'

Handwritten Digit Dataset[1]

Randomly display digit images along with their labels from the training examples.


In [7]:
def display_images(images, labels, ttl_subplots=10):
    m = len(images)
    
    f, axarr = plt.subplots(ttl_subplots, ttl_subplots, figsize=(8, 8))
    for pi in range(ttl_subplots):
        for pj in range(ttl_subplots):
            i = randint(0, m-1)
            axarr[pi, pj].set_title(str(labels[i][0]))
            axarr[pi, pj].imshow(np.resize(images[i], (20,20)).transpose(), 
                                 cmap = cm.Greys_r)
            axarr[pi, pj].axes.get_xaxis().set_visible(False)
            axarr[pi, pj].axes.get_yaxis().set_visible(False)
            axarr[pi, pj].axes.get_xaxis().set_ticks([])
            axarr[pi, pj].axes.get_yaxis().set_ticks([])
    plt.tight_layout()
    plt.show()

In [8]:
training_examples = sio.loadmat('ex4data1.mat')
X = training_examples['X']
y = training_examples['y']

In [9]:
display_images(X, y, ttl_subplots=10)



Logistic Model

Sigmoid/Logistic Activation Function[2]

$$g(z) = \frac{1}{1+e^{-z}}$$

In [10]:
sigmoid = lambda z: 1 / (1 + math.exp(-z))
sigmoid_v = np.vectorize(sigmoid, otypes=[np.float])

The expit function is SciPy's built-in implementation of the sigmoid function.
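
As a quick sanity check (an illustrative snippet, not part of the original exercise), the hand-rolled sigmoid_v defined above and SciPy's expit agree numerically:

# Compare our vectorized sigmoid against SciPy's expit on a few sample points
zs_check = np.array([-5.0, -1.0, 0.0, 1.0, 5.0])
print(np.allclose(sigmoid_v(zs_check), expit(zs_check)))  # expected: True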


In [11]:
zs = np.linspace(-10.0, 10.0, num=201)
gs = expit(zs)

In [12]:
plt.scatter(zs, gs, color='red', s=2)
plt.title('Logistic/Sigmoid Function')
plt.xlabel('zs')
plt.ylabel('gs')
plt.show()


Logistic Regression Cost Function[3]

$$ Cost(h_\theta(x),y) = \left\{ \begin{array}{ll} -log(h_\theta(x)) & \mbox{if $y = 1$}\\ -log(1-h_\theta(x)) & \mbox{if $y = 0$} \end{array} \right. $$
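
For intuition, a confident correct prediction is cheap while a confident wrong one is expensive. For example, with $y = 1$:

$$ -\log(0.9) \approx 0.105 \qquad \mbox{versus} \qquad -\log(0.1) \approx 2.303 $$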

Visualisation of $-log(h_\theta(x))$ for $y = 1$


In [13]:
nlog = lambda x: -math.log(x)
nlog_v = np.vectorize(nlog, otypes=[np.float])

In [14]:
htx = np.linspace(0.01, 0.99, num=200)
lhtx = nlog_v(htx)

In [15]:
plt.scatter(htx, lhtx, color='red', s=2)
plt.title('Cost Function for y=1')
plt.xlabel('$h_\Theta(x)$')
plt.ylabel('$-log(h_\Theta(x))$')
plt.show()


Visualisation of $-log(1-h_\theta(x))$ for $y = 0$


In [16]:
nlog_inv = lambda x: -math.log(1 - x)
nlog_inv_v = np.vectorize(nlog_inv, otypes=[np.float])

In [17]:
htx = np.linspace(0.01, 0.99, num=200)
lhtx = nlog_inv_v(htx)

In [18]:
plt.scatter(htx, lhtx, color='red', s=2)
plt.title('Cost Function for y=0')
plt.xlabel('$h_\Theta(x)$')
plt.ylabel('$-log(1-h_\Theta(x))$')
plt.show()


Since $y \in \{0, 1\}$, the two cases can be combined into a single mathematical representation of the cost function.

$$J(\theta) = \frac{1}{m} \sum_{i=1}^{m} \left[ -y^{(i)}log(h_\theta(x^{(i)}))- (1-y^{(i)})log(1-h_\theta(x^{(i)})) \right]$$

Cost function for all 10 digit labels ($K$ = 10).

$$J(\theta) = \frac{1}{m} \sum_{i=1}^{m} \sum_{k=1}^{K} \left[ -y_k^{(i)}log((h_\theta(x^{(i)}))_k)- (1-y_k^{(i)})log(1-(h_\theta(x^{(i)}))_k) \right]$$
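
The per-label cost needs the scalar labels $y^{(i)} \in \{1, \dots, 10\}$ expanded into one-hot indicator vectors $y_k^{(i)}$. A minimal sketch of the encoding used later in this notebook (the variable names here are illustrative only; label 10 stands for digit 0):

# Illustrative one-hot encoding: labels 1..K become rows of an (m, K) indicator matrix
y_example = np.array([1, 3, 10])                    # hypothetical labels
K = 10
y_k_example = np.zeros((len(y_example), K))
y_k_example[np.arange(len(y_example)), y_example - 1] = 1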

Finally, add regularisation to the cost function. Note that the bias parameters (the $k = 0$ column of each $\theta$) are not regularised.

$$ J(\theta) = \frac{1}{m} \sum_{i=1}^{m} \sum_{k=1}^{K} \left[ -y_k^{(i)}log((h_\theta(x^{(i)}))_k)- (1-y_k^{(i)})log(1-(h_\theta(x^{(i)}))_k) \right] + \\ \frac{\lambda}{2m}\left[ \sum_{j=1}^{L^{hidden}} \sum_{k=1}^{L^{input}} (\theta_{j,k}^{(1)})^2 + \sum_{j=1}^{L^{output}} \sum_{k=1}^{L^{hidden}} (\theta_{j,k}^{(2)})^2 \right] $$

Sigmoid Gradient[3]

First derivative of the sigmoid function.

$$ g'(z) = \frac{d}{dz} g(z) = g(z)(1-g(z))$$
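
This identity follows by differentiating $g(z)$ directly:

$$ \frac{d}{dz}\left(\frac{1}{1+e^{-z}}\right) = \frac{e^{-z}}{(1+e^{-z})^2} = \frac{1}{1+e^{-z}} \cdot \left(1 - \frac{1}{1+e^{-z}}\right) = g(z)(1-g(z)) $$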

In [19]:
sgrad_in = np.random.rand(5000, 25)

Use our own implementation based on the formula above.


In [20]:
sigmoid_grad_v = np.vectorize(lambda z: sigmoid_v(z) * (1 - sigmoid_v(z)), 
                              otypes=[np.float])

In [21]:
sigmoid_grad_v(np.array([0,1,2,0]))


Out[21]:
array([ 0.25      ,  0.19661193,  0.10499359,  0.25      ])

In [22]:
tic = time.time()
sgv_out = sigmoid_grad_v(sgrad_in)
toc = time.time()
print('Runtime: %s seconds' % int(toc - tic))
sgv_out.shape


Runtime: 4 seconds
Out[22]:
(5000, 25)

Use the built-in SciPy expit sigmoid function.


In [23]:
expit_grad_v = np.vectorize(lambda z: expit(z) * (1 - expit(z)), 
                            otypes=[np.float])

In [24]:
expit_grad_v(np.array([0,1,2,0]))


Out[24]:
array([ 0.25      ,  0.19661193,  0.10499359,  0.25      ])

In [25]:
tic = time.time()
egv_out = expit_grad_v(sgrad_in)
toc = time.time()
print('Runtime: %s seconds' % int(toc - tic))
egv_out.shape


Runtime: 0 seconds
Out[25]:
(5000, 25)

Use the built-in SciPy logistic PDF as the sigmoid gradient function.

The CDF of the logistic distribution is the sigmoid function. It is available as the cdf method of scipy.stats.logistic, but cdf eventually calls expit, so there is no point in using that method. You can use the pdf method to compute the derivative of the sigmoid function, or the _pdf method which has less overhead, but "rolling your own" is faster.

Warren Weckesser


In [26]:
from scipy.stats import logistic

In [27]:
logistic._pdf(np.array([0,1,2,0]))


Out[27]:
array([ 0.25      ,  0.19661193,  0.10499359,  0.25      ])

In [28]:
tic = time.time()
sgpdfv_out = logistic._pdf(sgrad_in)
toc = time.time()
print('Runtime: %s seconds' % int(toc - tic))
sgpdfv_out.shape


Runtime: 0 seconds
Out[28]:
(5000, 25)

Backpropagation Algorithm[3]

Delta for the output layer ($\delta_k^{(3)}$) is calculated for each digit label $k$ out of all $K$ digits.

$$ \delta_k^{(3)} = a_k^{(3)} - y_k $$

Delta for the hidden layer ($\delta^{(2)}$) is calculated as follows, where $\odot$ denotes element-wise multiplication and the delta for the bias unit is discarded.

$$ \delta^{(2)} = (\theta^{(2)})^T \delta^{(3)} \odot g'(z^{(2)}) $$
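
The parameter gradients then accumulate these deltas over all $m$ training examples, adding the regularisation term to every column except the bias; this is what the vectorised code below computes.

$$ \frac{\partial J}{\partial \theta_{j,k}^{(l)}} = \frac{1}{m} \sum_{i=1}^{m} \delta_j^{(l+1),(i)} a_k^{(l),(i)} + \frac{\lambda}{m} \theta_{j,k}^{(l)} \quad \mbox{(regularisation term omitted when $k = 0$)} $$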

Logistic Regression Cost and Gradient Function


In [29]:
def compute_cost_grad(parameters, 
                      input_layer_size, hidden_layer_size, num_labels, 
                      Xb, y_k, m, lmda):
    
    # Extract parameters
    theta1_size = hidden_layer_size * (input_layer_size + 1)
    theta1 = parameters[:theta1_size].reshape(hidden_layer_size, input_layer_size + 1)
    theta2 = parameters[theta1_size:].reshape(num_labels, hidden_layer_size + 1)
    
    theta1_wob = theta1[:, 1:]                  # Theta 1 without bias unit
    theta2_wob = theta2[:, 1:]                  # Theta 2 without bias unit
    
    # Compute cost function (feedforward)
    z2 = np.dot(Xb, theta1.transpose())
    a2 = expit(z2)
    a2b = np.column_stack([np.ones(m), a2])     # a2 with bias unit
    z3 = np.dot(a2b, theta2.transpose())
    h_theta = expit(z3)
    
    J = (sum(sum((y_k * nlog_v(h_theta)) + ((1 - y_k) * nlog_inv_v(h_theta)))) / m) + \
        ((sum(sum(theta1_wob ** 2)) + sum(sum(theta2_wob ** 2))) * lmda / (2 * m))
    
    # Compute gradient using backpropagation algorithm
    delta_3 = h_theta - y_k                     # Error at the output layer
    # Error at the hidden layer: drop the bias column ([:, 1:]) and apply the
    # sigmoid gradient g'(z2) via logistic._pdf(z2)
    delta_2 = np.dot(theta2.transpose(), delta_3.transpose()).transpose()[:,1:] * logistic._pdf(z2)

    # Accumulate gradients over all examples and add regularisation (bias column excluded)
    theta1_grad = (np.dot(delta_2.transpose(), Xb)  / m) + np.hstack([np.zeros((hidden_layer_size,1)), ((lmda / m) * theta1_wob)])
    theta2_grad = (np.dot(delta_3.transpose(), a2b) / m) + np.hstack([np.zeros((num_labels,1)),        ((lmda / m) * theta2_wob)])
    
    grad = np.hstack([theta1_grad.flatten(), theta2_grad.flatten()])
    
    return J, grad

Neural Network[3]

Neural Network Parameters


In [30]:
input_layer_size  = 400  # 20x20px digit image as 400 units input layer
hidden_layer_size = 25   # 25 units hidden layer
num_labels = 10          # 10 labels or units output layer
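
Unrolling both weight matrices (each with one bias column) into a single parameter vector gives:

$$ 25 \times (400 + 1) + 10 \times (25 + 1) = 10025 + 260 = 10285 \mbox{ parameters} $$

which matches the parameter vector shapes shown further below.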

In [31]:
training_examples = sio.loadmat('ex4data1.mat')
X = training_examples['X']
y = training_examples['y'].transpose()[0]
m = len(X) # total training samples

Xb = np.column_stack([np.ones(m), X])       # X with bias unit
y_k = np.zeros((m, num_labels))
y_k[np.arange(m), y-1] = 1

In [32]:
X.shape # total training samples and input features


Out[32]:
(5000, 400)

In [33]:
y.shape # total training samples and output variables


Out[33]:
(5000,)

In [34]:
m


Out[34]:
5000

In [35]:
# Get precomputed parameters for testing
parameters = sio.loadmat('ex4weights.mat')
theta1 = parameters['Theta1']
theta2 = parameters['Theta2']
parameters = np.hstack([theta1.flatten(), theta2.flatten()])

In [36]:
theta1.shape # total hidden units and input units plus one bias unit


Out[36]:
(25, 401)

In [37]:
theta2.shape # total output units and hidden units plus one bias unit


Out[37]:
(10, 26)

Random Parameter Initialisation

Generate random parameters of size $L_{out} \times (L_{in} + 1)$ (including the bias unit), with values between $-\epsilon_{init}$ and $\epsilon_{init}$. For learning efficiency, $\epsilon_{init}$ is set to 0.12.
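
A common heuristic (not stated in this notebook, but widely used) ties $\epsilon_{init}$ to the layer sizes; for this architecture it yields approximately the 0.12 used here:

$$ \epsilon_{init} = \frac{\sqrt{6}}{\sqrt{L_{in} + L_{out}}} = \frac{\sqrt{6}}{\sqrt{400 + 25}} \approx 0.12 $$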


In [38]:
def generate_random_parameters(L_in, L_out, epsilon_init=0.12):
    return np.random.rand(L_out, 1 + L_in) * 2 * epsilon_init - epsilon_init

In [39]:
epsilon_init = 0.12
initial_theta1 = generate_random_parameters(input_layer_size, hidden_layer_size, epsilon_init)
initial_theta2 = generate_random_parameters(hidden_layer_size, num_labels, epsilon_init)
initial_parameters = np.hstack([initial_theta1.flatten(), initial_theta2.flatten()])

In [40]:
initial_theta1.shape


Out[40]:
(25, 401)

In [41]:
initial_theta2.shape


Out[41]:
(10, 26)

In [42]:
initial_parameters.shape


Out[42]:
(10285,)

Check Cost (Feedforward)


In [43]:
input_layer_size  = 400  # 20x20px digit image as 400 units input layer
hidden_layer_size = 25   # 25 units hidden layer
num_labels = 10          # 10 labels or units output layer

training_examples = sio.loadmat('ex4data1.mat')
X = training_examples['X']
y = training_examples['y'].transpose()[0]
m = len(X) # total training samples

Xb = np.column_stack([np.ones(m), X])       # X with bias unit
y_k = np.zeros((m, num_labels))
y_k[np.arange(m), y-1] = 1

# Get precomputed parameters for testing
parameters = sio.loadmat('ex4weights.mat')
theta1 = parameters['Theta1']
theta2 = parameters['Theta2']
parameters = np.hstack([theta1.flatten(), theta2.flatten()])

In [44]:
lmda = 0 # regularisation weight (lambda) = 0
         # expected J = 0.287629

tic = time.time()
J, grad = compute_cost_grad(parameters, 
                            input_layer_size, hidden_layer_size, num_labels, 
                            Xb, y_k, m, lmda)
toc = time.time()
print('Runtime: %s seconds' % int(toc - tic))


Runtime: 0 seconds

In [45]:
J


Out[45]:
0.2876291651613187

In [46]:
grad


Out[46]:
array([  6.18712766e-05,   0.00000000e+00,   0.00000000e+00, ...,
         9.66104721e-05,  -7.57736846e-04,   7.73329872e-04])

In [47]:
lmda = 1 # regularisation weight (lambda) = 1
         # expected J = 0.383769

tic = time.time()
J, grad = compute_cost_grad(parameters, 
                            input_layer_size, hidden_layer_size, num_labels, 
                            Xb, y_k, m, lmda)
toc = time.time()
print('Runtime: %s seconds' % int(toc - tic))


Runtime: 0 seconds

In [48]:
J


Out[48]:
0.38376985909092343

In [49]:
grad


Out[49]:
array([  6.18712766e-05,  -2.11248326e-12,   4.38829369e-13, ...,
         4.70513145e-05,  -5.01718610e-04,   5.07825789e-04])

Check Gradient (Backpropagation)


In [50]:
input_layer_size = 3
hidden_layer_size = 5
num_labels = 3
m = 5
lmda = 3 # regularisation weight (lambda)

Generate gradients using the compute_cost_grad function on a small debug network (gradient checking needs two cost evaluations per parameter, so a small network keeps it fast).


In [51]:
def generate_debug_weight(fan_out, fan_in):
    w = np.zeros((fan_out, 1 + fan_in))
    return np.reshape(np.sin(range(w.size)), w.shape) / 10

In [52]:
# Generate dummy parameters
theta1 = generate_debug_weight(hidden_layer_size, input_layer_size)
theta2 = generate_debug_weight(num_labels, hidden_layer_size)
parameters = np.hstack([theta1.flatten(), theta2.flatten()])

# Generate dummy input
X = np.zeros((m, input_layer_size))
X = np.reshape(np.sin(range(X.size)), X.shape) / 10

y = 1 + np.mod(range(m), num_labels)

Xb = np.column_stack([np.ones(m), X])       # X with bias unit
y_k = np.zeros((m, num_labels))
y_k[np.arange(m), y-1] = 1

cost, grad = compute_cost_grad(parameters, 
                               input_layer_size, hidden_layer_size, num_labels, 
                               Xb, y_k, m, lmda)

Generate the expected gradient using finite differences.
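
Each component of the numerical gradient below uses the central difference approximation with $\epsilon = 10^{-4}$:

$$ \frac{\partial J}{\partial \theta_p} \approx \frac{J(\theta + \epsilon e_p) - J(\theta - \epsilon e_p)}{2\epsilon} $$

where $e_p$ is the unit vector that perturbs only parameter $p$.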


In [53]:
def compute_numerical_gradient(theta):
    numgrad = np.zeros(theta.shape)
    perturb = np.zeros(theta.shape)
    e = 1e-4

    for p in range(len(theta)):
        perturb[p] = e
        loss1, g = compute_cost_grad(theta - perturb, 
                                     input_layer_size, hidden_layer_size, num_labels, 
                                     Xb, y_k, m, lmda)
        loss2, g = compute_cost_grad(theta + perturb, 
                                     input_layer_size, hidden_layer_size, num_labels, 
                                     Xb, y_k, m, lmda)
        numgrad[p] = (loss2 - loss1) / (2*e)
        perturb[p] = 0
    
    return numgrad

In [54]:
numgrad = compute_numerical_gradient(parameters)

Compare our backpropagation gradient with the expected numerical gradient.


In [55]:
grad, numgrad


Out[55]:
(array([ 0.00690654,  0.05037741,  0.0547017 ,  0.0087335 ,  0.0121251 ,
        -0.05757844, -0.01661326,  0.03962607,  0.00619848,  0.0247913 ,
        -0.03262097, -0.06004166, -0.00542356,  0.02532257,  0.05930642,
         0.03876422, -0.0120761 , -0.05762665, -0.04521974,  0.008762  ,
         0.10228635,  0.10144775,  0.10420847,  0.06174447,  0.00443674,
        -0.00682998,  0.09959317,  0.08965087,  0.10720022,  0.07664811,
         0.01633527, -0.01125638,  0.29693242,  0.17354254,  0.20209208,
         0.19520993,  0.1268967 ,  0.08866568]),
 array([ 0.00690654,  0.05037741,  0.0547017 ,  0.0087335 ,  0.0121251 ,
        -0.05757844, -0.01661326,  0.03962607,  0.00619848,  0.0247913 ,
        -0.03262097, -0.06004166, -0.00542356,  0.02532257,  0.05930642,
         0.03876422, -0.0120761 , -0.05762665, -0.04521974,  0.008762  ,
         0.10228635,  0.10144775,  0.10420847,  0.06174447,  0.00443674,
        -0.00682998,  0.09959317,  0.08965087,  0.10720022,  0.07664811,
         0.01633527, -0.01125638,  0.29693242,  0.17354254,  0.20209208,
         0.19520993,  0.1268967 ,  0.08866568]))

Visual comparison.


In [56]:
plt.scatter(range(len(numgrad)), numgrad, label='finite differences',
            color='red', marker="o", s=50)
plt.scatter(range(len(grad)), grad, label='backpropagation',
            color='blue', marker="+", s=50)

plt.xlabel('#parameter')
plt.ylabel('gradient')
plt.legend()

plt.show()


Numerical comparison. A good result should be less than $10^{-9}$.
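
The comparison below computes the relative difference between the two gradient vectors:

$$ \frac{\lVert \mathit{numgrad} - \mathit{grad} \rVert_2}{\lVert \mathit{numgrad} + \mathit{grad} \rVert_2} $$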


In [59]:
np.linalg.norm(numgrad-grad) / np.linalg.norm(numgrad+grad) # Should be less than 1e-9


Out[59]:
1.7663907467781258e-11

Train Neural Network


In [60]:
input_layer_size  = 400  # 20x20px digit image as 400 units input layer
hidden_layer_size = 25   # 25 units hidden layer
num_labels = 10          # 10 labels or units output layer

training_examples = sio.loadmat('ex4data1.mat')
X = training_examples['X']
y = training_examples['y'].transpose()[0]
m = len(X) # total training samples
lmda = 1 # regularisation weight (lambda)

Xb = np.column_stack([np.ones(m), X])       # X with bias unit
y_k = np.zeros((m, num_labels))
y_k[np.arange(m), y-1] = 1

tic = time.time()
# jac=True tells minimize that compute_cost_grad returns both the cost and its gradient
parameters_result = minimize(fun=compute_cost_grad, x0=initial_parameters, 
                             method='Newton-CG', jac=True, 
                             args=(input_layer_size, hidden_layer_size, num_labels, 
                                   Xb, y_k, m, lmda),
                             options={'maxiter': 50})
toc = time.time()
print('Runtime: %s seconds' % int(toc - tic))


Runtime: 181 seconds

In [61]:
trained_parameters = parameters_result['x']

In [62]:
theta1_size = hidden_layer_size * (input_layer_size + 1)
theta1 = trained_parameters[:theta1_size].reshape(hidden_layer_size, input_layer_size + 1)
theta2 = trained_parameters[theta1_size:].reshape(num_labels, hidden_layer_size + 1)

Visualize Weights


In [63]:
def display_weights(weights):
    ttl_subplots = 5
    f, axarr = plt.subplots(ttl_subplots, ttl_subplots, figsize=(8, 8))
    for pi in range(ttl_subplots):
        for pj in range(ttl_subplots):
            i = pi * 5 + pj
            axarr[pi, pj].imshow(np.resize(weights[i], (20,20)).transpose(), 
                                 cmap = cm.Greys_r)
            axarr[pi, pj].axes.get_xaxis().set_visible(False)
            axarr[pi, pj].axes.get_yaxis().set_visible(False)
            axarr[pi, pj].axes.get_xaxis().set_ticks([])
            axarr[pi, pj].axes.get_yaxis().set_ticks([])
    plt.tight_layout()
    plt.show()

In [64]:
display_weights(theta1[:,1:])


Make Prediction


In [65]:
def display_image(i, size=5):
    plt.figure(figsize = (size,size))
    plt.imshow(np.resize(X[i], (20,20)).transpose(), cmap = cm.Greys_r)
    plt.axis('off')
    plt.show()

In [66]:
def compute_feedforward(x, theta1, theta2):
    h1 = expit(np.dot(np.hstack([1, x]), theta1.transpose()))
    h2 = expit(np.dot(np.hstack([1, h1]), theta2.transpose()))
    return h2

In [67]:
def plot_output(h):
    plt.bar(range(num_labels), h, color='gray',  width=0.7, align="center")
    # Label 10 represents digit 0, so the last tick is displayed as '0'
    plt.xticks(range(num_labels), list(map(str, list(range(1, num_labels)) + [0])))
    plt.ylim([0.0,1.0])
    plt.xlabel('label')
    plt.ylabel('h')
    plt.show()

In [68]:
input_layer_size  = 400  # 20x20px digit image as 400 units input layer
hidden_layer_size = 25   # 25 units hidden layer
num_labels = 10          # 10 labels or units output layer

training_examples = sio.loadmat('ex4data1.mat')
X = training_examples['X']
y = training_examples['y'].transpose()[0]
m = len(X) # total training samples

In [69]:
i = randint(0, m-1) # randint is inclusive on both ends
display_image(i, 5)
print("Expected label: %d" % y[i])


Expected label: 3

In [70]:
h2 = compute_feedforward(X[i], theta1, theta2)
plot_output(h2)


Calculate Accuracy


In [71]:
# Get prediction labels
a2 = expit(np.dot(Xb, theta1.transpose()))
a2b = np.column_stack([np.ones(m), a2])     # a2 with bias unit
h_theta = expit(np.dot(a2b, theta2.transpose()))

# Compare predicted labels to the given labels and calculate the accuracy
prediction_labels = np.argmax(h_theta, axis=1)+1
accuracy = np.mean(prediction_labels == y) * 100
print("Training accuracy: %.2f%%" % accuracy)


Training accuracy: 99.66%

Display Mispredictions


In [72]:
mispredicted_indices = np.where(prediction_labels != y)[0]
mispredicted_indices


Out[72]:
array([ 142,  265,  765, 1440, 1617, 1701, 1764, 2112, 2171, 2506, 3043,
       3382, 3895, 4593, 4636, 4863, 4990])

In [78]:
for mispredicted_index in mispredicted_indices:
    display_image(mispredicted_index, 2)
    mispredicted_h2 = compute_feedforward(X[mispredicted_index], theta1, theta2)
    plot_output(mispredicted_h2)
    print("Predicted %d should be %d\n\n" % (prediction_labels[mispredicted_index], y[mispredicted_index]))


Predicted 8 should be 10


Predicted 4 should be 10


Predicted 2 should be 1


Predicted 1 should be 2


Predicted 4 should be 3


Predicted 7 should be 3


Predicted 7 should be 3


Predicted 1 should be 4


Predicted 9 should be 4


Predicted 3 should be 5


Predicted 10 should be 6


Predicted 10 should be 6


Predicted 3 should be 7


Predicted 10 should be 9


Predicted 3 should be 9


Predicted 7 should be 9


Predicted 3 should be 9



References

  1. Y. LeCun, C. Cortes, C.J.C. Burges, 2015. THE MNIST DATABASE of handwritten digits. http://yann.lecun.com/exdb/mnist/
  2. Andrew Y. Ng, 2015. Machine Learning. Week 3: Logistic Regression. Coursera. https://www.coursera.org/learn/machine-learning
  3. Andrew Y. Ng, 2015. Machine Learning. Week 4 and 5: Neural Networks. Coursera. https://www.coursera.org/learn/machine-learning