Neural Networks!!!

for problems with non-linear decision boundaries, neural networks learn non-linear hypotheses by learning their own features


In [18]:
import numpy as np
import scipy as sp
from scipy import special
import scipy.optimize as op
import pandas as pd
import matplotlib as mpl
import matplotlib.pyplot as plt
%matplotlib inline

#help(np.arange)
#np.arange?

Artificial Neural Network: Simple Neuron Model, Logistic Unit


In [19]:
from IPython.display import Image
Image(filename='images/neuron_model.png')


Out[19]:

input wires ("dendrites") -> body of neuron -> output ("axon")

$h_{\theta}(x) = g(\theta^{T}x) = \frac{1}{1+e^{-\theta^{T}x}}$
(thus, a simulated neuron with a sigmoid (logistic) activation function)

$x = \begin{bmatrix}x_{0}\\x_{1}\\x_{2}\\x_{3}\end{bmatrix}$     $\theta = \begin{bmatrix}\theta_{0}\\\theta_{1}\\\theta_{2}\\\theta_{3}\end{bmatrix}$

bias unit: $x_{0} = 1$

weights: $\theta$

simple neural network architecture


In [20]:
from IPython.display import Image
Image(filename='images/neural_net.png')


Out[20]:

A neural network is a group of these single neurons strung together

Layer 1: Input Layer

Layer 2: Hidden Layer

Layer 3: Output Layer

$a_{i}^{(j)} = $ "activation" of neuron/unit $i$ in layer $j$

$\Theta^{(j)} = $ matrix of weights controlling function mapping from layer $j$ to layer $j + 1$

sigmoid/logistic activation function applied to linear combinations of inputs

$a_{1}^{(2)} = g(\Theta_{10}^{(1)}x_{0} + \Theta_{11}^{(1)}x_{1} + \Theta_{12}^{(1)}x_{2} + \Theta_{13}^{(1)}x_{3})$

$a_{2}^{(2)} = g(\Theta_{20}^{(1)}x_{0} + \Theta_{21}^{(1)}x_{1} + \Theta_{22}^{(1)}x_{2} + \Theta_{23}^{(1)}x_{3})$

$a_{3}^{(2)} = g(\Theta_{30}^{(1)}x_{0} + \Theta_{31}^{(1)}x_{1} + \Theta_{32}^{(1)}x_{2} + \Theta_{33}^{(1)}x_{3})$

$h_{\Theta}(x) = a_{1}^{(3)} = g(\Theta_{10}^{(2)}a_{0}^{(2)} + \Theta_{11}^{(2)}a_{1}^{(2)} + \Theta_{12}^{(2)}a_{2}^{(2)} + \Theta_{13}^{(2)}a_{3}^{(2)})$

for 3 input units and 3 hidden units, $\Theta^{(1)} \in \mathbb{R}^{3\times4}$

if network has $s_{j}$ units in layer $j$, $s_{j + 1}$ units in layer $j + 1$, then $\Theta^{(j)}$ will be of dimension $s_{j + 1} \times (s_{j} + 1)$.

from the above example, with $j = 1$:

$s_{1} = 3$ units in layer $1$ (the input layer)

$s_{2} = 3$ units in layer $2$ (the hidden layer)

$\rightarrow \Theta^{(j)} = \Theta^{(1)} \in \mathbb{R}^{s_{2} \times (s_{1} + 1)} = \mathbb{R}^{3\times4}$

note: for example, $\Theta^{(2)}$ should be interpreted as the matrix of parameters/weights controlling the function that maps from the hidden units (layer 2) to the single output unit (layer 3)

the above artificial neural network defines a hypothesis $h$ that maps inputs $x$ to some space of predictions $y$; as $\Theta$ is varied, different hypotheses $h$ result

(feed-)forward propagation: vectorized implementation

$z$ values are weighted linear combinations of $x$ values that go into a particular neuron/activation unit

for example:
$z_{1}^{(2)} = \Theta_{10}^{(1)}x_{0} + \Theta_{11}^{(1)}x_{1} + \Theta_{12}^{(1)}x_{2} + \Theta_{13}^{(1)}x_{3}$

thus:
$a_{1}^{(2)} = g(\Theta_{10}^{(1)}x_{0} + \Theta_{11}^{(1)}x_{1} + \Theta_{12}^{(1)}x_{2} + \Theta_{13}^{(1)}x_{3}) = g(z_{1}^{(2)})$

$a_{2}^{(2)} = g(\Theta_{20}^{(1)}x_{0} + \Theta_{21}^{(1)}x_{1} + \Theta_{22}^{(1)}x_{2} + \Theta_{23}^{(1)}x_{3}) = g(z_{2}^{(2)})$

$a_{3}^{(2)} = g(\Theta_{30}^{(1)}x_{0} + \Theta_{31}^{(1)}x_{1} + \Theta_{32}^{(1)}x_{2} + \Theta_{33}^{(1)}x_{3}) = g(z_{3}^{(2)})$

$h_{\Theta}(x) = a_{1}^{(3)} = g(\Theta_{10}^{(2)}a_{0}^{(2)} + \Theta_{11}^{(2)}a_{1}^{(2)} + \Theta_{12}^{(2)}a_{2}^{(2)} + \Theta_{13}^{(2)}a_{3}^{(2)})$

$x = \begin{bmatrix}x_{0}\\x_{1}\\x_{2}\\x_{3}\end{bmatrix}$     $z^{(2)} = \begin{bmatrix}z_{1}^{(2)}\\z_{2}^{(2)}\\z_{3}^{(2)}\end{bmatrix}$

$z^{(2)}$ is a 3-dimensional vector (i.e., $z^{(2)} \in \mathbb{R}^{3}$), as is $a^{(2)}$

$z^{(2)} = \Theta^{(1)}x$

$a^{(2)} = g(z^{(2)})$

$g$ applies the sigmoid function element-wise to each of $z^{(2)}$'s elements

defining $a^{(1)}$ as the activations of the first (input) layer gives $a^{(1)} = x$

thus: $z^{(2)} = \Theta^{(1)}a^{(1)}$

add bias unit $a_{0}^{(2)} = 1$, making $a^{(2)} \in \mathbb{R}^4$

$z^{(3)} = \Theta^{(2)}a^{(2)}$

$h_{\Theta}(x) = a^{(3)} = g(z^{(3)})$

process of computing $h_{\Theta}(x)$ is called forward propagation because the activations are forward-propagated to the hidden layer, the activations of which are then forward-propagated to the output layer
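
a minimal NumPy sketch of these forward-propagation steps for the 3-input, 3-hidden-unit, single-output network above; the weight values in Theta1 and Theta2 below are arbitrary placeholders, not values from the course:

import numpy as np
from scipy import special

def forward_propagate(x, Theta1, Theta2):
    # layer 1: treat the inputs as activations and prepend the bias unit a_0^(1) = 1
    a1 = np.concatenate(([1.0], x))                        # a^(1) in R^4
    # layer 2: z^(2) = Theta^(1) a^(1), a^(2) = g(z^(2)), then prepend a_0^(2) = 1
    z2 = Theta1.dot(a1)                                    # z^(2) in R^3
    a2 = np.concatenate(([1.0], special.expit(z2)))        # a^(2) in R^4
    # layer 3: z^(3) = Theta^(2) a^(2), h_Theta(x) = a^(3) = g(z^(3))
    z3 = Theta2.dot(a2)
    return special.expit(z3)

# placeholder weights with the dimensions derived above:
# Theta^(1) is 3x4 = s_2 x (s_1 + 1), Theta^(2) is 1x4
Theta1 = 0.1 * np.ones((3, 4))
Theta2 = 0.1 * np.ones((1, 4))
print(forward_propagate(np.array([1.0, 0.0, 1.0]), Theta1, Theta2))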


single logistic units can compute logical functions of their inputs; large negative weights effectively negate the corresponding variables

Simple example: AND function

$x_{1},x_{2} \in \{0,1\}$

$y = x_{1} \wedge x_{2}$

$\Theta^{(1)} = \begin{bmatrix}-30\\20\\20\end{bmatrix}$

$\Theta_{10}^{(1)} = -30$

$\Theta_{11}^{(1)} = 20$

$\Theta_{12}^{(1)} = 20$

$h_{\Theta}(x) = g(\Theta_{10}^{(1)} + \Theta_{11}^{(1)}x_{1} + \Theta_{12}^{(1)}x_{2}) = g(-30 + 20x_{1} + 20x_{2})$

| $x_{1}$ | $x_{2}$ | $x_{1} \wedge x_{2} \approx h_{\Theta}(x)$ |
| --- | --- | --- |
| 0 | 0 | $0 \approx g\big[-30(1) + 20(0) + 20(0) \big] = g(-30)$ |
| 0 | 1 | $0 \approx g\big[-30(1) + 20(0) + 20(1) \big] = g(-10)$ |
| 1 | 0 | $0 \approx g\big[-30(1) + 20(1) + 20(0) \big] = g(-10)$ |
| 1 | 1 | $1 \approx g\big[-30(1) + 20(1) + 20(1) \big] = g(10)$ |

In [21]:
from IPython.display import Image
Image(filename='images/AND.png', width=400)


Out[21]:

In [22]:
# hypothesis, sigmoid function
def g(z):
    return special.expit(z)

x = np.arange(-10,10)
plt.plot(x,g(x))
plt.show()
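
using the same sigmoid (scipy.special.expit), a quick numerical check of the AND truth table above; logic_unit is just a helper name chosen for this check, not something from the course material:

import numpy as np
from scipy import special

def logic_unit(theta, x1, x2):
    # single neuron: h = g(theta_0 * 1 + theta_1 * x1 + theta_2 * x2)
    return special.expit(theta.dot(np.array([1.0, x1, x2])))

theta_and = np.array([-30.0, 20.0, 20.0])
for x1 in (0, 1):
    for x2 in (0, 1):
        print(x1, x2, round(logic_unit(theta_and, x1, x2), 4))  # ~1 only for (1, 1)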


Simple example: Negation function

$x_{1} \in \{0,1\}$

$y = \bar{x}_{1}$

$\Theta^{(1)} = \begin{bmatrix}10\\-20\end{bmatrix}$

$\Theta_{10}^{(1)} = 10$

$\Theta_{11}^{(1)} = -20$

$h_{\Theta}(x) = g(\Theta_{10}^{(1)} + \Theta_{11}^{(1)}x_{1}) = g(10 + -20x_{1})$

| $x_{1}$ | $\bar{x}_{1} \approx h_{\Theta}(x)$ |
| --- | --- |
| 0 | $1 \approx g\big[10(1) + -20(0) \big] = g(10)$ |
| 1 | $0 \approx g\big[10(1) + -20(1) \big] = g(-10)$ |

Simple example: (NOT $x_{1}$) AND (NOT $x_{2}$) function

$x_{1},x_{2} \in \{0,1\}$

$y = \bar{x}_{1} \wedge \bar{x}_{2}$

$\Theta^{(1)} = \begin{bmatrix}10\\-20\\-20\end{bmatrix}$

$\Theta_{10}^{(1)} = 10$

$\Theta_{11}^{(1)} = -20$

$\Theta_{12}^{(1)} = -20$

$h_{\Theta}(x) = g(\Theta_{10}^{(1)} + \Theta_{11}^{(1)}x_{1} + \Theta_{12}^{(1)}x_{2}) = g(10 + -20x_{1} + -20x_{2})$

| $x_{1}$ | $x_{2}$ | $\bar{x}_{1}$ | $\bar{x}_{2}$ | $\bar{x}_{1} \wedge \bar{x}_{2} \approx h_{\Theta}(x)$ |
| --- | --- | --- | --- | --- |
| 0 | 0 | 1 | 1 | $1 \approx g\big[10(1) + -20(0) + -20(0) \big] = g(10)$ |
| 0 | 1 | 1 | 0 | $0 \approx g\big[10(1) + -20(0) + -20(1) \big] = g(-10)$ |
| 1 | 0 | 0 | 1 | $0 \approx g\big[10(1) + -20(1) + -20(0) \big] = g(-10)$ |
| 1 | 1 | 0 | 0 | $0 \approx g\big[10(1) + -20(1) + -20(1) \big] = g(-30)$ |

Simple example: OR function

$x_{1},x_{2} \in \{0,1\}$

$y = x_{1} \vee x_{2}$

$\Theta^{(1)} = \begin{bmatrix}-10\\20\\20\end{bmatrix}$

$\Theta_{10}^{(1)} = -10$

$\Theta_{11}^{(1)} = 20$

$\Theta_{12}^{(1)} = 20$

$h_{\Theta}(x) = g(\Theta_{10}^{(1)} + \Theta_{11}^{(1)}x_{1} + \Theta_{12}^{(1)}x_{2}) = g(-10 + 20x_{1} + 20x_{2})$

| $x_{1}$ | $x_{2}$ | $x_{1} \vee x_{2} \approx h_{\Theta}(x)$ |
| --- | --- | --- |
| 0 | 0 | $0 \approx g\big[-10(1) + 20(0) + 20(0) \big] = g(-10)$ |
| 0 | 1 | $1 \approx g\big[-10(1) + 20(0) + 20(1) \big] = g(10)$ |
| 1 | 0 | $1 \approx g\big[-10(1) + 20(1) + 20(0) \big] = g(10)$ |
| 1 | 1 | $1 \approx g\big[-10(1) + 20(1) + 20(1) \big] = g(30)$ |
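
the same kind of spot check works for the negation, (NOT $x_{1}$) AND (NOT $x_{2}$), and OR units above; again, unit is just an illustrative helper name:

import numpy as np
from scipy import special

def unit(theta, *inputs):
    # prepend the bias unit x_0 = 1, then apply the sigmoid to theta^T x
    return round(special.expit(theta.dot(np.array((1.0,) + inputs))), 4)

theta_not     = np.array([ 10.0, -20.0])          # y = NOT x1
theta_not_and = np.array([ 10.0, -20.0, -20.0])   # y = (NOT x1) AND (NOT x2)
theta_or      = np.array([-10.0,  20.0,  20.0])   # y = x1 OR x2

print([unit(theta_not, x1) for x1 in (0.0, 1.0)])
print([unit(theta_not_and, x1, x2) for x1 in (0.0, 1.0) for x2 in (0.0, 1.0)])
print([unit(theta_or, x1, x2) for x1 in (0.0, 1.0) for x2 in (0.0, 1.0)])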

Non-linear Decision Boundaries Mean More Complex Classification

Additional layers allow the network to compute more complex functions

Example: XNOR

$x_{1},x_{2}$ are binary ($0$ or $1$)


In [23]:
ax = plt.subplot(111)
ax.spines['left'].set_position('zero')
ax.spines['bottom'].set_position('zero')
# examples where x1 == x2 (XNOR = 1)
plt.scatter([0,1],[0,1], s=100, marker='x', c='red')
# examples where x1 != x2 (XNOR = 0)
plt.scatter([1,0],[0,1], s=100, marker='o', facecolors='none', edgecolors='blue')
plt.xticks([0,1])
plt.yticks([0,1])
plt.show()


Neural Network with an input layer, a hidden layer, and an output layer


In [24]:
from IPython.display import Image
Image(filename='images/XNOR.png', width=500)


Out[24]:

$x_{1},x_{2} \in \{0,1\}$

$y = \big(x_{1} \oplus x_{2}\big)'$

| $x_{1}$ | $x_{2}$ | $a_{1}^{(2)} = x_{1} \wedge x_{2}$ | $a_{2}^{(2)} = \bar{x}_{1} \wedge \bar{x}_{2}$ | $a_{1}^{(2)} \vee a_{2}^{(2)} \approx h_{\Theta}(x)$ |
| --- | --- | --- | --- | --- |
| 0 | 0 | 0 | 1 | 1 |
| 0 | 1 | 0 | 0 | 0 |
| 1 | 0 | 0 | 0 | 0 |
| 1 | 1 | 1 | 0 | 1 |
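
putting the pieces together: a sketch of the two-layer XNOR network, reusing the weights from the single-gate examples (the hidden layer computes $a_{1}^{(2)}$ and $a_{2}^{(2)}$, the output unit computes their OR):

import numpy as np
from scipy import special

# Theta^(1): row 1 is the AND unit, row 2 is the (NOT x1) AND (NOT x2) unit
Theta1 = np.array([[-30.0,  20.0,  20.0],
                   [ 10.0, -20.0, -20.0]])
# Theta^(2): the OR unit applied to [1, a1, a2]
Theta2 = np.array([[-10.0, 20.0, 20.0]])

def xnor(x1, x2):
    a1 = np.array([1.0, x1, x2])                                  # input layer with bias
    a2 = np.concatenate(([1.0], special.expit(Theta1.dot(a1))))   # hidden layer with bias
    return special.expit(Theta2.dot(a2))[0]                       # output layer

for x1 in (0, 1):
    for x2 in (0, 1):
        print(x1, x2, round(xnor(x1, x2), 4))   # ~1 when x1 == x2, ~0 otherwise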

Multiclass Classification

multiple output units: one-vs-all

$h_{\Theta}(x) \in \mathbb{R}^{4}$

want:

when pedestrian, $h_{\Theta}(x) \approx \begin{bmatrix}1\\0\\0\\0\end{bmatrix}$

when car, $h_{\Theta}(x) \approx \begin{bmatrix}0\\1\\0\\0\end{bmatrix}$

when motorcycle, $h_{\Theta}(x) \approx \begin{bmatrix}0\\0\\1\\0\end{bmatrix}$

when truck, $h_{\Theta}(x) \approx \begin{bmatrix}0\\0\\0\\1\end{bmatrix}$

training set: $(x^{(1)},y^{(1)}), (x^{(2)},y^{(2)}), ... , (x^{(m)},y^{(m)})$

$y^{(i)}$ is one of the vectors above depending on the corresponding image $x^{(i)}$

so one training example will be one pair $(x^{(i)},y^{(i)})$

the goal is to train the network so that it outputs values with $h_{\Theta}(x^{(i)}) \approx y^{(i)}$ for each training example
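
the class labels therefore have to be converted into these indicator vectors before training; a minimal sketch of that encoding, assuming integer labels $1..K$ (the function name one_hot is just a choice made here):

import numpy as np

def one_hot(labels, K):
    # map each integer label in 1..K to a K-dimensional indicator vector
    Y = np.zeros((labels.shape[0], K))
    Y[np.arange(labels.shape[0]), labels.astype(int) - 1] = 1.0
    return Y

labels = np.array([1, 3, 4, 2])   # e.g. pedestrian, motorcycle, truck, car
print(one_hot(labels, 4))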


MNIST data

5000 training examples, where each example is a 20 pixel by 20 pixel grayscale image of a handwritten digit (0–9)

each pixel is represented by a floating-point number indicating the grayscale intensity at that pixel location

thus, each training example in the data set is a 400-dimensional vector, which becomes a single row in a $5000 \times 400$ matrix $X$

$X = \begin{bmatrix}--- (x^{(1)})^{T} ---\\--- (x^{(2)})^{T} ---\\.\\.\\.\\--- (x^{(m)})^{T} ---\end{bmatrix}$

$y$ is a 5000-dimensional vector that contains labels for the training set


In [25]:
X = np.loadtxt('X.csv', delimiter=',')
y = np.loadtxt('y.csv', delimiter=',')
# display one training example (row 4999) as a 20x20 grayscale image
x = X[4999].reshape((20, 20))
print(x.shape)
plt.matshow(x, cmap=mpl.cm.gray_r)
plt.show()


(20, 20)

In [26]:
# display a random 10x10 grid of training examples
indexes = np.random.randint(0, 5000, size=100)
from mpl_toolkits.axes_grid1 import ImageGrid

fig = plt.figure()
grid = ImageGrid(fig, 111,             # similar to subplot(111)
                 nrows_ncols=(10, 10),
                 axes_pad=0)           # pad between axes in inches

for i in range(100):
    grid[i].imshow(X[indexes[i]].reshape((20,20)), cmap=mpl.cm.gray_r, interpolation='nearest')
    grid[i].set_axis_off()

plt.show()


Vectorizing Logistic Regression

use multiple one-vs-all logistic regression models to build a multi-class classifier

since 10 classes, need to train 10 separate logistic regression classifiers

$X = \begin{bmatrix}--- (x^{(1)})^{T} ---\\--- (x^{(2)})^{T} ---\\.\\.\\.\\--- (x^{(m)})^{T} ---\end{bmatrix}$

$\theta = \begin{bmatrix}\theta_{0}\\\theta_{1}\\.\\.\\.\\\theta_{n}\end{bmatrix}$

matrix product ($a^{T}b = b^{T}a$, if $a$ and $b$ are vectors)
$X\theta = \begin{bmatrix}--- (x^{(1)})^{T}\theta ---\\--- (x^{(2)})^{T}\theta ---\\.\\.\\.\\--- (x^{(m)})^{T}\theta ---\end{bmatrix} = \begin{bmatrix}--- \theta^{T}(x^{(1)}) ---\\--- \theta^{T}(x^{(2)}) ---\\.\\.\\.\\--- \theta^{T}(x^{(m)}) ---\end{bmatrix}$

for each $i$, calculate $\theta^{T}(x^{(i)})$
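
a quick numerical sanity check of this claim, with a small random $X$ and $\theta$ (the sizes below are arbitrary):

import numpy as np

np.random.seed(0)
X_small = np.random.randn(5, 3)        # 5 examples, 3 features (arbitrary sizes)
theta_small = np.random.randn(3)

# row i of X.dot(theta) equals theta^T x^(i) computed one example at a time
rowwise = np.array([theta_small.dot(X_small[i]) for i in range(5)])
print(np.allclose(X_small.dot(theta_small), rowwise))   # True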

unregularized cost function for logistic regression:
$J(\theta) = \frac{1}{m}\sum\limits_{i=1}^m\big[-y^{(i)}\log\big(h_{\theta}(x^{(i)})\big)-(1-y^{(i)})\log\big(1 - h_{\theta}(x^{(i)})\big)\big]$


In [27]:
def g(z):
    return sp.special.expit(z)

def J(m,y,X,theta):
    # h_theta(x^(i)) for all m examples at once: g applied to the matrix product X.theta
    h = g(X.dot(theta))
    # -y^(i) log(h) - (1 - y^(i)) log(1 - h), summed over the m examples via the dot products
    summation = (-1 * y).dot(np.log(h)) - (1 - y).dot(np.log(1 - h))
    return 1.0/m * summation

In [28]:
theta = np.zeros(X.shape[1])   # one parameter per feature
J(X.shape[0],y,X,theta)


Out[28]:
0.6931471805599453

Vectorizing the Gradient

the gradient of the unregularized logistic regression cost is a vector where the $j$-th element is defined as:
$\frac{\partial{J}}{\partial{\theta_{j}}} = \frac{1}{m}\sum\limits_{i=1}^{m}\big((h_{\theta}(x^{(i)})-y^{(i)})x_{j}^{(i)}\big)$

all partial derivatives for $\theta_{j}$:
$\begin{bmatrix} \frac{\partial{J}}{\partial{\theta_{0}}}\\ \frac{\partial{J}}{\partial{\theta_{1}}}\\ \frac{\partial{J}}{\partial{\theta_{2}}}\\ \vdots\\ \frac{\partial{J}}{\partial{\theta_{n}}} \end{bmatrix}=\frac{1}{m} \begin{bmatrix} \sum\limits_{i=1}^{m}\big((h_{\theta}(x^{(i)})-y^{(i)})x_{0}^{(i)}\big)\\ \sum\limits_{i=1}^{m}\big((h_{\theta}(x^{(i)})-y^{(i)})x_{1}^{(i)}\big)\\ \sum\limits_{i=1}^{m}\big((h_{\theta}(x^{(i)})-y^{(i)})x_{2}^{(i)}\big)\\ \vdots\\ \sum\limits_{i=1}^{m}\big((h_{\theta}(x^{(i)})-y^{(i)})x_{n}^{(i)}\big) \end{bmatrix}$

$=\frac{1}{m}\sum\limits_{i=1}^{m}\big((h_{\theta}(x^{(i)})-y^{(i)})x^{(i)}\big)$

$=\frac{1}{m}X^{T}(h_{\theta}(x)-y)$

where $h_{\theta}(x)-y = \begin{bmatrix} h_{\theta}(x^{(1)})-y^{(1)}\\ h_{\theta}(x^{(2)})-y^{(2)}\\ \vdots\\ h_{\theta}(x^{(m)})-y^{(m)}\\ \end{bmatrix}$

note:
$x^{(i)}$ is a vector
$h_{\theta}(x^{(i)})-y^{(i)}$ is a scalar
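
a vectorized NumPy sketch of this gradient, written to pair with the cost function J above; the name grad and the tiny synthetic data are just choices made here for illustration:

import numpy as np
from scipy import special

def grad(m, y, X, theta):
    # h_theta(x^(i)) for all m examples at once: g(X theta)
    h = special.expit(X.dot(theta))
    # (1/m) * X^T (h - y): the vector of partial derivatives dJ/dtheta_j
    return 1.0 / m * X.T.dot(h - y)

# tiny synthetic shape check: 5 examples, 3 parameters (including a bias column)
X_demo = np.hstack([np.ones((5, 1)), np.random.randn(5, 2)])
y_demo = np.array([0.0, 1.0, 1.0, 0.0, 1.0])
print(grad(5, y_demo, X_demo, np.zeros(3)).shape)   # (3,)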