Neural Networks!!!

for non-linearities, learn non-linear hypotheses by learning features

import numpy as np
import scipy as sp
from scipy import special
import scipy.optimize as op
import pandas as pd
import matplotlib as mpl
import matplotlib.pyplot as plt
%matplotlib inline


Artificial Neural Network: Simple Neuron Model, Logistic Unit

from IPython.display import Image


input wires ("dendrites") -> body of neuron -> output ("axon")

$h_{\theta}(x) = g(\theta^{T}x) = \frac{1}{1+e^{-\theta^{T}x}}$
(thus, a simulated neuron with a sigmoid (logistic) activation function)

$x = \begin{bmatrix}x_{0}\\x_{1}\\x_{2}\\x_{3}\end{bmatrix}$     $\theta = \begin{bmatrix}\theta_{0}\\\theta_{1}\\\theta_{2}\\\theta_{3}\end{bmatrix}$

bias unit: $x_{0} = 1$

weights: $\theta$
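A minimal sketch of one logistic unit, to make the formula above concrete. The weight and input values are hypothetical, chosen only so the arithmetic is easy to follow; `sigmoid` is written out with NumPy rather than taken from SciPy.

```python
import numpy as np

def sigmoid(z):
    """Logistic function g(z) = 1 / (1 + exp(-z))."""
    return 1.0 / (1.0 + np.exp(-z))

# Hypothetical weights and inputs, for illustration only.
theta = np.array([-1.0, 0.5, 0.25, 2.0])   # theta_0 .. theta_3
features = np.array([2.0, 4.0, -0.5])      # x_1 .. x_3

# Prepend the bias unit x_0 = 1 so that x lines up with theta.
x = np.concatenate(([1.0], features))

z = theta @ x      # theta^T x = -1 + 1 + 1 - 1 = 0
h = sigmoid(z)     # g(0) = 0.5
print(z, h)
```

With these values the weighted sum is exactly zero, so the unit sits at the midpoint of the sigmoid.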

simple neural network architecture



A neural network is a group of these single neurons strung together

Layer 1: Input Layer

Layer 2: Hidden Layer

Layer 3: Output Layer

$a_{i}^{(j)} = $ "activation" of neuron/unit $i$ in layer $j$

$\Theta^{(j)} = $ matrix of weights controlling function mapping from layer $j$ to layer $j + 1$

sigmoid/logistic activation function applied to linear combinations of inputs

$a_{1}^{(2)} = g(\Theta_{10}^{(1)}x_{0} + \Theta_{11}^{(1)}x_{1} + \Theta_{12}^{(1)}x_{2} + \Theta_{13}^{(1)}x_{3})$

$a_{2}^{(2)} = g(\Theta_{20}^{(1)}x_{0} + \Theta_{21}^{(1)}x_{1} + \Theta_{22}^{(1)}x_{2} + \Theta_{23}^{(1)}x_{3})$

$a_{3}^{(2)} = g(\Theta_{30}^{(1)}x_{0} + \Theta_{31}^{(1)}x_{1} + \Theta_{32}^{(1)}x_{2} + \Theta_{33}^{(1)}x_{3})$

$h_{\Theta}(x) = a_{1}^{(3)} = g(\Theta_{10}^{(2)}a_{0}^{(2)} + \Theta_{11}^{(2)}a_{1}^{(2)} + \Theta_{12}^{(2)}a_{2}^{(2)} + \Theta_{13}^{(2)}a_{3}^{(2)})$

for 3 input units and 3 hidden units, $\Theta^{(1)} \in \mathbb{R}^{3\times4}$

if network has $s_{j}$ units in layer $j$, $s_{j + 1}$ units in layer $j + 1$, then $\Theta^{(j)}$ will be of dimension $s_{j + 1} \times (s_{j} + 1)$.

from above example:

for $j = 1$, i.e. $\Theta^{(1)}$:

$s_{1} = 3$ units in layer 1

$s_{2} = 3$ units in layer 2

$\rightarrow \Theta^{(1)} \in \mathbb{R}^{s_{2} \times (s_{1} + 1)} = \mathbb{R}^{3\times4}$
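The dimension rule can be checked mechanically. A small sketch, where `theta_shape` and `layer_sizes` are hypothetical names introduced here, and the layer sizes 3, 3, 1 are those of the example network:

```python
def theta_shape(s_j, s_next):
    """Shape of Theta^(j): s_{j+1} rows and s_j + 1 columns (the +1 is the bias unit)."""
    return (s_next, s_j + 1)

# Units per layer in the example network: input, hidden, output.
layer_sizes = [3, 3, 1]

# One weight matrix per pair of adjacent layers.
shapes = [theta_shape(a, b) for a, b in zip(layer_sizes[:-1], layer_sizes[1:])]
print(shapes)  # [(3, 4), (1, 4)]
```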

note: for example, $\Theta^{(2)}$ should be interpreted as the matrix of parameters/weights that controls the function that maps from the hidden units (in layer 2) to the one (layer 3) output unit

the artificial neural network above defines a function $h$ that maps inputs $x$ to predictions $y$; varying $\Theta$ yields different hypotheses $h$

(feed-)forward propagation: vectorized implementation

$z$ values are weighted linear combinations of $x$ values that go into a particular neuron/activation unit

for example:
$z_{1}^{(2)} = \Theta_{10}^{(1)}x_{0} + \Theta_{11}^{(1)}x_{1} + \Theta_{12}^{(1)}x_{2} + \Theta_{13}^{(1)}x_{3}$

$a_{1}^{(2)} = g(\Theta_{10}^{(1)}x_{0} + \Theta_{11}^{(1)}x_{1} + \Theta_{12}^{(1)}x_{2} + \Theta_{13}^{(1)}x_{3}) = g(z_{1}^{(2)})$

$a_{2}^{(2)} = g(\Theta_{20}^{(1)}x_{0} + \Theta_{21}^{(1)}x_{1} + \Theta_{22}^{(1)}x_{2} + \Theta_{23}^{(1)}x_{3}) = g(z_{2}^{(2)})$

$a_{3}^{(2)} = g(\Theta_{30}^{(1)}x_{0} + \Theta_{31}^{(1)}x_{1} + \Theta_{32}^{(1)}x_{2} + \Theta_{33}^{(1)}x_{3}) = g(z_{3}^{(2)})$

$h_{\Theta}(x) = a_{1}^{(3)} = g(\Theta_{10}^{(2)}a_{0}^{(2)} + \Theta_{11}^{(2)}a_{1}^{(2)} + \Theta_{12}^{(2)}a_{2}^{(2)} + \Theta_{13}^{(2)}a_{3}^{(2)})$

$x = \begin{bmatrix}x_{0}\\x_{1}\\x_{2}\\x_{3}\end{bmatrix}$     $z^{(2)} = \begin{bmatrix}z_{1}^{(2)}\\z_{2}^{(2)}\\z_{3}^{(2)}\end{bmatrix}$

$z^{(2)}$ is a 3-dimensional vector (i.e., $z^{(2)} \in \mathbb{R}^{3}$), as is $a^{(2)}$

$z^{(2)} = \Theta^{(1)}x$

$a^{(2)} = g(z^{(2)})$

$g$ applies the sigmoid function element-wise to each of $z^{(2)}$'s elements

if $a^{(1)}$ can be defined as activations of the first (input) layer, then $a^{(1)} = x$

thus: $z^{(2)} = \Theta^{(1)}a^{(1)}$

add bias unit $a_{0}^{(2)} = 1$, making $a^{(2)} \in \mathbb{R}^4$

$z^{(3)} = \Theta^{(2)}a^{(2)}$

$h_{\Theta}(x) = a^{(3)} = g(z^{(3)})$

the process of computing $h_{\Theta}(x)$ is called forward propagation because the input activations are propagated forward to the hidden layer, whose activations are in turn propagated forward to the output layer
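The vectorized steps above can be sketched as a short loop over the weight matrices. This is an illustration under assumptions, not a reference implementation: `forward` is a hypothetical helper, and the weights are random placeholders with the $3 \times 4$ and $1 \times 4$ shapes from the example.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def forward(x, thetas):
    """Forward-propagate one input vector x (given without its bias unit)."""
    a = np.asarray(x, dtype=float)       # a^(1) = x
    for theta in thetas:
        a = np.concatenate(([1.0], a))   # add bias unit a_0 = 1
        z = theta @ a                    # z^(j+1) = Theta^(j) a^(j)
        a = sigmoid(z)                   # a^(j+1) = g(z^(j+1)), element-wise
    return a                             # h_Theta(x)

# Placeholder weights: random values, not trained parameters.
rng = np.random.default_rng(0)
theta1 = rng.normal(size=(3, 4))
theta2 = rng.normal(size=(1, 4))

h = forward([0.5, -1.2, 3.0], [theta1, theta2])
print(h.shape)  # (1,)
```

Writing the layers as a list keeps the same code usable for deeper networks, since each step only needs the previous layer's activations.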

large negative weights to negate variables

Simple example: AND function

$x_{1},x_{2} \in \{0,1\}$

$y = x_{1} \wedge x_{2}$

$\Theta^{(1)} = \begin{bmatrix}-30\\20\\20\end{bmatrix}$

$\Theta_{10}^{(1)} = -30$

$\Theta_{11}^{(1)} = 20$

$\Theta_{12}^{(1)} = 20$

$h_{\Theta}(x) = g(\Theta_{10}^{(1)} + \Theta_{11}^{(1)}x_{1} + \Theta_{12}^{(1)}x_{2}) = g(-30 + 20x_{1} + 20x_{2})$

| $x_{1}$ | $x_{2}$ | $x_{1} \wedge x_{2} \approx h_{\Theta}(x)$ |
|---|---|---|
| 0 | 0 | $0 \approx g\big[-30(1) + 20(0) + 20(0) \big] = g(-30)$ |
| 0 | 1 | $0 \approx g\big[-30(1) + 20(0) + 20(1) \big] = g(-10)$ |
| 1 | 0 | $0 \approx g\big[-30(1) + 20(1) + 20(0) \big] = g(-10)$ |
| 1 | 1 | $1 \approx g\big[-30(1) + 20(1) + 20(1) \big] = g(10)$ |

from IPython.display import Image
Image(filename='images/AND.png', width=400)


# hypothesis, sigmoid function
def g(z):
    return special.expit(z)

x = np.arange(-10,10)

Simple example: Negation function

$x_{1} \in \{0,1\}$

$y = \bar{x}_{1}$

$\Theta^{(1)} = \begin{bmatrix}10\\-20\end{bmatrix}$

$\Theta_{10}^{(1)} = 10$

$\Theta_{11}^{(1)} = -20$

$h_{\Theta}(x) = g(\Theta_{10}^{(1)} + \Theta_{11}^{(1)}x_{1}) = g(10 + -20x_{1})$

| $x_{1}$ | $\bar{x}_{1} \approx h_{\Theta}(x)$ |
|---|---|
| 0 | $1 \approx g\big[10(1) + -20(0) \big] = g(10)$ |
| 1 | $0 \approx g\big[10(1) + -20(1) \big] = g(-10)$ |

Simple example: (NOT $x_{1}$) AND (NOT $x_{2}$) function

$x_{1},x_{2} \in \{0,1\}$

$y = \bar{x}_{1} \wedge \bar{x}_{2}$

$\Theta^{(1)} = \begin{bmatrix}10\\-20\\-20\end{bmatrix}$

$\Theta_{10}^{(1)} = 10$

$\Theta_{11}^{(1)} = -20$

$\Theta_{12}^{(1)} = -20$

$h_{\Theta}(x) = g(\Theta_{10}^{(1)} + \Theta_{11}^{(1)}x_{1} + \Theta_{12}^{(1)}x_{2}) = g(10 + -20x_{1} + -20x_{2})$

| $x_{1}$ | $x_{2}$ | $\bar{x}_{1}$ | $\bar{x}_{2}$ | $\bar{x}_{1} \wedge \bar{x}_{2} \approx h_{\Theta}(x)$ |
|---|---|---|---|---|
| 0 | 0 | 1 | 1 | $1 \approx g\big[10(1) + -20(0) + -20(0) \big] = g(10)$ |
| 0 | 1 | 1 | 0 | $0 \approx g\big[10(1) + -20(0) + -20(1) \big] = g(-10)$ |
| 1 | 0 | 0 | 1 | $0 \approx g\big[10(1) + -20(1) + -20(0) \big] = g(-10)$ |
| 1 | 1 | 0 | 0 | $0 \approx g\big[10(1) + -20(1) + -20(1) \big] = g(-30)$ |

Simple example: OR function

$x_{1},x_{2} \in \{0,1\}$

$y = x_{1} \vee x_{2}$

$\Theta^{(1)} = \begin{bmatrix}-10\\20\\20\end{bmatrix}$

$\Theta_{10}^{(1)} = -10$

$\Theta_{11}^{(1)} = 20$

$\Theta_{12}^{(1)} = 20$

$h_{\Theta}(x) = g(\Theta_{10}^{(1)} + \Theta_{11}^{(1)}x_{1} + \Theta_{12}^{(1)}x_{2}) = g(-10 + 20x_{1} + 20x_{2})$

| $x_{1}$ | $x_{2}$ | $x_{1} \vee x_{2} \approx h_{\Theta}(x)$ |
|---|---|---|
| 0 | 0 | $0 \approx g\big[-10(1) + 20(0) + 20(0) \big] = g(-10)$ |
| 0 | 1 | $1 \approx g\big[-10(1) + 20(0) + 20(1) \big] = g(10)$ |
| 1 | 0 | $1 \approx g\big[-10(1) + 20(1) + 20(0) \big] = g(10)$ |
| 1 | 1 | $1 \approx g\big[-10(1) + 20(1) + 20(1) \big] = g(30)$ |
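The four single-unit examples (AND, negation, NOT-AND-NOT, OR) can be checked numerically with the weights given above. A sketch: `unit` and `bit` are hypothetical helpers, and thresholding at 0.5 is an assumption made here to turn the sigmoid output into a 0/1 value.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def unit(theta, *inputs):
    """Output of one logistic unit; the bias input x_0 = 1 is prepended."""
    x = np.array([1.0, *inputs])
    return sigmoid(np.dot(theta, x))

def bit(value):
    """Threshold a sigmoid output at 0.5 (an assumption for display purposes)."""
    return int(value > 0.5)

# Weights copied from the examples above.
AND = np.array([-30.0, 20.0, 20.0])
NOT = np.array([10.0, -20.0])
NOR = np.array([10.0, -20.0, -20.0])   # (NOT x1) AND (NOT x2)
OR = np.array([-10.0, 20.0, 20.0])

print("x1 x2 AND NOR OR")
for x1 in (0, 1):
    for x2 in (0, 1):
        print(x1, x2, bit(unit(AND, x1, x2)), bit(unit(NOR, x1, x2)), bit(unit(OR, x1, x2)))

print("NOT 0 ->", bit(unit(NOT, 0)), " NOT 1 ->", bit(unit(NOT, 1)))
```

Because $g(10)$ and $g(-10)$ are already within about $5 \times 10^{-5}$ of 1 and 0, the weights only need to be large relative to the inputs, not exactly these values.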

Non-linear Decision Boundaries Mean More Complex Classification

Additional layers allow more complex functions

Example: XNOR

$x_{1},x_{2}$ are binary ($0$ or $1$)

ax = plt.subplot(111)
plt.scatter([0,1],[0,1], s=100, marker='x',c='red')
plt.scatter([1,0],[0,1], s=100, marker='o',facecolors='none', edgecolors='blue')

Neural Network with an input layer, a hidden layer, and an output layer

from IPython.display import Image
Image(filename='images/XNOR.png', width=500)


$x_{1},x_{2} \in \{0,1\}$

$y = \big(x_{1} \oplus x_{2}\big)'$

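| $x_{1}$ | $x_{2}$ | $a_{1}^{(2)} = x_{1} \wedge x_{2}$ | $a_{2}^{(2)} = \bar{x}_{1} \wedge \bar{x}_{2}$ | $a_{1}^{(2)} \vee a_{2}^{(2)} \approx h_{\Theta}(x)$ |
|---|---|---|---|---|
| 0 | 0 | 0 | 1 | 1 |
| 0 | 1 | 0 | 0 | 0 |
| 1 | 0 | 0 | 0 | 0 |
| 1 | 1 | 1 | 0 | 1 |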
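A sketch of the XNOR network in matrix form, stacking the AND and (NOT $x_{1}$) AND (NOT $x_{2}$) weights from the earlier examples into $\Theta^{(1)}$ and using the OR weights as $\Theta^{(2)}$. The function name `xnor` and the 0.5 threshold are assumptions made for the illustration.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

Theta1 = np.array([[-30.0, 20.0, 20.0],     # a_1^(2): x1 AND x2
                   [10.0, -20.0, -20.0]])   # a_2^(2): (NOT x1) AND (NOT x2)
Theta2 = np.array([[-10.0, 20.0, 20.0]])    # output: a_1^(2) OR a_2^(2)

def xnor(x1, x2):
    a1 = np.array([1.0, x1, x2])                        # input layer with bias
    a2 = np.concatenate(([1.0], sigmoid(Theta1 @ a1)))  # hidden layer with bias
    a3 = sigmoid(Theta2 @ a2)                           # output layer
    return a3[0]

for x1 in (0, 1):
    for x2 in (0, 1):
        print(x1, x2, int(xnor(x1, x2) > 0.5))
```

No single logistic unit can produce this output, since the two classes are not linearly separable; the hidden layer supplies the two intermediate features that make the final OR possible.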

Multiclass Classification

multiple output units: one-vs-all

$h_{\Theta}(x) \in \mathbb{R}^{4}$


when pedestrian, $h_{\Theta}(x) \approx \begin{bmatrix}1\\0\\0\\0\end{bmatrix}$

when car, $h_{\Theta}(x) \approx \begin{bmatrix}0\\1\\0\\0\end{bmatrix}$

when motorcycle, $h_{\Theta}(x) \approx \begin{bmatrix}0\\0\\1\\0\end{bmatrix}$

when truck, $h_{\Theta}(x) \approx \begin{bmatrix}0\\0\\0\\1\end{bmatrix}$

training set: $(x^{(1)},y^{(1)}), (x^{(2)},y^{(2)}), ... , (x^{(m)},y^{(m)})$

$y^{(i)}$ is one of the vectors above depending on the corresponding image $x^{(i)}$

so one training example will be one pair $(x^{(i)},y^{(i)})$

attempt to find a way for NN to output values so that $h_{\Theta}(x) \approx y^{(i)}$
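A small sketch of how the target vectors $y^{(i)}$ can be built and how a network output can be mapped back to a class. The `one_hot` helper and the example output vector `h` are hypothetical; the class order follows the four vectors above.

```python
import numpy as np

classes = ["pedestrian", "car", "motorcycle", "truck"]

def one_hot(label_index, num_classes):
    """Return a vector with a 1 at label_index and 0 elsewhere."""
    y = np.zeros(num_classes)
    y[label_index] = 1.0
    return y

# Target vector for a training image labelled "car".
y_car = one_hot(classes.index("car"), len(classes))
print(y_car)  # [0. 1. 0. 0.]

# Hypothetical network output for some image (made-up numbers).
h = np.array([0.05, 0.90, 0.10, 0.02])

# Predict the class whose output unit is most active.
predicted = classes[int(np.argmax(h))]
print(predicted)  # car
```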

MNIST data

5000 training examples, where each example is a 20 pixel by 20 pixel grayscale image of a handwritten digit (0–9)

each pixel is represented by a floating-point number indicating the grayscale intensity at that pixel location

thus, each training example in the data set is a 400-dimensional vector, which becomes a single row in a $5000 \times 400$ matrix $X$

$X = \begin{bmatrix}--- (x^{(1)})^{T} ---\\--- (x^{(2)})^{T} ---\\.\\.\\.\\--- (x^{(m)})^{T} ---\end{bmatrix}$

$y$ is a 5000-dimensional vector that contains labels for the training set

X = np.loadtxt('X.csv', delimiter=',')
y = np.loadtxt('y.csv', delimiter=',')
x = X[4999]
x.shape = (20,20)
print(x.shape)

# random_integers is no longer available in current NumPy; randint excludes the upper bound
indexes = np.random.randint(0, 5000, size=100)
from mpl_toolkits.axes_grid1 import ImageGrid

im = np.arange(100)
im.shape = 10, 10

fig = plt.figure()
grid = ImageGrid(fig, 111,  # similar to subplot(111)
                 nrows_ncols=(10, 10),
                 axes_pad=0)  # pad between axes, in inches

# show 100 randomly chosen digits, each reshaped from 400 values to 20x20
for i in range(100):
    grid[i].imshow(X[indexes[i]].reshape((20, 20)), interpolation='nearest')

Vectorizing Logistic Regression

use multiple one-vs-all logistic regression models to build a multi-class classifier

since 10 classes, need to train 10 separate logistic regression classifiers
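A rough sketch of the one-vs-all idea on a tiny synthetic problem with 3 classes instead of 10. Everything here is an assumption made for illustration: the data are made-up clusters, labels are taken to be $0, \dots, K-1$, and plain batch gradient descent stands in for whatever optimizer is actually used to fit each classifier.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def train_binary(X, y, lr=0.1, iters=2000):
    """Fit one logistic regression classifier by batch gradient descent."""
    m = X.shape[0]
    theta = np.zeros(X.shape[1])
    for _ in range(iters):
        grad = X.T @ (sigmoid(X @ theta) - y) / m
        theta -= lr * grad
    return theta

def train_one_vs_all(X, y, num_classes):
    """One row of parameters per class; class c is trained against all others."""
    return np.vstack([train_binary(X, (y == c).astype(float))
                      for c in range(num_classes)])

def predict(all_theta, X):
    """Pick the class whose classifier reports the highest probability."""
    return np.argmax(sigmoid(X @ all_theta.T), axis=1)

# Synthetic data: three well-separated clusters of 20 points each.
rng = np.random.default_rng(0)
centers = np.array([[0.0, 0.0], [5.0, 0.0], [0.0, 5.0]])
y = np.repeat(np.arange(3), 20)
features = centers[y] + 0.3 * rng.normal(size=(60, 2))
X = np.hstack([np.ones((60, 1)), features])   # bias column x_0 = 1

all_theta = train_one_vs_all(X, y, num_classes=3)
accuracy = np.mean(predict(all_theta, X) == y)
print(all_theta.shape, accuracy)
```

For the digit data the same structure would give a parameter matrix with 10 rows, one per digit class.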

$X = \begin{bmatrix}--- (x^{(1)})^{T} ---\\--- (x^{(2)})^{T} ---\\.\\.\\.\\--- (x^{(m)})^{T} ---\end{bmatrix}$

$\theta = \begin{bmatrix}\theta_{0}\\\theta_{1}\\.\\.\\.\\\theta_{n}\end{bmatrix}$

matrix product ($a^{T}b = b^{T}a$, if $a$ and $b$ are vectors)
$X\theta = \begin{bmatrix}--- (x^{(1)})^{T}\theta ---\\--- (x^{(2)})^{T}\theta ---\\.\\.\\.\\--- (x^{(m)})^{T}\theta ---\end{bmatrix} = \begin{bmatrix}--- \theta^{T}(x^{(1)}) ---\\--- \theta^{T}(x^{(2)}) ---\\.\\.\\.\\--- \theta^{T}(x^{(m)}) ---\end{bmatrix}$

for each $i$, calculate $\theta^{T}(x^{(i)})$
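A quick check that the single matrix product $X\theta$ gives the same values as computing $\theta^{T}x^{(i)}$ one example at a time. The data are random placeholders with hypothetical sizes $m = 5$, $n = 3$.

```python
import numpy as np

rng = np.random.default_rng(0)
m, n = 5, 3

X = rng.normal(size=(m, n + 1))
X[:, 0] = 1.0                      # bias column x_0 = 1
theta = rng.normal(size=n + 1)

# One example at a time: theta^T x^(i) for each row of X.
looped = np.array([theta @ X[i] for i in range(m)])

# All examples at once.
vectorized = X @ theta

print(np.allclose(looped, vectorized))  # True
```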

unregularized cost function for logistic regression:
$J(\theta) = \frac{1}{m}\sum\limits_{i=1}^m\big[-y^{(i)}\log\big(h_{\theta}(x^{(i)})\big)-(1-y^{(i)})\log\big(1 - h_{\theta}(x^{(i)})\big)\big]$

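# hypothesis, sigmoid function
def g(z):
    return sp.special.expit(z)

# unregularized logistic regression cost; y must hold 0/1 labels
def J(m, y, X, theta):
    z = X.dot(theta)  # matrix product X theta, one entry per training example
    h = g(z)
    total = (-y).dot(np.log(h)) - (1 - y).dot(np.log(1 - h))
    return 1.0 / m * total

# one parameter per column of X (not one per training example)
theta = np.zeros(X.shape[1])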


Vectorizing the Gradient

the gradient of the unregularized logistic regression cost is a vector where the $j$-th element is defined as:
$\frac{\partial{J}}{\partial{\theta_{j}}} = \frac{1}{m}\sum\limits_{i=1}^{m}\big((h_{\theta}(x^{(i)})-y^{(i)})x_{j}^{(i)}\big)$

all partial derivatives for $\theta_{j}$:
$\begin{bmatrix} \frac{\partial{J}}{\partial{\theta_{0}}}\\ \frac{\partial{J}}{\partial{\theta_{1}}}\\ \frac{\partial{J}}{\partial{\theta_{2}}}\\ \vdots\\ \frac{\partial{J}}{\partial{\theta_{n}}} \end{bmatrix}=\frac{1}{m} \begin{bmatrix} \sum\limits_{i=1}^{m}\big((h_{\theta}(x^{(i)})-y^{(i)})x_{0}^{(i)}\big)\\ \sum\limits_{i=1}^{m}\big((h_{\theta}(x^{(i)})-y^{(i)})x_{1}^{(i)}\big)\\ \sum\limits_{i=1}^{m}\big((h_{\theta}(x^{(i)})-y^{(i)})x_{2}^{(i)}\big)\\ \vdots\\ \sum\limits_{i=1}^{m}\big((h_{\theta}(x^{(i)})-y^{(i)})x_{n}^{(i)}\big) \end{bmatrix}$



where $h_{\theta}(x)-y = \begin{bmatrix} h_{\theta}(x^{(1)})-y^{(1)}\\ h_{\theta}(x^{(2)})-y^{(2)}\\ \vdots\\ h_{\theta}(x^{(m)})-y^{(m)}\\ \end{bmatrix}$

$x^{(i)}$ is a vector
$h_{\theta}(x^{(i)})-y^{(i)}$ is a scalar
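Since each $h_{\theta}(x^{(i)})-y^{(i)}$ is a scalar that weights the vector $x^{(i)}$, the sums above collapse into one matrix product, $\nabla_{\theta}J = \frac{1}{m}X^{T}\big(h_{\theta}(x)-y\big)$. A sketch that compares this vectorized gradient against a central finite-difference estimate of the cost; the data are random placeholders, and `cost` and `gradient` are names introduced here.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def cost(theta, X, y):
    """Unregularized logistic regression cost J(theta)."""
    h = sigmoid(X @ theta)
    return np.mean(-y * np.log(h) - (1 - y) * np.log(1 - h))

def gradient(theta, X, y):
    """Vectorized gradient: (1/m) X^T (h - y)."""
    m = X.shape[0]
    return X.T @ (sigmoid(X @ theta) - y) / m

# Random placeholder data: m = 8 examples, n = 3 features plus a bias column.
rng = np.random.default_rng(0)
m, n = 8, 3
X = np.hstack([np.ones((m, 1)), rng.normal(size=(m, n))])
y = rng.integers(0, 2, size=m).astype(float)
theta = 0.1 * rng.normal(size=n + 1)

# Central differences, one coordinate direction at a time.
eps = 1e-6
numeric = np.array([
    (cost(theta + eps * e, X, y) - cost(theta - eps * e, X, y)) / (2 * eps)
    for e in np.eye(n + 1)
])

print(np.allclose(gradient(theta, X, y), numeric, atol=1e-6))  # True
```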