In [18]:
import numpy as np
import scipy as sp
from scipy import special
import scipy.optimize as op
import pandas as pd
import matplotlib as mpl
import matplotlib.pyplot as plt
%matplotlib inline
#help(np.arange)
#np.arange?
In [19]:
from IPython.display import Image
Image(filename='images/neuron_model.png')
Out[19]:
input wires ("dendrites") -> body of neuron -> output ("axon")
$h_{\theta}(x) = g(\theta^{T}x) = \frac{1}{1+e^{-\theta^{T}x}}$
(thus, a simulated neuron with a sigmoid (logistic) activation function)
$x = \begin{bmatrix}x_{0}\\x_{1}\\x_{2}\\x_{3}\end{bmatrix}$ $\theta = \begin{bmatrix}\theta_{0}\\\theta_{1}\\\theta_{2}\\\theta_{3}\end{bmatrix}$
bias unit: $x_{0} = 1$
weights: $\theta$
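As a quick sketch (not in the original notes), this single sigmoid neuron can be evaluated directly in NumPy; the weight and input values below are arbitrary placeholders.
In [ ]:
import numpy as np
from scipy import special

# single sigmoid neuron: h_theta(x) = g(theta^T x)
theta = np.array([-1.0, 0.5, 2.0, -0.5])   # placeholder weights theta_0 .. theta_3
x = np.array([1.0, 0.2, 0.7, 0.1])         # inputs, with bias unit x_0 = 1
h = special.expit(theta.dot(x))            # sigmoid of the weighted sum
print(h)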
In [20]:
from IPython.display import Image
Image(filename='images/neural_net.png')
Out[20]:
A neural network is a group of these single neurons strung together
Layer 1: Input Layer
Layer 2: Hidden Layer
Layer 3: Output Layer
$a_{i}^{(j)} = $ "activation" of neuron/unit $i$ in layer $j$
$\Theta^{(j)} = $ matrix of weights controlling function mapping from layer $j$ to layer $j + 1$
sigmoid/logistic activation function applied to linear combinations of inputs
$a_{1}^{(2)} = g(\Theta_{10}^{(1)}x_{0} + \Theta_{11}^{(1)}x_{1} + \Theta_{12}^{(1)}x_{2} + \Theta_{13}^{(1)}x_{3})$
$a_{2}^{(2)} = g(\Theta_{20}^{(1)}x_{0} + \Theta_{21}^{(1)}x_{1} + \Theta_{22}^{(1)}x_{2} + \Theta_{23}^{(1)}x_{3})$
$a_{3}^{(2)} = g(\Theta_{30}^{(1)}x_{0} + \Theta_{31}^{(1)}x_{1} + \Theta_{32}^{(1)}x_{2} + \Theta_{33}^{(1)}x_{3})$
$h_{\Theta}(x) = a_{1}^{(3)} = g(\Theta_{10}^{(2)}a_{0}^{(2)} + \Theta_{11}^{(2)}a_{1}^{(2)} + \Theta_{12}^{(2)}a_{2}^{(2)} + \Theta_{13}^{(2)}a_{3}^{(2)})$
for 3 input units and 3 hidden units, $\Theta^{(1)} \in \mathbb{R}^{3\times4}$
if network has $s_{j}$ units in layer $j$, $s_{j + 1}$ units in layer $j + 1$, then $\Theta^{(j)}$ will be of dimension $s_{j + 1} \times (s_{j} + 1)$.
from above example:
for $j = 1$ (i.e., $\Theta^{(1)}$):
$s_{j} = s_{1} = 3$ units in layer $j = 1$
$s_{j+1} = s_{2} = 3$ units in layer $j + 1 = 2$
$\rightarrow \Theta^{(1)} \in \mathbb{R}^{s_{2} \times (s_{1} + 1)} = \mathbb{R}^{3\times4}$
note: for example, $\Theta^{(2)}$ should be interpreted as the matrix of parameters/weights that controls the function that maps from the hidden units (in layer 2) to the one (layer 3) output unit
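As an illustrative check (not from the original notes), the dimension rule $s_{j+1} \times (s_{j} + 1)$ can be verified with placeholder weight matrices for the 3-3-1 network above.
In [ ]:
import numpy as np

s1, s2, s3 = 3, 3, 1               # units per layer, not counting bias units
Theta1 = np.zeros((s2, s1 + 1))    # maps layer 1 -> layer 2
Theta2 = np.zeros((s3, s2 + 1))    # maps layer 2 -> layer 3
print(Theta1.shape, Theta2.shape)  # (3, 4) (1, 4)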
the above artificial neural network defines a function $h$ that maps inputs $x$ to a space of predictions $y$; as $\Theta$ is varied, different hypotheses $h$ result
$z$ values are weighted linear combinations of $x$ values that go into a particular neuron/activation unit
for example:
$z_{1}^{(2)} = \Theta_{10}^{(1)}x_{0} + \Theta_{11}^{(1)}x_{1} + \Theta_{12}^{(1)}x_{2} + \Theta_{13}^{(1)}x_{3}$
thus:
$a_{1}^{(2)} = g(\Theta_{10}^{(1)}x_{0} + \Theta_{11}^{(1)}x_{1} + \Theta_{12}^{(1)}x_{2} + \Theta_{13}^{(1)}x_{3}) = g(z_{1}^{(2)})$
$a_{2}^{(2)} = g(\Theta_{20}^{(1)}x_{0} + \Theta_{21}^{(1)}x_{1} + \Theta_{22}^{(1)}x_{2} + \Theta_{23}^{(1)}x_{3}) = g(z_{2}^{(2)})$
$a_{3}^{(2)} = g(\Theta_{30}^{(1)}x_{0} + \Theta_{31}^{(1)}x_{1} + \Theta_{32}^{(1)}x_{2} + \Theta_{33}^{(1)}x_{3}) = g(z_{3}^{(2)})$
$h_{\Theta}(x) = a_{1}^{(3)} = g(\Theta_{10}^{(2)}a_{0}^{(2)} + \Theta_{11}^{(2)}a_{1}^{(2)} + \Theta_{12}^{(2)}a_{2}^{(2)} + \Theta_{13}^{(2)}a_{3}^{(2)})$
$x = \begin{bmatrix}x_{0}\\x_{1}\\x_{2}\\x_{3}\end{bmatrix}$ $z^{(2)} = \begin{bmatrix}z_{1}^{(2)}\\z_{2}^{(2)}\\z_{3}^{(2)}\end{bmatrix}$
$z^{(2)}$ is a 3-dimensional vector (i.e., $z^{(2)} \in \mathbb{R}^{3}$), as is $a^{(2)}$ (before the bias unit is added below)
$z^{(2)} = \Theta^{(1)}x$
$a^{(2)} = g(z^{(2)})$
$g$ applies the sigmoid function element-wise to each of $z^{(2)}$'s elements
defining $a^{(1)}$ as the activations of the first (input) layer gives $a^{(1)} = x$
thus: $z^{(2)} = \Theta^{(1)}a^{(1)}$
add bias unit $a_{0}^{(2)} = 1$, making $a^{(2)} \in \mathbb{R}^4$
$z^{(3)} = \Theta^{(2)}a^{(2)}$
$h_{\Theta}(x) = a^{(3)} = g(z^{(3)})$
the process of computing $h_{\Theta}(x)$ is called forward propagation because the input activations are propagated forward to the hidden layer, whose activations are in turn propagated forward to the output layer
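A minimal forward-propagation sketch for the 3-3-1 network above; the weights are randomly initialized placeholders (trained values would come from the learning procedure discussed later).
In [ ]:
import numpy as np
from scipy import special

g = special.expit                      # element-wise sigmoid

x = np.array([0.5, -1.2, 3.0])         # raw inputs x_1 .. x_3
Theta1 = np.random.randn(3, 4)         # Theta^(1): layer 1 -> layer 2
Theta2 = np.random.randn(1, 4)         # Theta^(2): layer 2 -> layer 3

a1 = np.concatenate(([1.0], x))        # a^(1) = x with bias unit x_0 = 1
z2 = Theta1.dot(a1)                    # z^(2) = Theta^(1) a^(1)
a2 = np.concatenate(([1.0], g(z2)))    # a^(2) = g(z^(2)), plus bias a_0^(2) = 1
z3 = Theta2.dot(a2)                    # z^(3) = Theta^(2) a^(2)
h = g(z3)                              # h_Theta(x) = a^(3)
print(h)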
note: to negate a variable, give it a large negative weight (as in the NOT and NOR examples below)
$x_{1},x_{2} \in \{0,1\}$
$y = x_{1} \wedge x_{2}$
$\Theta^{(1)} = \begin{bmatrix}-30\\20\\20\end{bmatrix}$
$\Theta_{10}^{(1)} = -30$
$\Theta_{11}^{(1)} = 20$
$\Theta_{12}^{(1)} = 20$
$h_{\Theta}(x) = g(\Theta_{10}^{(1)} + \Theta_{11}^{(1)}x_{1} + \Theta_{12}^{(1)}x_{2}) = g(-30 + 20x_{1} + 20x_{2})$
$x_{1}$ | $x_{2}$ | $x_{1} \wedge x_{2} \approx h_{\Theta}(x)$ |
---|---|---|
0 | 0 | $0 \approx g\big[-30(1) + 20(0) + 20(0) \big] = g(-30)$ |
0 | 1 | $0 \approx g\big[-30(1) + 20(0) + 20(1) \big] = g(-10)$ |
1 | 0 | $0 \approx g\big[-30(1) + 20(1) + 20(0) \big] = g(-10)$ |
1 | 1 | $1 \approx g\big[-30(1) + 20(1) + 20(1) \big] = g(10)$ |
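This truth table can be reproduced numerically; a quick sketch using the AND weights above.
In [ ]:
import numpy as np
from scipy import special

Theta1 = np.array([-30., 20., 20.])                  # [Theta_10, Theta_11, Theta_12]
for x1 in (0, 1):
    for x2 in (0, 1):
        h = special.expit(Theta1.dot([1, x1, x2]))   # bias x_0 = 1
        print(x1, x2, h)                             # h ~ 0 except when x1 = x2 = 1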
In [21]:
from IPython.display import Image
Image(filename='images/AND.png', width=400)
Out[21]:
In [22]:
# hypothesis, sigmoid function
def g(z):
    return special.expit(z)

x = np.arange(-10, 10)
plt.plot(x, g(x))
plt.show()
$x_{1} \in \{0,1\}$
$y = \bar{x}_{1}$
$\Theta^{(1)} = \begin{bmatrix}10\\-20\end{bmatrix}$
$\Theta_{10}^{(1)} = 10$
$\Theta_{11}^{(1)} = -20$
$h_{\Theta}(x) = g(\Theta_{10}^{(1)} + \Theta_{11}^{(1)}x_{1}) = g(10 + -20x_{1})$
$x_{1}$ | $\bar{x}_{1} \approx h_{\Theta}(x)$ |
---|---|
0 | $1 \approx g\big[10(1) + -20(0) \big] = g(10)$ |
1 | $0 \approx g\big[10(1) + -20(1) \big] = g(-10)$ |
$x_{1},x_{2} \in \{0,1\}$
$y = \bar{x}_{1} \wedge \bar{x}_{2}$
$\Theta^{(1)} = \begin{bmatrix}10\\-20\\-20\end{bmatrix}$
$\Theta_{10}^{(1)} = 10$
$\Theta_{11}^{(1)} = -20$
$\Theta_{12}^{(1)} = -20$
$h_{\Theta}(x) = g(\Theta_{10}^{(1)} + \Theta_{11}^{(1)}x_{1} + \Theta_{12}^{(1)}x_{2}) = g(10 + -20x_{1} + -20x_{2})$
$x_{1}$ | $x_{2}$ | $\bar{x}_{1}$ | $\bar{x}_{2}$ | $\bar{x}_{1} \wedge \bar{x}_{2} \approx h_{\Theta}(x)$ |
---|---|---|---|---|
0 | 0 | 1 | 1 | $1 \approx g\big[10(1) + -20(0) + -20(0) \big] = g(10)$ |
0 | 1 | 1 | 0 | $0 \approx g\big[10(1) + -20(0) + -20(1) \big] = g(-10)$ |
1 | 0 | 0 | 1 | $0 \approx g\big[10(1) + -20(1) + -20(0) \big] = g(-10)$ |
1 | 1 | 0 | 0 | $0 \approx g\big[10(1) + -20(1) + -20(1) \big] = g(-30)$ |
$x_{1},x_{2} \in \{0,1\}$
$y = x_{1} \vee x_{2}$
$\Theta^{(1)} = \begin{bmatrix}-10\\20\\20\end{bmatrix}$
$\Theta_{10}^{(1)} = -10$
$\Theta_{11}^{(1)} = 20$
$\Theta_{12}^{(1)} = 20$
$h_{\Theta}(x) = g(\Theta_{10}^{(1)} + \Theta_{11}^{(1)}x_{1} + \Theta_{12}^{(1)}x_{2}) = g(-10 + 20x_{1} + 20x_{2})$
$x_{1}$ | $x_{2}$ | $x_{1} \vee x_{2} \approx h_{\Theta}(x)$ |
---|---|---|
0 | 0 | $0 \approx g\big[-10(1) + 20(0) + 20(0) \big] = g(-10)$ |
0 | 1 | $1 \approx g\big[-10(1) + 20(0) + 20(1) \big] = g(10)$ |
1 | 0 | $1 \approx g\big[-10(1) + 20(1) + 20(0) \big] = g(10)$ |
1 | 1 | $1 \approx g\big[-10(1) + 20(1) + 20(1) \big] = g(30)$ |
$x_{1},x_{2}$ are binary ($0$ or $1$)
In [23]:
ax = plt.subplot(111)
ax.spines['left'].set_position('zero')
ax.spines['bottom'].set_position('zero')
# the four binary input combinations: x's at (0,0) and (1,1), o's at (1,0) and (0,1)
# -- the two classes are not linearly separable
plt.scatter([0,1],[0,1], s=100, marker='x', c='red')
plt.scatter([1,0],[0,1], s=100, marker='o', facecolors='none', edgecolors='blue')
plt.xticks([0,1])
plt.yticks([0,1])
plt.show()
In [24]:
from IPython.display import Image
Image(filename='images/XNOR.png', width=500)
Out[24]:
$x_{1},x_{2} \in \{0,1\}$
$y = \overline{x_{1} \oplus x_{2}}$ (i.e., $x_{1}$ XNOR $x_{2}$)
$x_{1}$ | $x_{2}$ | $a_{1}^{(2)} = x_{1} \wedge x_{2}$ | $a_{2}^{(2)} = \bar{x}_{1} \wedge \bar{x}_{2}$ | $a_{1}^{(2)} \vee a_{2}^{(2)} \approx h_{\Theta}(x)$ |
---|---|---|---|---|
0 | 0 | 0 | 1 | 1 |
0 | 1 | 0 | 0 | 0 |
1 | 0 | 0 | 0 | 0 |
1 | 1 | 1 | 0 | 1 |
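Assembling the AND, NOR ($\bar{x}_{1} \wedge \bar{x}_{2}$), and OR weights from above into a two-layer network reproduces the XNOR table; a sketch:
In [ ]:
import numpy as np
from scipy import special

g = special.expit
Theta1 = np.array([[-30.,  20.,  20.],     # row 1: AND  -> a_1^(2)
                   [ 10., -20., -20.]])    # row 2: NOR  -> a_2^(2)
Theta2 = np.array([-10., 20., 20.])        # OR of a_1^(2), a_2^(2) -> output

for x1 in (0, 1):
    for x2 in (0, 1):
        a1 = np.array([1., x1, x2])                      # input with bias
        a2 = np.concatenate(([1.], g(Theta1.dot(a1))))   # hidden layer with bias
        h = g(Theta2.dot(a2))                            # h ~ 1 exactly when x1 == x2
        print(x1, x2, h)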
for multi-class classification (here with four classes: pedestrian, car, motorcycle, truck), the network has four output units, so $h_{\Theta}(x) \in \mathbb{R}^{4}$
want:
when pedestrian, $h_{\Theta}(x) \approx \begin{bmatrix}1\\0\\0\\0\end{bmatrix}$
when car, $h_{\Theta}(x) \approx \begin{bmatrix}0\\1\\0\\0\end{bmatrix}$
when motorcycle, $h_{\Theta}(x) \approx \begin{bmatrix}0\\0\\1\\0\end{bmatrix}$
when truck, $h_{\Theta}(x) \approx \begin{bmatrix}0\\0\\0\\1\end{bmatrix}$
training set: $(x^{(1)},y^{(1)}), (x^{(2)},y^{(2)}), ... , (x^{(m)},y^{(m)})$
$y^{(i)}$ is one of the vectors above depending on the corresponding image $x^{(i)}$
so one training example will be one pair $(x^{(i)},y^{(i)})$
the goal is to find parameters $\Theta$ so that the network outputs $h_{\Theta}(x^{(i)}) \approx y^{(i)}$
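A small sketch (the label encoding below is an assumption for illustration) of building the one-hot target vectors $y^{(i)}$ from integer class labels:
In [ ]:
import numpy as np

labels = np.array([0, 2, 1, 3])    # hypothetical encoding: 0=pedestrian, 1=car, 2=motorcycle, 3=truck
num_classes = 4
Y = np.eye(num_classes)[labels]    # each row is the one-hot vector y^(i)
print(Y)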
MNIST data
5000 training examples, where each example is a 20 pixel by 20 pixel grayscale image of a handwritten digit (0–9)
each pixel is represented by a floating-point number indicating the grayscale intensity at that pixel location
thus, each training example in the data set is a 400-dimensional vector, which becomes a single row in a $5000 \times 400$ matrix $X$
$X = \begin{bmatrix}--- (x^{(1)})^{T} ---\\--- (x^{(2)})^{T} ---\\.\\.\\.\\--- (x^{(m)})^{T} ---\end{bmatrix}$
$y$ is a 5000-dimensional vector that contains labels for the training set
In [25]:
X = np.loadtxt('X.csv', delimiter=',')
y = np.loadtxt('y.csv', delimiter=',')
x = X[4999].reshape(20, 20)   # last training example as a 20x20 image
print(x.shape)
plt.matshow(x, cmap=mpl.cm.gray_r)
plt.show()
In [26]:
from mpl_toolkits.axes_grid1 import ImageGrid

# display 100 randomly chosen training examples in a 10 x 10 grid
indexes = np.random.randint(0, 5000, size=100)
fig = plt.figure()
grid = ImageGrid(fig, 111,              # similar to subplot(111)
                 nrows_ncols=(10, 10),
                 axes_pad=0)            # pad between axes, in inches
for i in range(100):
    grid[i].imshow(X[indexes[i]].reshape((20, 20)),
                   cmap=mpl.cm.gray_r, interpolation='nearest')
    grid[i].set_axis_off()
plt.show()
use multiple one-vs-all logistic regression models to build a multi-class classifier
since there are 10 classes, 10 separate logistic regression classifiers need to be trained
$X = \begin{bmatrix}--- (x^{(1)})^{T} ---\\--- (x^{(2)})^{T} ---\\.\\.\\.\\--- (x^{(m)})^{T} ---\end{bmatrix}$
$\theta = \begin{bmatrix}\theta_{0}\\\theta_{1}\\.\\.\\.\\\theta_{n}\end{bmatrix}$
matrix product ($a^{T}b = b^{T}a$, if $a$ and $b$ are vectors)
$X\theta = \begin{bmatrix}--- (x^{(1)})^{T}\theta ---\\--- (x^{(2)})^{T}\theta ---\\.\\.\\.\\--- (x^{(m)})^{T}\theta ---\end{bmatrix} = \begin{bmatrix}--- \theta^{T}(x^{(1)}) ---\\--- \theta^{T}(x^{(2)}) ---\\.\\.\\.\\--- \theta^{T}(x^{(m)}) ---\end{bmatrix}$
i.e., the single product $X\theta$ computes $\theta^{T}x^{(i)}$ for each $i$ in one operation
unregularized cost function for logistic regression:
$J(\theta) = \frac{1}{m}\sum\limits_{i=1}^m\big[-y^{(i)}\log\big(h_{\theta}(x^{(i)})\big)-(1-y^{(i)})\log\big(1 - h_{\theta}(x^{(i)})\big)\big]$
In [27]:
def g(z):
    return sp.special.expit(z)

def J(m, y, X, theta):
    # z = X theta, i.e., theta^T x^(i) for every training example i
    z = X.dot(theta)
    # unregularized logistic regression cost, vectorized over the m examples
    cost = (-y).dot(np.log(g(z))) - (1 - y).dot(np.log(1 - g(z)))
    return (1.0 / m) * cost
In [28]:
theta = np.zeros(X.shape[1])   # one parameter per feature/pixel (no bias column added here)
J(X.shape[0], y, X, theta)
Out[28]:
the gradient of the unregularized logistic regression cost is a vector where the $j$-th element is defined as:
$\frac{\partial{J}}{\partial{\theta_{j}}} = \frac{1}{m}\sum\limits_{i=1}^{m}\big((h_{\theta}(x^{(i)})-y^{(i)})x_{j}^{(i)}\big)$
stacking the partial derivatives for all $\theta_{j}$:
$\begin{bmatrix}
\frac{\partial{J}}{\partial{\theta_{0}}}\\
\frac{\partial{J}}{\partial{\theta_{1}}}\\
\frac{\partial{J}}{\partial{\theta_{2}}}\\
\vdots\\
\frac{\partial{J}}{\partial{\theta_{n}}}
\end{bmatrix}=\frac{1}{m}
\begin{bmatrix}
\sum\limits_{i=1}^{m}\big((h_{\theta}(x^{(i)})-y^{(i)})x_{0}^{(i)}\big)\\
\sum\limits_{i=1}^{m}\big((h_{\theta}(x^{(i)})-y^{(i)})x_{1}^{(i)}\big)\\
\sum\limits_{i=1}^{m}\big((h_{\theta}(x^{(i)})-y^{(i)})x_{2}^{(i)}\big)\\
\vdots\\
\sum\limits_{i=1}^{m}\big((h_{\theta}(x^{(i)})-y^{(i)})x_{n}^{(i)}\big)
\end{bmatrix}$
$=\frac{1}{m}\sum\limits_{i=1}^{m}\big((h_{\theta}(x^{(i)})-y^{(i)})\,x^{(i)}\big)$
$=\frac{1}{m}X^{T}(h_{\theta}(x)-y)$
where $h_{\theta}(x)-y = \begin{bmatrix} h_{\theta}(x^{(1)})-y^{(1)}\\ h_{\theta}(x^{(2)})-y^{(2)}\\ \vdots\\ h_{\theta}(x^{(m)})-y^{(m)}\\ \end{bmatrix}$
note:
$x^{(i)}$ is a vector
$h_{\theta}(x^{(i)})-y^{(i)}$ is a scalar
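Following the vectorized form $\frac{1}{m}X^{T}\big(h_{\theta}(x)-y\big)$ above, a sketch of the gradient computation (it assumes binary labels in $y$ and follows the calling convention of the J function defined earlier):
In [ ]:
import numpy as np
from scipy import special

def gradient(m, y, X, theta):
    # h_theta(x^(i)) - y^(i) for every training example, as a length-m vector
    errors = special.expit(X.dot(theta)) - y
    # (1/m) * X^T (h_theta(x) - y): one partial derivative per parameter theta_j
    return (1.0 / m) * X.T.dot(errors)
For a one-vs-all classifier, $y$ here would be the binary indicator vector for a single class rather than the raw digit labels.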