In [1]:
# %load /Users/facaiyan/Study/book_notes/preconfig.py
%matplotlib inline
import matplotlib.pyplot as plt
import seaborn as sns
sns.set(color_codes=True)
plt.rcParams['axes.grid'] = False
import numpy as np
#from IPython.display import SVG
def show_image(filename, figsize=None, res_dir=True):
    if figsize:
        plt.figure(figsize=figsize)
    if res_dir:
        filename = './res/{}'.format(filename)
    plt.imshow(plt.imread(filename))
feedforward: no feedback connections.
input layer -> hidden layers -> output layer
\begin{align} y &= f(x; \theta) \approx f^*(x) \\ &= W^T \phi(x) + b \end{align}
How to choose the mapping $\phi$?
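A minimal sketch of the key idea: instead of hand-designing $\phi$, the network learns it. Below, a hypothetical one-hidden-layer network computes $\phi(x) = g(W_1^T x + b_1)$ with ReLU hidden units, then applies the linear output $w^T \phi(x) + b$; all shapes and values are illustrative assumptions, not from the text.
In [ ]:
# Illustrative sketch: the hidden layer *is* the learned mapping phi.
rng = np.random.RandomState(0)
x = rng.randn(4)                         # a single input vector (dim 4, arbitrary)
W1, b1 = rng.randn(4, 3), rng.randn(3)   # parameters of phi (hidden layer)
w2, b2 = rng.randn(3), rng.randn()       # parameters of the linear output layer

phi = lambda v: np.maximum(0, W1.T @ v + b1)   # phi(x) = g(W1^T x + b1), ReLU
y_hat = w2 @ phi(x) + b2                       # y = w^T phi(x) + b
print(y_hat)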
In [2]:
show_image("fig6_2.png", figsize=(5, 8))
default activation function is the rectified linear unit (ReLU): $g(z) = \max\{0, z\}$
In [3]:
relu = lambda x: np.maximum(0, x)
x = np.linspace(-2, 2, 1000)
y = relu(x)
plt.ylim([-1, 3])
plt.grid(True)
plt.plot(x, y)
Out[3]:
important:
In most cases, training with maximum likelihood $\implies$ cross-entropy as the cost function.
maximum likelihood in neural networks => the cost function is simply the negative log-likelihood, i.e. the cross-entropy between the training data and the model distribution.
\begin{equation} J(\theta) = - \mathbb{E}_{x, y \sim \hat{p}_{data}} \log p_{model}(y | x) \end{equation}
advantage: specifying the model $p(y | x)$ automatically determines the cost function $-\log p(y | x)$, removing the burden of designing a cost function for each model.
unusual property of the cross-entropy cost: it usually does not have a minimum value (it can approach negative infinity) => regularization is needed.
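A tiny numpy sketch of estimating $J(\theta)$ on a batch: the cross-entropy is just the average of $-\log p_{model}(y \mid x)$ over the examples. The probabilities and labels below are made up for illustration.
In [ ]:
# Toy batch: rows are p_model(. | x) for three examples, y holds the true classes.
p = np.array([[0.7, 0.2, 0.1],
              [0.1, 0.8, 0.1],
              [0.3, 0.3, 0.4]])
y = np.array([0, 1, 2])

# J(theta) = - E_{x,y ~ p_data} log p_model(y | x), estimated by the batch average.
nll = -np.mean(np.log(p[np.arange(len(y)), y]))
print(nll)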
cost L2: learns the mean of $y$ given $x$.
cost L1: learns the median of $y$ given $x$.
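A quick numerical check of these two facts with a made-up sample: the constant prediction minimizing squared (L2) error is the mean, while the one minimizing absolute (L1) error is the median.
In [ ]:
# For a fixed x, predict a single constant c for the observed samples of y.
y = np.array([0.0, 0.0, 1.0, 2.0, 10.0])      # made-up samples of y | x
c = np.linspace(-5, 15, 2001)                 # candidate constant predictions

l2 = ((y[:, None] - c) ** 2).mean(axis=0)     # mean squared error per candidate
l1 = np.abs(y[:, None] - c).mean(axis=0)      # mean absolute error per candidate

print(c[l2.argmin()], y.mean())       # L2 minimizer matches the mean (2.6)
print(c[l1.argmin()], np.median(y))   # L1 minimizer matches the median (1.0)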
Linear output layers are often used to produce the mean of a conditional Gaussian distribution.
In other words, given $x$, the corresponding samples of $y$ are assumed to be Gaussian distributed; a linear model then learns to predict exactly the sample mean, $f(x) = \bar{y}$. This is the common setting for regression problems.
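A minimal regression sketch on toy data: fitting $\hat{y} = w^T h + b$ by least squares is exactly maximum likelihood for a conditional Gaussian, so the linear output ends up predicting the conditional mean. The data-generating numbers below are assumptions for illustration.
In [ ]:
# Toy data: y = 2*h + 1 + Gaussian noise; the linear output should recover the mean.
rng = np.random.RandomState(0)
h = rng.randn(200, 1)                            # "features" from earlier layers
y = 2.0 * h[:, 0] + 1.0 + 0.1 * rng.randn(200)

H = np.hstack([h, np.ones((200, 1))])            # append a bias column
w, b = np.linalg.lstsq(H, y, rcond=None)[0]      # least squares == Gaussian MLE
print(w, b)                                      # close to the true 2 and 1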
binary classification
\begin{align} P(y) &= \sigma((2y - 1) z) \quad \text{where } z = w^T h + b \\ J(\theta) &= - \log P(y | x) \quad \text{(the log undoes the exp in the sigmoid)} \\ &= \zeta ((1 - 2y) z) \end{align}
Maximum likelihood is almost always the preferred approach to training sigmoid output units.
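A small numerical check of the identity above (with an illustrative $z$ and label): computing $-\log \sigma((2y-1)z)$ directly and via the softplus $\zeta(x) = \log(1 + e^x)$ gives the same loss, which only saturates when the answer is already correct.
In [ ]:
sigmoid  = lambda x: 1.0 / (1.0 + np.exp(-x))
softplus = lambda x: np.log1p(np.exp(x))      # zeta(x) = log(1 + exp(x))

z, y = 2.5, 1            # illustrative z = w^T h + b and label y in {0, 1}

loss_direct   = -np.log(sigmoid((2 * y - 1) * z))   # -log P(y | x)
loss_softplus = softplus((1 - 2 * y) * z)           # zeta((1 - 2y) z), same value
print(loss_direct, loss_softplus)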
multiclass classification
\begin{equation} \operatorname{softmax}(z)_i = \frac{\exp(z_i)}{\sum_j \exp(z_j)} \end{equation}
\begin{align} \log \operatorname{softmax}(z)_i &= z_i - \log \sum_j \exp(z_j) \\ & \approx z_i - \max_j (z_j) \end{align}
Overall, unregularized maximum likelihood will drive the softmax to predict the fraction of counts of each outcome observed in the training set.
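A numerically stable softmax / log-softmax sketch using the standard max-subtraction trick, which also makes the approximation above visible when one logit dominates (the logits are arbitrary):
In [ ]:
def softmax(z):
    """Stable softmax: subtracting max(z) does not change the result."""
    z = z - np.max(z)
    e = np.exp(z)
    return e / e.sum()

def log_softmax(z):
    """log softmax(z)_i = z_i - log sum_j exp(z_j), computed stably."""
    z = z - np.max(z)
    return z - np.log(np.sum(np.exp(z)))

z = np.array([1.0, 2.0, 1000.0])     # a naive exp(z) would overflow here
print(softmax(z))
print(log_softmax(z))                # ~ z_i - max_j z_j for the dominated entries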
The argument $z$ can be produced in two different ways: most commonly, an earlier linear layer outputs all $n$ elements, $z = W^T h + b$; alternatively, only $n - 1$ elements are produced and the remaining one is fixed (e.g. to zero), which is what the sigmoid does in the binary case.
softmax provides a "softened" version of the argmax.
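A quick illustration (arbitrary logits) of the "softened" arg max: scaling the logits up pushes the softmax toward a one-hot arg max, while scaling them down flattens it toward uniform.
In [ ]:
z = np.array([1.0, 2.0, 3.0])                     # arbitrary logits
for scale in (0.1, 1.0, 10.0):
    e = np.exp(scale * z - np.max(scale * z))
    print(scale, e / e.sum())                     # near-uniform -> ... -> nearly one-hot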
In general, think of a neural network as representing a function $f(x; \theta) = w$, which provides the parameters for a distribution over $y$. Our loss function can then be interpreted as $- \log p(y; w(x))$.
The widespread saturation of sigmoidal units => their use as hidden units is now discouraged.
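A small check of why saturation matters: the sigmoid's derivative $\sigma(z)(1-\sigma(z))$ vanishes at both tails, while the ReLU keeps a unit gradient wherever it is active (the sample points are illustrative).
In [ ]:
sigmoid = lambda x: 1.0 / (1.0 + np.exp(-x))

z = np.array([-10.0, -2.0, 0.0, 2.0, 10.0])
print(sigmoid(z) * (1 - sigmoid(z)))   # ~0 at both tails: gradients vanish when saturated
print((z > 0).astype(float))           # ReLU gradient: 1 wherever the unit is active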
universal approximation theorem: a large enough MLP will be able to represent any function. However, we are not guaranteed that the training algorithm will be able to learn that function.
Empirically, greater depth does seem to result in better generalization.