Feed-Forward Networks

A basic artificial neuron

  • Neuron pre-activation (or input-activation): $a(\mathbf{x}) = b + \sum_i w_i x_i = b + \mathbf{w}^T\mathbf{x}$
    $w_i$ are called the connection weights
    $b$ is the bias

  • Neuron (output) activation: $h(\mathbf{x}) = g\left(a(\mathbf{x})\right) = g\left(b + \mathbf{w}^T \mathbf{x}\right)$
    $g(\cdot)$ is called the activation function
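
As a quick illustration, here is a minimal NumPy sketch of a single neuron; the weights, bias, and input are made-up values, and the sigmoid from the next section is used as $g(\cdot)$:

In [ ]:
import numpy as np

# Made-up parameters for a single neuron with 3 inputs
w = np.array([0.5, -1.0, 2.0])   # connection weights w_i
b = 0.1                          # bias b
x = np.array([1.0, 0.5, -0.5])   # input vector x

a = b + w @ x                    # pre-activation: b + w^T x
h = 1 / (1 + np.exp(-a))         # output activation with sigmoid g(.)
print(a, h)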

Activation functions

  • Linear Activation: $g(x) = x$
  • Sigmoid Activation: $\displaystyle g(x) = \frac{1}{1+e^{-x}}$
  • Hyperbolic Tangent (Tanh) Activation: $\displaystyle g(x) = \frac{e^x - e^{-x}}{e^x + e^{-x}}$
  • Rectified Linear Activation: $g(x) = \max(x,0)$

In [4]:
import matplotlib.pyplot as plt
import numpy as np

%matplotlib inline

x = np.arange(-5, 5.1, 0.1)

def sigmoid(vec):
    return 1 / (1 + np.exp(-vec))

def tanh_activation(vec):
    return (np.exp(vec) - np.exp(-vec)) / (np.exp(vec) + np.exp(-vec))

def rectlinear(vec):
    # zero out the negative part: max(x, 0)
    output = vec.copy()
    output[vec < 0] = 0
    return output

# Plot the four activation functions on a 2x2 grid
plt.figure(figsize=(8, 8))
plt.subplot(2, 2, 1)
plt.plot(x, x)
plt.title('Linear')
plt.subplot(2, 2, 2)
plt.plot(x, sigmoid(x))
plt.title('Sigmoid')
plt.subplot(2, 2, 3)
plt.plot(x, tanh_activation(x))
plt.title('Tanh')
plt.subplot(2, 2, 4)
plt.plot(x, rectlinear(x))
plt.title('Rectified Linear')
plt.show()


Binary classification

  • A single neuron with sigmoid activation can be interpreted as an estimation of $p(y=1\vert \mathbf{x})$.

    • $\text{If } p(y=1\vert \mathbf{x}) > 0.5 \Rightarrow \ \text{ predict class=1}$
    • $\text{If } p(y=1\vert \mathbf{x}) < 0.5 \Rightarrow \ \text{ predict class=0}$
      This is also known as Logistic Regression
  • A similar idea can also be used for the Tanh activation (thresholding its output at 0); a small sketch follows below.
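
A minimal sketch of this interpretation, with hand-picked (made-up) weights:

In [ ]:
import numpy as np

def predict(x, w, b):
    """Single sigmoid neuron read as p(y=1|x); threshold at 0.5."""
    p = 1 / (1 + np.exp(-(b + w @ x)))
    return int(p > 0.5), p

# Made-up weights and inputs, purely for illustration
w, b = np.array([2.0, -1.0]), -0.5
print(predict(np.array([1.0, 0.2]), w, b))   # p > 0.5  ->  predict class 1
print(predict(np.array([-1.0, 1.0]), w, b))  # p < 0.5  ->  predict class 0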

Capacity of a single neuron

  • A single neuron can solve problems with linearly separable classes.
  • But it cannot solve problems that are not linearly separable (e.g., XOR); see the sketch below.
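
A quick sketch of both claims (the weights for AND are hand-picked, as an assumption for illustration): a single thresholded neuron can represent the linearly separable AND function, but no choice of $(w, b)$ reproduces XOR, which is not linearly separable.

In [ ]:
import itertools
import numpy as np

def neuron(x, w, b):
    # single sigmoid neuron thresholded at 0.5
    return int(1 / (1 + np.exp(-(b + w @ x))) > 0.5)

X = [np.array(p, dtype=float) for p in itertools.product([0, 1], repeat=2)]

# AND is linearly separable: these hand-picked weights solve it
w_and, b_and = np.array([1.0, 1.0]), -1.5
print([neuron(x, w_and, b_and) for x in X])   # [0, 0, 0, 1]

# XOR targets: no single (w, b) can produce this output pattern
print([int(x[0]) ^ int(x[1]) for x in X])     # [0, 1, 1, 0]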

Parameter Learning in Neural Networks

  • based on "Empricial risk minimization"
$$arg \min_\theta \frac{1}{T} l(f(x^t;\theta), y^t) + \lambda \Omega (\theta)$$
  • Cost funciton: $l(f(x^t;\theta), y^t)$ (also called risk)
  • Regularizer: $\Omega (\theta)$
  • $\lambda$ is a trade-off between the two terms
  • NN estimates $f(x)_c = p(y=c\vert x)$

    • we want to maximize the likelihood of the training labels: $\displaystyle \theta^* = \arg \max_\theta \prod_t P(y=c\vert x_t)$ (together with the regularization term $\lambda \Omega (\theta)$)
    • Since $\log$ is a monotonically increasing function, maximizing $\prod_t P(y=c\vert x_t)$ is equivalent to maximizing $\displaystyle \frac{1}{T} \sum_t \log P(y=c\vert x_t)$.
    • another variation: minimize the negative log-likelihood, $\displaystyle \theta^* = \arg \min_\theta \left(- \sum_t \log P(y=c\vert x_t)\right)$

    • Numerical instability: since the $P(\cdot)$ values are small numbers, multiplying many of them together loses precision (underflow). Computing with $\log$ probabilities (sums instead of products) helps with numerical stability; see the sketch below.
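
A small numerical sketch of this point (the probabilities are made-up random values): the raw product underflows to zero in double precision, while the sum of logs stays finite.

In [ ]:
import numpy as np

rng = np.random.default_rng(0)
p = rng.uniform(1e-4, 1e-2, size=500)   # made-up small probabilities

prod = np.prod(p)             # underflows to 0.0 in float64
log_sum = np.sum(np.log(p))   # finite and numerically stable
print(prod, log_sum)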

Loss functions for classification

  • $0{-}1$ loss: $l(\cdot)=\sum_i \mathcal{I}(\hat{y}_i \neq y_i)$
    • The ideal choice, but not practical for optimization (piecewise constant, with zero gradient almost everywhere)
  • Surrogate loss functions
    • Squared loss: $(y-f(x))^2$ (a poor surrogate for classification, but still used)
    • Logistic loss: $\log \left(1 + e^{-y f(x)}\right)$
    • Hinge loss: $(1 - yf(x))_+$
    • Squared Hinge loss: $(1-yf(x))_+^2$

Notation: $(x)_+ = \max (0, x)$
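
A minimal sketch of these losses written as functions of the margin $m = y f(x)$ (the function names are mine, for illustration only):

In [ ]:
import numpy as np

def zero_one(m):   return (m <= 0).astype(float)     # counts margin <= 0 as an error
def logistic(m):   return np.log(1 + np.exp(-m))     # log(1 + e^{-y f(x)})
def hinge(m):      return np.maximum(0, 1 - m)       # (1 - y f(x))_+
def sq_hinge(m):   return np.maximum(0, 1 - m) ** 2  # (1 - y f(x))_+^2

m = np.linspace(-2, 2, 5)
for loss in (zero_one, logistic, hinge, sq_hinge):
    print(loss.__name__, loss(m))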

Loss functions for regression

  • Euclidean Loss: $\|y - f(x)\|_2^2$
  • Manhattan Loss: $\|y - f(x)\|_1$

    • For residuals with magnitude below 1, the Manhattan loss penalizes more aggressively than the Euclidean loss
    • But it is less aggressive for larger residuals
  • Huber Loss: $\left\{\begin{array}{lr} \frac{1}{2}\|y-f(x)\|_2^2 & \text{for } \|y-f(x)\|_2^2 \le \delta^2\\ \delta \|y-f(x)\|_1 - \frac{1}{2}\delta^2 & \text{otherwise}\end{array} \right.$

  • KL divergence (if dealing with probabilities): $\sum p_i \log \frac{p_i}{q_i}$

Note: In classification, the error (risk/misclassification) is determined by the product $y_i f(x_i)$, but in regression it is determined by the difference $y_i-f(x_i)$.

  • Euclidean distance: $\|x - y\|_2^2 = (x-y)^T (x-y)$
  • Mahalanobis metric: $(x-y)^T M (x-y)$
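
A short sketch of the regression losses and distance metrics above, applied to the residual $r = y - f(x)$ (the residual values, $\delta$, and $M$ are arbitrary made-up choices):

In [ ]:
import numpy as np

def euclidean_loss(r):   return np.sum(r ** 2)      # ||y - f(x)||_2^2
def manhattan_loss(r):   return np.sum(np.abs(r))   # ||y - f(x)||_1

def huber_loss(r, delta=1.0):
    # quadratic near zero, linear for large residuals
    a = np.abs(r)
    return np.sum(np.where(a <= delta, 0.5 * a ** 2, delta * a - 0.5 * delta ** 2))

def mahalanobis(x, y, M):
    return (x - y) @ M @ (x - y)                    # (x - y)^T M (x - y)

r = np.array([0.2, -1.5, 3.0])                      # made-up residuals
print(euclidean_loss(r), manhattan_loss(r), huber_loss(r))
print(mahalanobis(np.array([1.0, 2.0]), np.array([0.0, 1.0]), np.eye(2)))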

Loss functions for embeddings

Embedding: We want to map vectors $x,y$ into a new space $Z$ to get $z_x,z_y$. Then we compare $z_x,z_y$ in this new space: $$-1 \le \frac{z_x^T z_y}{\|z_x\| \ \|z_y\|} \le +1$$

  • Cosine similarity: $\frac{x^T y}{\|x\|\ \|y\|}$ (the corresponding cosine distance is one minus this score)

  • Triplet loss: $(1 + d(x_i,x_j) - d(x_i,x_k))_+$

    • consider a triplet of points $i,j,k$
    • $i,j$ are from the same class, while $i,k$ are from two different classes
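
A minimal sketch of the cosine score and the triplet loss in the embedding space (the embeddings and the squared-Euclidean choice for $d(\cdot,\cdot)$ are assumptions for illustration):

In [ ]:
import numpy as np

def cosine(zx, zy):
    # x^T y / (||x|| ||y||), always in [-1, +1]
    return zx @ zy / (np.linalg.norm(zx) * np.linalg.norm(zy))

def triplet_loss(zi, zj, zk, d=lambda a, b: np.sum((a - b) ** 2)):
    # (1 + d(anchor, same-class) - d(anchor, different-class))_+
    return max(0.0, 1 + d(zi, zj) - d(zi, zk))

# Made-up embeddings: z_i and z_j share a class, z_k does not
zi, zj, zk = np.array([1.0, 0.0]), np.array([0.9, 0.1]), np.array([-1.0, 0.5])
print(cosine(zi, zj), triplet_loss(zi, zj, zk))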

Regularization

  • $L_2$ regularization: $\Omega(\theta) = \sum_k\sum_i\sum_j \left(W_{i,j}^k\right)^2$
    • Gradient: $\nabla_{W^k}\, \Omega(\theta) = 2 W^k$
  • $L_1$ norm: $\Omega(\theta) = \sum_k\sum_i\sum_j |W_{i,j}^k|$

    • Gradient: $\nabla_{W_{i,j}^k}\,\Omega(\theta) = \text{sign}(W_{i,j}^k) = \left\{\begin{array}{lr} 1 & \text{for } W_{i,j}^k>0\\ -1 & \text{for } W_{i,j}^k<0 \end{array}\right.$
    • At zero, we have to use a sub-gradient: any slope between the two one-sided slopes $-1$ and $+1$ is valid (see the sketch after this list)
  • p-norm: $\displaystyle \left(\sum_i |W_i|^p\right)^{1/p}$

    • $p=\infty$ (the maximum absolute value)
    • $p=2$ (same as $L_2$ norm)
    • $p=1$ (same as $L_1$ norm)
    • $0<p<1$ (no longer convex)
    • $p = 0$ (counts the non-zero entries; not a true norm)
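
A small sketch of the $L_2$ and $L_1$ penalties and their (sub)gradients over a list of weight matrices (the weight values are made-up; using 0 as the sub-gradient of $|\cdot|$ at zero is one common choice):

In [ ]:
import numpy as np

def l2_penalty(weights):
    return sum(np.sum(W ** 2) for W in weights)     # sum of squared weights

def l2_grad(W):
    return 2 * W                                    # gradient of sum of W^2

def l1_penalty(weights):
    return sum(np.sum(np.abs(W)) for W in weights)  # sum of absolute weights

def l1_subgrad(W):
    return np.sign(W)                               # sign(W); returns 0 at W = 0

weights = [np.array([[0.5, -2.0], [0.0, 1.0]])]     # made-up weight matrix
print(l2_penalty(weights), l1_penalty(weights))
print(l2_grad(weights[0]))
print(l1_subgrad(weights[0]))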

Effect of regularization:

Geometric interpretation:

  • $\|y-Wx\|_2^2$ forms contours in the parameter space. In two dimensions the contours are circles in the special (isotropic) case, and ellipses otherwise; in higher dimensions they are ellipsoids.

  • Regularization: $\|y - Wx\|_2^2 + \lambda \|W\|_2^2$

    • the solution will be at the intersection of the above contours with the level sets (norm balls) of the regularization term:

    • $L_2$: the level sets are circles (spheres), so the intersection generally does not lie on an axis and no parameter is pushed exactly to zero.

    • $L_1$: its level sets have corners on the axes, so the intersection tends to occur at an axis $\Rightarrow$ some of the parameters will be exactly zero.
    • For any $p \le 1$, the level sets have sharp corners; as a result, we get sparse solutions (see the sketch below).
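
To make the sparsity effect concrete, here is a small numerical sketch (the synthetic data, $\lambda$, and the ISTA solver are my own illustrative choices, not part of the notes): the $L_2$-regularized solution shrinks all weights but keeps them non-zero, while the $L_1$-regularized solution sets several weights exactly to zero.

In [ ]:
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 10))
w_true = np.zeros(10)
w_true[:3] = [2.0, -1.5, 1.0]                      # only 3 informative features
y = X @ w_true + 0.1 * rng.normal(size=100)

lam = 5.0

# L2 (ridge): closed-form minimizer of ||y - Xw||^2 + lam * ||w||_2^2
w_l2 = np.linalg.solve(X.T @ X + lam * np.eye(10), X.T @ y)

# L1 (lasso-style): proximal gradient (ISTA) on 0.5*||y - Xw||^2 + lam * ||w||_1
w_l1 = np.zeros(10)
step = 1.0 / np.linalg.norm(X, 2) ** 2             # 1 / Lipschitz constant of the smooth part
for _ in range(2000):
    z = w_l1 - step * (X.T @ (X @ w_l1 - y))       # gradient step
    w_l1 = np.sign(z) * np.maximum(np.abs(z) - step * lam, 0.0)  # soft-thresholding

print("L2 exact zeros:", np.sum(np.abs(w_l2) < 1e-8))   # typically none
print("L1 exact zeros:", np.sum(np.abs(w_l1) < 1e-8))   # several: sparse solution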

In [ ]: