Neuron pre-activation (or input-activation): $a(\mathbf{x}) = b + \sum_i w_i x_i = b + \mathbf{w}^T\mathbf{x}$
$w_i$ are the connection weights
$b$ is the bias
Neuron (output) activation: $h(\mathbf{x}) = g\left(a(\mathbf{x})\right) = g\left(b + \mathbf{w}^T \mathbf{x}\right)$
$g(\cdot)$ is called the activation function
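A minimal sketch of a single neuron's forward pass, assuming a sigmoid activation; the weight, bias, and input values here are illustrative:
In [ ]:
import numpy as np

def sigmoid(a):
    return 1 / (1 + np.exp(-a))

w = np.array([0.5, -1.0, 2.0])   # connection weights (illustrative)
b = 0.1                          # bias (illustrative)
x = np.array([1.0, 0.0, -0.5])   # input vector (illustrative)

a = b + w @ x        # pre-activation a(x) = b + w^T x
h = sigmoid(a)       # output activation h(x) = g(a(x))
print(a, h)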
In [4]:
import matplotlib.pyplot as plt
import numpy as np
%matplotlib inline

x = np.arange(-5, 5.1, 0.1)

def sigmoid(vec):
    # logistic sigmoid: 1 / (1 + e^{-a})
    return 1 / (1 + np.exp(-vec))

def tanh_activation(vec):
    # hyperbolic tangent: (e^a - e^{-a}) / (e^a + e^{-a})
    return (np.exp(vec) - np.exp(-vec)) / (np.exp(vec) + np.exp(-vec))

def rectlinear(vec):
    # rectified linear (ReLU): max(0, a)
    output = vec.copy()
    output[vec < 0] = 0
    return output

# plot four activation functions: linear, sigmoid, tanh, ReLU
plt.figure(figsize=(8, 8))
plt.subplot(2, 2, 1)
plt.plot(x, x)
plt.subplot(2, 2, 2)
plt.plot(x, sigmoid(x))
plt.subplot(2, 2, 3)
plt.plot(x, tanh_activation(x))
plt.subplot(2, 2, 4)
plt.plot(x, rectlinear(x))
plt.show()
A single neuron with sigmoid activation can be interpreted as an estimate of $p(y=1\vert \mathbf{x})$.
A similar idea can also be used for the tanh activation.
With multiple outputs, the NN estimates $f(\mathbf{x})_c = p(y=c\vert \mathbf{x})$ for each class $c$
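A minimal sketch of turning the network's output scores into class probabilities $f(\mathbf{x})_c$, assuming a softmax output layer; the score values are illustrative:
In [ ]:
import numpy as np

def softmax(scores):
    # subtract the max score before exponentiating, for numerical stability
    exps = np.exp(scores - np.max(scores))
    return exps / exps.sum()

scores = np.array([2.0, 1.0, -0.5])   # pre-softmax outputs for 3 classes (illustrative)
probs = softmax(scores)               # estimates of p(y=c | x)
print(probs, probs.sum())             # probabilities sum to 1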
Another variation (negative log-likelihood): $\displaystyle \theta^* = \arg \min_\theta \left(- \sum_t \log P(y=y_t\vert \mathbf{x}_t)\right)$, where $y_t$ is the label of training example $t$
Numerical stability: the probabilities $P(\cdot)$ are small numbers, so multiplying many of them together loses precision (underflow). Taking the $\log$ turns the product into a sum and keeps the computation numerically stable.
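A quick illustration of this; the probability value $10^{-5}$ and the count of 100 factors are arbitrary:
In [ ]:
import numpy as np

probs = np.full(100, 1e-5)       # 100 small probabilities

product = np.prod(probs)         # underflows to 0.0 in float64
log_sum = np.sum(np.log(probs))  # remains a well-behaved finite number

print(product)   # 0.0 -- precision lost
print(log_sum)   # about -1151.3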
Notation: $[x]_+ = \max (0, x)$
Manhattan Loss: $\|y - f(x)\|_1$
Huber Loss (see the code sketch below): $\left\{\begin{array}{ll} \frac{1}{2}\|y-f(x)\|_2^2 & \text{for } \|y-f(x)\|_2^2 \le \delta^2\\ \delta \|y-f(x)\|_1 - \frac{1}{2}\delta^2 & \text{otherwise}\end{array} \right.$
KL divergence (if dealing with probabilities): $\sum p_i \log \frac{p_i}{q_i}$
Note: In classification, the error (risk/misclassification) is determined by the product $y_i f(x_i)$ (with $y_i \in \{-1,+1\}$, its sign tells whether the prediction is correct). In regression, it is determined by the difference $y_i-f(x_i)$.
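A minimal NumPy sketch of the regression losses above; the residual values, the choice $\delta = 1$, and the probability vectors are illustrative:
In [ ]:
import numpy as np

def manhattan_loss(y, f):
    # L1 (Manhattan) loss: sum of absolute errors
    return np.sum(np.abs(y - f))

def huber_loss(y, f, delta=1.0):
    # quadratic when the residual is small (in L2 norm), linear otherwise
    sq = np.sum((y - f)**2)                    # ||y - f||_2^2
    if sq <= delta**2:
        return 0.5 * sq
    return delta * np.sum(np.abs(y - f)) - 0.5 * delta**2

def kl_divergence(p, q):
    # KL(p || q) for probability vectors with strictly positive entries
    return np.sum(p * np.log(p / q))

y = np.array([1.0, 2.0, 3.0])
f = np.array([1.5, 0.0, 3.1])
print(manhattan_loss(y, f), huber_loss(y, f))

p = np.array([0.7, 0.2, 0.1])
q = np.array([0.5, 0.3, 0.2])
print(kl_divergence(p, q))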
Embedding: We want to map vectors $x, y$ into a new space $Z$ to get $z_x, z_y$. Then we compare $z_x, z_y$ in this new space: $$-1 \le \frac{z_x^T z_y}{\|z_x\| \ \|z_y\|} \le +1$$
Cosine similarity: $\frac{x^T y}{\|x\|\ \|y\|}$ (the cosine distance is one minus this quantity)
Triplet loss: $(1 + d(x_i,x_j) - d(x_i,x_k))_+$, where $x_j$ is a similar (positive) example, $x_k$ is a dissimilar (negative) example, and the margin is 1
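A minimal sketch of the cosine comparison and the triplet loss, assuming $d$ is taken to be one minus the cosine similarity; the embedding vectors are illustrative:
In [ ]:
import numpy as np

def cosine_similarity(a, b):
    # value in [-1, +1]; +1 means same direction
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

def triplet_loss(z_anchor, z_pos, z_neg, margin=1.0):
    # hinge form (margin + d(anchor, pos) - d(anchor, neg))_+
    d_pos = 1 - cosine_similarity(z_anchor, z_pos)
    d_neg = 1 - cosine_similarity(z_anchor, z_neg)
    return max(0.0, margin + d_pos - d_neg)

z_x = np.array([1.0, 0.5])      # anchor embedding (illustrative)
z_pos = np.array([0.9, 0.6])    # similar example
z_neg = np.array([-1.0, 0.2])   # dissimilar example
print(cosine_similarity(z_x, z_pos))
print(triplet_loss(z_x, z_pos, z_neg))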
$L_1$ norm: $\Omega(\theta) = \sum_k \sum_i \sum_j |W_{i,j}^{(k)}|$ (summed over all layers $k$ and all weight entries $i,j$)
p-norm: $\|\theta\|_p = \left(\sum_i |\theta_i|^p\right)^{1/p}$
Geometric interpretation:
$\|y-Wx\|_2^2$ forms contours (level sets) in parameter space. In two dimensions the contours are circles in the special (isotropic) case and ellipses in general; in higher dimensions they are ellipsoids.
Regularization: $\|y - Wx\|_2^2 + \lambda \|W\|_2^2$
The solution lies where these contours meet the contours (norm balls) of the regularization term:
$L_2$: the regularizer's contours are circles (spheres in higher dimensions), so the solution shrinks the weights smoothly toward zero, but rarely to exactly zero (unlike $L_1$).
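A minimal sketch of computing the $L_1$ and $L_2$ penalties over all weight matrices $W^{(k)}$ and adding one of them to a data-fit term; the weight shapes, $\lambda$, and the placeholder data-fit value are illustrative:
In [ ]:
import numpy as np

rng = np.random.default_rng(0)
# illustrative weight matrices W^(k) of a 2-layer network
weights = [rng.normal(size=(4, 3)), rng.normal(size=(3, 2))]

def l1_penalty(weights):
    # Omega(theta) = sum_k sum_i sum_j |W_ij^(k)|
    return sum(np.sum(np.abs(W)) for W in weights)

def l2_penalty(weights):
    # Omega(theta) = sum_k sum_i sum_j (W_ij^(k))^2
    return sum(np.sum(W**2) for W in weights)

data_fit = 0.42   # placeholder for ||y - Wx||_2^2 on the training data
lam = 0.01        # regularization strength lambda
objective = data_fit + lam * l2_penalty(weights)
print(l1_penalty(weights), l2_penalty(weights), objective)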