Pre-whitening
$$\text{Assumption: } X \sim \mathcal{N}(\mu, \Sigma)$$
We want to get $\hat{X}$ such that: $$\hat{X} \sim \mathcal{N}(0, I)$$
This can be done by subtracting the mean and applying $\Sigma^{-1/2}$: $$\hat{X} = \Sigma^{-1/2}(X - \mu)$$
Why is this called whitening? The name comes from white noise: if you apply a Fourier transform to white noise, you get a flat spectrum, which indicates that white noise has the same amplitude (energy) at all frequencies.
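As a rough sketch (my own illustration, not code from these notes), the whitening transform can be implemented in NumPy by subtracting the sample mean and applying $\Sigma^{-1/2}$, computed from the eigendecomposition of the sample covariance; the toy data below is assumed for the example:

```python
import numpy as np

rng = np.random.default_rng(0)
# Toy correlated Gaussian data (assumed example, not from the notes)
X = rng.multivariate_normal(mean=[2.0, -1.0], cov=[[2.0, 0.8], [0.8, 1.0]], size=1000)

mu = X.mean(axis=0)                       # sample mean
Sigma = np.cov(X, rowvar=False)           # sample covariance
eigvals, eigvecs = np.linalg.eigh(Sigma)  # Sigma = U diag(lambda) U^T

# Sigma^{-1/2} = U diag(lambda^{-1/2}) U^T
Sigma_inv_sqrt = eigvecs @ np.diag(eigvals ** -0.5) @ eigvecs.T
X_hat = (X - mu) @ Sigma_inv_sqrt         # approximately N(0, I)

print(X_hat.mean(axis=0))           # ~ [0, 0]
print(np.cov(X_hat, rowvar=False))  # ~ identity matrix
```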
Initialize the weights so that the dynamics of the weights early in training stay in a good regime (activations neither vanish nor saturate).
Just as we want the input features to be normalized, we want the outputs of each layer to be normalized.
Ideas:
Initializing the weights to all zeros is a *very* bad idea. All output neurons compute the same value and receive the same gradient, so they stay identical and the symmetry is never broken.
Initialize randomly
Weights should not be too small, because that results in vanishing gradients (the signal shrinks layer after layer).
Make the standard deviation of the weights $\frac{1}{\sqrt{n}}$, i.e. $\operatorname{Var}(w_i) = \frac{1}{n}$, where $n$ is the number of inputs to the neuron.
Xavier initialization: $n\,\operatorname{Var}(w_i) = 1$
In practice (with ReLU), it's better if $\operatorname{Var}(w_i) = \frac{2}{n}$, i.e. a standard deviation of $\sqrt{\frac{2}{n}}$ (He initialization); see the sketch after this list.
Sigmoid and tanh should be avoided if the variance of the weights is too small or too large: as the plot of the sigmoid shows, at large $|x|$ both sigmoid and tanh become very flat, so the gradients are very, very small.
ReLU and its variants are better
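Below is a minimal sketch (my own illustration, not the notes' code) comparing the initializations above for a fully connected layer; the layer sizes and unit-variance inputs are assumptions for the example:

```python
import numpy as np

rng = np.random.default_rng(0)
n, m = 512, 256  # assumed fan-in and fan-out

W_zeros  = np.zeros((n, m))                                   # bad: all neurons stay identical
W_xavier = rng.normal(0.0, 1.0 / np.sqrt(n), size=(n, m))     # Var(w) = 1/n  (Xavier)
W_he     = rng.normal(0.0, np.sqrt(2.0 / n), size=(n, m))     # Var(w) = 2/n  (He, for ReLU)

# Sanity check: with unit-variance inputs, Xavier keeps the pre-activation
# variance close to 1, so the signal neither vanishes nor explodes.
x = rng.normal(size=(1000, n))
print(np.var(x @ W_xavier))  # ~ 1
print(np.var(x @ W_he))      # ~ 2
```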
When the model is overfitting, we can use dropout: randomly dropping units during training means each step effectively trains a smaller subnetwork, which acts as a regularizer.
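A minimal sketch of inverted dropout (my own illustration; the function name `dropout` and the keep probability `p` are assumptions, not from the notes):

```python
import numpy as np

def dropout(h, p=0.5, train=True, rng=np.random.default_rng()):
    """Randomly zero units with probability 1 - p during training.

    Dividing by p keeps the expected activation unchanged, so no rescaling
    is needed at test time (inverted dropout)."""
    if not train:
        return h
    mask = (rng.random(h.shape) < p).astype(h.dtype)
    return h * mask / p

h = np.ones((4, 8))
print(dropout(h, p=0.5))        # roughly half the units zeroed, survivors scaled by 2
print(dropout(h, train=False))  # identity at test time
```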