
Data Augmentation

  • Normalize scale of data
  • Pre-whitening

    $$\text{Assumption: } X \sim \mathcal{N}(\mu, \Sigma)$$

We want to get $\hat{X}$ such that $$\hat{X} \sim \mathcal{N}(0, I)$$

This can be done by:

  1. subtract the empirical mean $$\hat{X}_\mu = X - \mu$$
  2. compute the covariance matrix $\Sigma$ of the centered data, then $$\hat{X} = \Sigma^{-\frac{1}{2}} ~ \hat{X}_\mu$$

Why is this called whitening? The name comes from white noise. If you apply a Fourier transform to white noise, you see a flat spectrum, which indicates that white noise has the same amplitude (energy) at all frequencies.
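A minimal NumPy sketch of the two steps above (the `whiten` helper and the small `eps` regularizer on the eigenvalues are illustrative choices, not part of the notes):

```python
import numpy as np

def whiten(X, eps=1e-8):
    """Pre-whitening: zero mean and (approximately) identity covariance.
    Here rows of X are samples, so we right-multiply by the symmetric Sigma^{-1/2}."""
    mu = X.mean(axis=0)
    X_mu = X - mu                              # step 1: subtract the empirical mean
    Sigma = np.cov(X_mu, rowvar=False)         # step 2: empirical covariance matrix
    # Sigma^{-1/2} via eigendecomposition; eps guards against tiny eigenvalues
    eigvals, eigvecs = np.linalg.eigh(Sigma)
    Sigma_inv_sqrt = eigvecs @ np.diag(1.0 / np.sqrt(eigvals + eps)) @ eigvecs.T
    return X_mu @ Sigma_inv_sqrt

# Whitened data should have ~zero mean and ~identity covariance
X = np.random.randn(1000, 3) * np.array([3.0, 1.0, 0.2]) + np.array([5.0, -2.0, 0.5])
X_hat = whiten(X)
print(X_hat.mean(axis=0).round(3))
print(np.cov(X_hat, rowvar=False).round(3))
```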

Problem setup

  • Very little data, similar dataset
  • Very little data, different datasets
  • Lots of data, similar dataset: fine-tune existing well-trained networks
  • Lots of data, different datasets
  • Fine-tuning existing well-trained models
  • Balancing imbalanced datasets
  • Multi-task learning (combining loss function)
    • Example: for car detection and segmentation, we can combine the two tasks and jointly learn a model that does both (see the sketch after this list).
      The idea is that some of the weights will be shared: for example, the weights that learn edges and other low-level features. $$loss = loss_{det} + \lambda ~ loss_{seg}$$
  • How to handle very large datasets?
    • Fitting the model on a single GPU:
      • use lower-precision computation (float8, float16, float32)
        It has been shown that training typically requires higher precision, while for testing (inference) lower precision is sufficient.
    • Model parallelism: split model across different GPUs
    • Data parallelism: split the data across GPUs
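A minimal sketch of the shared-weights / combined-loss idea from the multi-task bullet above, assuming PyTorch; the `MultiTaskNet` architecture, its head names, and the particular loss functions are illustrative placeholders, not a real detection/segmentation model:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MultiTaskNet(nn.Module):
    """Shared backbone with one head per task; backbone weights are learned jointly."""
    def __init__(self):
        super().__init__()
        self.backbone = nn.Sequential(nn.Conv2d(3, 16, 3, padding=1), nn.ReLU())
        self.det_head = nn.Conv2d(16, 4, 1)   # e.g. box-regression outputs
        self.seg_head = nn.Conv2d(16, 2, 1)   # e.g. per-pixel class scores

    def forward(self, x):
        feats = self.backbone(x)              # shared features (edges, low-level patterns)
        return self.det_head(feats), self.seg_head(feats)

model = MultiTaskNet()
x = torch.randn(2, 3, 32, 32)
det_out, seg_out = model(x)

det_target = torch.randn_like(det_out)         # dummy regression targets
seg_target = torch.randint(0, 2, (2, 32, 32))  # dummy segmentation labels

lam = 0.5  # the lambda that weights loss_seg against loss_det
loss = F.mse_loss(det_out, det_target) + lam * F.cross_entropy(seg_out, seg_target)
loss.backward()  # gradients from both tasks flow into the shared backbone
```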

Weight initialization

  • Initialize the weights so that the dynamics of the activations stay close to normalized

  • Just as we want the input features to be normalized, we also want the outputs of each layer to be normalized

Ideas:

  • Initializing the weights to be all zeros is a *very* bad idea: every output neuron computes the same value and receives the same gradient, so all neurons behave identically.

  • Initialize randomly

  • Weights should not be too small, because it results in vanishing gradients

  • Scale the weights by $\frac{1}{\sqrt{N}}$ (i.e. make their variance $\frac{1}{N}$), where $N$ is the number of inputs to the neuron

Xavier initialization: $n \, Var(w_i) = 1$, i.e. $Var(w_i) = \frac{1}{n}$

In practice, especially with ReLU units, it works better if $Var(w_i) = \frac{2}{n}$ (He initialization)
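A minimal NumPy sketch contrasting the two rules (the layer sizes are arbitrary examples):

```python
import numpy as np

n_in, n_out = 512, 256

# Xavier: n * Var(w) = 1, i.e. std = 1/sqrt(n_in)
W_xavier = np.random.randn(n_in, n_out) / np.sqrt(n_in)

# He initialization (works better with ReLU in practice): Var(w) = 2/n_in
W_he = np.random.randn(n_in, n_out) * np.sqrt(2.0 / n_in)

print(W_xavier.var(), 1.0 / n_in)   # roughly equal
print(W_he.var(), 2.0 / n_in)       # roughly equal
```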

Choosing activation functions

  • If the variance of the weights is too small or too large, both sigmoid and tanh should be avoided (this can be seen in the plot of the sigmoid function: at large $|x|$ values, sigmoid and tanh become very flat, meaning that the gradients are very small)

  • ReLU and its variants are better (see the sketch after this list)

    • ReLU
    • Leaky ReLU / PReLU
    • Randomized ReLU
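A minimal NumPy sketch of these variants; the fixed `alpha` default and the randomized range are illustrative values (PReLU learns `alpha`, and randomized ReLU samples it during training):

```python
import numpy as np

def relu(x):
    return np.maximum(0.0, x)

def leaky_relu(x, alpha=0.01):
    # PReLU has the same form, but alpha is a learned parameter
    return np.where(x > 0, x, alpha * x)

def randomized_relu(x, low=0.125, high=0.333, training=True):
    # Sample alpha per activation at train time, use the mean slope at test time
    alpha = np.random.uniform(low, high, size=x.shape) if training else (low + high) / 2
    return np.where(x > 0, x, alpha * x)
```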

Regularization

  • $L_1$ regularization: $\lambda_1 \|w\|_1$
  • $L_2$ regularization: $\lambda_2 \|w\|_2^2$
  • $L_1 + L_2$ regularization: $\lambda_1 \|w\|_1 + \lambda_2 \|w\|_2^2$
  • Max-norm: constrain $\|w\|_2 \leq c$
  • MaxOut
  • Dropout and DropConnect

When we are overfitting, we can use dropout to effectively reduce the model's capacity (each training step uses a smaller, randomly thinned network).
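A minimal NumPy sketch of an $L_1 + L_2$ penalty and of (inverted) dropout; the function names and default hyperparameters are illustrative choices:

```python
import numpy as np

def regularized_loss(data_loss, W, lam1=1e-4, lam2=1e-4):
    """Add L1 and L2 penalties on the weights to the data loss."""
    return data_loss + lam1 * np.abs(W).sum() + lam2 * np.square(W).sum()

def dropout(a, p=0.5, training=True):
    """Inverted dropout: drop units with probability p at train time and
    rescale by 1/(1-p) so that no change is needed at test time."""
    if not training:
        return a
    mask = (np.random.rand(*a.shape) >= p) / (1.0 - p)
    return a * mask
```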

Optimization Tricks

  • Normalize gradients by mini-batch size
  • Schedule a learning rate (LR):
    • Set initial learning rate to $10^{-1}, 10^{-2}$
    • Once the error reaches a plateau on the validation set, reduce the LR by a factor of 2 and continue (see the sketch after this list)
    • Exponentially decaying LR
  • Early stopping: the training optimum is not the optimum for generalization
  • Gradient clipping:
    • threshold the gradients
    • robustness to noise and outliers
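A minimal NumPy sketch of the step-wise LR schedule, exponential decay, and gradient clipping described above; the epochs at which the validation error plateaus are assumed to be supplied from outside:

```python
import numpy as np

def step_lr(lr0, epoch, plateau_epochs, factor=0.5):
    """Halve the LR each time the validation error has plateaued."""
    n_drops = sum(epoch >= e for e in plateau_epochs)
    return lr0 * factor ** n_drops

def exp_lr(lr0, epoch, gamma=0.95):
    """Exponentially decaying LR."""
    return lr0 * gamma ** epoch

def clip_gradient(grad, threshold=5.0):
    """Rescale the gradient if its norm exceeds the threshold."""
    norm = np.linalg.norm(grad)
    return grad * (threshold / norm) if norm > threshold else grad

print(step_lr(1e-1, epoch=25, plateau_epochs=[10, 20]))  # 0.1 * 0.5**2 = 0.025
```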

Ensembles

  • Averaging model weights
  • Averaging model responses (see the sketch after this list)
  • Same architecture, different initialization
  • Create a model ensemble from snapshots (checkpoints) of a single model taken during training
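A minimal sketch of averaging model responses; the `.predict` method returning class probabilities is an assumed interface, not a specific library API:

```python
import numpy as np

def ensemble_predict(models, x):
    """Average the class-probability outputs of several models."""
    probs = np.stack([m.predict(x) for m in models])  # stack per-model probabilities
    return probs.mean(axis=0)

# Averaging model weights instead only makes sense for identical architectures,
# e.g. a running average of one model's weights over the last training epochs.
```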
