
Data Augmentation

  • Normalize scale of data
  • Pre-whitening

    $$\text{Assumption: } X \sim \mathcal{N}(\mu, \Sigma)$$

We want to get $\hat{X}$ such that $$\hat{X} \sim \mathcal{N}(0, I)$$

This can be done by:

  1. subtract the empirical mean $$\hat{X}_\mu = X - \mu$$
  2. compute the covariance matrix $\Sigma$ of the centered data, then $$\hat{X} = \Sigma^{-\frac{1}{2}} ~ \hat{X}_\mu$$

Why is this called whitening? The name comes from white noise. If you apply a Fourier transform to white noise, you see a flat spectrum, which indicates that white noise has the same amplitude (energy) at all frequencies.
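A minimal NumPy sketch of the two steps above (the `whiten` helper and the small `eps` regularizer on the eigenvalues are illustrative choices, not part of the notes):

```python
import numpy as np

def whiten(X, eps=1e-8):
    """Pre-whitening: zero mean and (approximately) identity covariance.
    Here rows of X are samples, so we right-multiply by the symmetric Sigma^{-1/2}."""
    mu = X.mean(axis=0)
    X_mu = X - mu                              # step 1: subtract the empirical mean
    Sigma = np.cov(X_mu, rowvar=False)         # step 2: empirical covariance matrix
    # Sigma^{-1/2} via eigendecomposition; eps guards against tiny eigenvalues
    eigvals, eigvecs = np.linalg.eigh(Sigma)
    Sigma_inv_sqrt = eigvecs @ np.diag(1.0 / np.sqrt(eigvals + eps)) @ eigvecs.T
    return X_mu @ Sigma_inv_sqrt

# Whitened data should have ~zero mean and ~identity covariance
X = np.random.randn(1000, 3) * np.array([3.0, 1.0, 0.2]) + np.array([5.0, -2.0, 0.5])
X_hat = whiten(X)
print(X_hat.mean(axis=0).round(3))
print(np.cov(X_hat, rowvar=False).round(3))
```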

Problem setup

  • Very little data, similar dataset
  • Very little data, different datasets
  • Lots of data, similar dataset: fine-tune existing well-trained networks
  • Lots of data, different datasets
  • Fine-tuning existing well-trained models
  • Balancing imbalanced datasets
  • Multi-task learning (combining loss function)
    • Example: for car detection and segmentation, we can combine the two tasks and jointly learn a model that does both (see the sketch after this list).
      The idea is that some of the weights will be shared: for example, the weights that learn edges and other low-level features. $$loss = loss_{det} + \lambda ~ loss_{seg}$$
  • How to handle very large datasets?
    • Fitting the model on a single GPU:
      • use lower-precision computation (float8, float16, float32)
        It has been shown that training typically requires higher precision, while for testing (inference) lower precision is sufficient.
    • Model parallelism: split model across different GPUs
    • Data parallelism: split the data across GPUs
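A minimal sketch of the shared-weights / combined-loss idea from the multi-task bullet above, assuming PyTorch; the `MultiTaskNet` architecture, its head names, and the particular loss functions are illustrative placeholders, not a real detection/segmentation model:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MultiTaskNet(nn.Module):
    """Shared backbone with one head per task; backbone weights are learned jointly."""
    def __init__(self):
        super().__init__()
        self.backbone = nn.Sequential(nn.Conv2d(3, 16, 3, padding=1), nn.ReLU())
        self.det_head = nn.Conv2d(16, 4, 1)   # e.g. box-regression outputs
        self.seg_head = nn.Conv2d(16, 2, 1)   # e.g. per-pixel class scores

    def forward(self, x):
        feats = self.backbone(x)              # shared features (edges, low-level patterns)
        return self.det_head(feats), self.seg_head(feats)

model = MultiTaskNet()
x = torch.randn(2, 3, 32, 32)
det_out, seg_out = model(x)

det_target = torch.randn_like(det_out)         # dummy regression targets
seg_target = torch.randint(0, 2, (2, 32, 32))  # dummy segmentation labels

lam = 0.5  # the lambda that weights loss_seg against loss_det
loss = F.mse_loss(det_out, det_target) + lam * F.cross_entropy(seg_out, seg_target)
loss.backward()  # gradients from both tasks flow into the shared backbone
```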

Weight initialization

  • Initialize the weights so that the dynamics of the activations stay close to normalized

  • Just as we want the input features to be normalized, we also want the outputs of each layer to be normalized

Ideas:

  • Initializing the weights to be all zeros is a *very* bad idea: every output neuron computes the same value and receives the same gradient, so all neurons behave identically.

  • Initialize randomly

  • Weights should not be too small, because it results in vanishing gradients

  • Scale the weights by $\frac{1}{\sqrt{N}}$ (i.e. make their variance $\frac{1}{N}$), where $N$ is the number of inputs to the neuron

Xavier initialization: $n \, Var(w_i) = 1$, i.e. $Var(w_i) = \frac{1}{n}$

In practice, especially with ReLU units, it works better if $Var(w_i) = \frac{2}{n}$ (He initialization)
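A minimal NumPy sketch contrasting the two rules (the layer sizes are arbitrary examples):

```python
import numpy as np

n_in, n_out = 512, 256

# Xavier: n * Var(w) = 1, i.e. std = 1/sqrt(n_in)
W_xavier = np.random.randn(n_in, n_out) / np.sqrt(n_in)

# He initialization (works better with ReLU in practice): Var(w) = 2/n_in
W_he = np.random.randn(n_in, n_out) * np.sqrt(2.0 / n_in)

print(W_xavier.var(), 1.0 / n_in)   # roughly equal
print(W_he.var(), 2.0 / n_in)       # roughly equal
```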

Choosing activation functions

  • If the variance of the weights is too small or too large, both sigmoid and tanh should be avoided (this can be seen in the plot of the sigmoid function: at large $|x|$ values, sigmoid and tanh become very flat, meaning that the gradients are very small)

  • ReLU and its variants are better (see the sketch after this list)

    • ReLU
    • Leaky ReLU / PReLU
    • Randomized ReLU
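A minimal NumPy sketch of these variants; the fixed `alpha` default and the randomized range are illustrative values (PReLU learns `alpha`, and randomized ReLU samples it during training):

```python
import numpy as np

def relu(x):
    return np.maximum(0.0, x)

def leaky_relu(x, alpha=0.01):
    # PReLU has the same form, but alpha is a learned parameter
    return np.where(x > 0, x, alpha * x)

def randomized_relu(x, low=0.125, high=0.333, training=True):
    # Sample alpha per activation at train time, use the mean slope at test time
    alpha = np.random.uniform(low, high, size=x.shape) if training else (low + high) / 2
    return np.where(x > 0, x, alpha * x)
```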

Regularization

  • $L_1$ regularization: $\lambda_1 \|w\|_1$
  • $L_2$ regularization: $\lambda_2 \|w\|_2^2$
  • $L_1 + L_2$ regularization: $\lambda_1 \|w\|_1 + \lambda_2 \|w\|_2^2$
  • Max-norm: constrain $\|w\|_2 \leq c$
  • MaxOut
  • Dropout and DropConnect

When we are overfitting, we can use dropout to effectively reduce the model's capacity (each training step uses a smaller, randomly thinned network).
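A minimal NumPy sketch of an $L_1 + L_2$ penalty and of (inverted) dropout; the function names and default hyperparameters are illustrative choices:

```python
import numpy as np

def regularized_loss(data_loss, W, lam1=1e-4, lam2=1e-4):
    """Add L1 and L2 penalties on the weights to the data loss."""
    return data_loss + lam1 * np.abs(W).sum() + lam2 * np.square(W).sum()

def dropout(a, p=0.5, training=True):
    """Inverted dropout: drop units with probability p at train time and
    rescale by 1/(1-p) so that no change is needed at test time."""
    if not training:
        return a
    mask = (np.random.rand(*a.shape) >= p) / (1.0 - p)
    return a * mask
```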

Optimization Tricks

  • Normalize gradients by mini-batch size
  • Schedule a learning rate (LR):
    • Set initial learning rate to $10^{-1}, 10^{-2}$
    • Once the error reaches a plateau on the validation set, reduce the LR by a factor of 2 and continue (see the sketch after this list)
    • Exponentially decaying LR
  • Early stopping: the training optimum is not the optimum for generalization
  • Gradient clipping:
    • threshold the gradients
    • robustness to noise and outliers
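A minimal NumPy sketch of the step-wise LR schedule, exponential decay, and gradient clipping described above; the epochs at which the validation error plateaus are assumed to be supplied from outside:

```python
import numpy as np

def step_lr(lr0, epoch, plateau_epochs, factor=0.5):
    """Halve the LR each time the validation error has plateaued."""
    n_drops = sum(epoch >= e for e in plateau_epochs)
    return lr0 * factor ** n_drops

def exp_lr(lr0, epoch, gamma=0.95):
    """Exponentially decaying LR."""
    return lr0 * gamma ** epoch

def clip_gradient(grad, threshold=5.0):
    """Rescale the gradient if its norm exceeds the threshold."""
    norm = np.linalg.norm(grad)
    return grad * (threshold / norm) if norm > threshold else grad

print(step_lr(1e-1, epoch=25, plateau_epochs=[10, 20]))  # 0.1 * 0.5**2 = 0.025
```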

Ensembles

  • Averaging model weights
  • Averaging model responses (see the sketch after this list)
  • Same architecture, different initialization
  • Create a model ensemble from snapshots (checkpoints) of a single model taken during training
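A minimal sketch of averaging model responses; the `.predict` method returning class probabilities is an assumed interface, not a specific library API:

```python
import numpy as np

def ensemble_predict(models, x):
    """Average the class-probability outputs of several models."""
    probs = np.stack([m.predict(x) for m in models])  # stack per-model probabilities
    return probs.mean(axis=0)

# Averaging model weights instead only makes sense for identical architectures,
# e.g. a running average of one model's weights over the last training epochs.
```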
