In [ ]:
%install-location $cwd/swift-install
%install '.package(path: "$cwd/FastaiNotebook_02a_why_sqrt5")' FastaiNotebook_02a_why_sqrt5


Installing packages:
	.package(path: "/home/jupyter/notebooks/swift/FastaiNotebook_02a_why_sqrt5")
		FastaiNotebook_02a_why_sqrt5
With SwiftPM flags: []
Working in: /tmp/tmp1on2fw8s/swift-install
[1/6] Compiling FastaiNotebook_02a_why_sqrt5 01_matmul.swift
[2/6] Compiling FastaiNotebook_02a_why_sqrt5 02_fully_connected.swift
[3/6] Compiling FastaiNotebook_02a_why_sqrt5 02a_why_sqrt5.swift
[4/6] Compiling FastaiNotebook_02a_why_sqrt5 00_load_data.swift
[5/6] Compiling FastaiNotebook_02a_why_sqrt5 01a_fastai_layers.swift
[6/7] Merging module FastaiNotebook_02a_why_sqrt5
[7/8] Compiling jupyterInstalledPackages jupyterInstalledPackages.swift
[8/9] Merging module jupyterInstalledPackages
[9/9] Linking libjupyterInstalledPackages.so
Initializing Swift...
Installation complete!

In [ ]:
//export
import Path
import TensorFlow

In [ ]:
import FastaiNotebook_02a_why_sqrt5

Why you need a good init

To understand why initialization is important in a neural net, we'll focus on the basic operation you have there: matrix multiplications. So let's just take a vector x, and a matrix a initialized randomly, then multiply them 100 times (as if we had 100 layers).


In [ ]:
var x = TF(randomNormal: [512, 1])
let a = TF(randomNormal: [512,512])

In [ ]:
for i in 0..<100 { x = a • x }

In [ ]:
(x.mean(),x.std())


Out[ ]:
▿ 2 elements
  - .0 : nan(0x1fffff)
  - .1 : nan(0x1fffff)

The problem you'll get with that is activation explosion: very soon, your activations will go to nan. We can even ask the loop to break when that first happens:


In [ ]:
var x = TF(randomNormal: [512, 1])
let a = TF(randomNormal: [512,512])

In [ ]:
for i in 0..<100 {
    x = a • x
    if x.std().scalarized().isNaN {
        print(i)
        break
    }
}


27

It only takes around 30 multiplications! On the other hand, if you initialize your activations with a scale that is too low, then you'll get another problem:


In [ ]:
var x = TF(randomNormal: [512, 1])
let a = TF(randomNormal: [512,512]) * 0.01

In [ ]:
for i in 0..<100 { x = a • x }

In [ ]:
(x.mean(),x.std())


Out[ ]:
▿ 2 elements
  - .0 : 0.0
  - .1 : 0.0

Here, every activation vanished to 0. So to avoid that problem, people have come up with several strategies to initialize their weight matrices, such as:

  • use a standard deviation that will make sure x and Ax have exactly the same scale
  • use an orthogonal matrix to initialize the weights (orthogonal matrices have the special property that they preserve the L2 norm, so x and Ax would have the same sum of squares in that case)
  • use spectral normalization on the matrix A (the spectral norm of A is the least possible number M such that matmul(A,x).norm() <= M*x.norm(), so dividing A by this M ensures the activations don't explode, though they can still vanish; see the sketch after this list)
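
As a rough illustration of the last idea, here is a sketch (our own, not part of the notebook package; spectralNorm is a hypothetical helper) that estimates the spectral norm of a by power iteration, then divides a by it before repeating the product:

// Sketch only: estimate the spectral norm of `a` by power iteration, then
// divide `a` by it so the repeated products cannot explode. Uses the
// TF = Tensor<Float> alias from the fastai notebooks.
func spectralNorm(_ a: TF, iterations: Int = 20) -> Float {
    var v = TF(randomNormal: [a.shape[1], 1])
    for _ in 0..<iterations {
        v = matmul(a.transposed(), matmul(a, v))
        v = v / sqrt(pow(v, 2).sum())               // keep the iterate normalized
    }
    return sqrt(pow(matmul(a, v), 2).sum()).scalarized()  // ||Av|| with ||v|| = 1
}

var x = TF(randomNormal: [512, 1])
let a = TF(randomNormal: [512, 512])
let aScaled = a / spectralNorm(a)
for _ in 0..<100 { x = matmul(aScaled, x) }
(x.mean(), x.std())   // no explosion, but the activations typically shrink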

The magic number for scaling

Here we will focus on the first one, which is the Xavier initialization. It tells us that we should use a scale equal to 1/sqrt(n_in) where n_in is the number of inputs of our matrix.


In [ ]:
var x = TF(randomNormal: [512, 1])
let a = TF(randomNormal: [512,512]) / sqrt(512)

In [ ]:
for i in 0..<100 { x = a • x }

In [ ]:
(mean: x.mean(), std: x.std())


Out[ ]:
▿ 2 elements
  - mean : 0.061417937
  - std : 1.4370023

And indeed it works. Note that this magic number isn't very far from the 0.01 we had earlier.


In [ ]:
1 / sqrt(512)


Out[ ]:
0.044194173824159216
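
If you plan to reuse this, the scale can be wrapped in a small helper. This is a hypothetical sketch (the function name is ours, not something defined in the notebook package):

// Hypothetical helper: a weight matrix drawn with the 1/sqrt(nIn) scale
// discussed above, using the TF = Tensor<Float> alias.
func xavierTF(_ nIn: Int, _ nOut: Int) -> TF {
    return TF(randomNormal: [nIn, nOut]) / Float(nIn).squareRoot()
}

let w = xavierTF(512, 512)   // same scale as TF(randomNormal: [512, 512]) / sqrt(512)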

But where does it come from? It's not that mysterious if you remember the definition of the matrix multiplication. When we do y = matmul(a, x), the coefficients of y are defined by

$$y_{i} = a_{i,0} x_{0} + a_{i,1} x_{1} + \cdots + a_{i,n-1} x_{n-1} = \sum_{k=0}^{n-1} a_{i,k} x_{k}$$

or in code:

// naive matrix multiplication of a (n x n) with the column vector x (n x 1)
for i in 0..<a.shape[0] {
    for k in 0..<a.shape[1] {
        y[i][0] += a[i][k] * x[k][0]
    }
}

Now at the very beginning, our x vector has a mean of roughly 0 and a standard deviation of roughly 1 (since we picked it that way).


In [ ]:
var x = TF(randomNormal: [512, 1])
(mean: x.mean(), std: x.std())


Out[ ]:
▿ 2 elements
  - mean : -0.013813974
  - std : 0.9675647

NB: This is why it's extremely important to normalize your inputs in Deep Learning: the initialization rules have been designed for inputs with a mean of 0 and a standard deviation of 1.

If you need a refresher from your statistics course, the mean is the sum of all the elements divided by the number of elements (a basic average). The standard deviation shows whether the data points stay close to the mean or are far away from it. It's computed by the following formula:

$$\sigma = \sqrt{\frac{1}{n}\left[(x_{0}-m)^{2} + (x_{1}-m)^{2} + \cdots + (x_{n-1}-m)^{2}\right]}$$

where m is the mean and $\sigma$ (the greek letter sigma) is the standard deviation. To avoid that square root, we also often consider a quantity called the variance, which is $\sigma$ squared.

Here we have a mean of 0, so the variance is just the mean of x squared, and the standard deviation is its square root.
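
As a quick numerical check of these definitions (our own snippet, using a fresh sample so it doesn't clash with the x defined above):

// Compute the mean, variance and standard deviation by hand and compare
// with the .std() helper from the fastai notebooks.
let sample = TF(randomNormal: [512, 1])
let m = sample.mean()
let variance = pow(sample - m, 2).mean()        // σ² = mean of squared deviations
(byHand: sqrt(variance), builtin: sample.std()) // the two should match closely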

If we go back to y = a • x and assume that we chose weights for a that also have a mean of 0, we can compute the variance of y quite easily. Since everything is random and we might get unlucky with a particular draw, we repeat the operation 100 times.


In [ ]:
var mean = Float()
var sqr = Float()
for i in 0..<100 {
    let x = TF(randomNormal: [512, 1])
    let a = TF(randomNormal: [512, 512])
    let y = a • x
    mean += y.mean().scalarized()
    sqr  += pow(y, 2).mean().scalarized()
}
(mean/100, sqr/100)


Out[ ]:
▿ 2 elements
  - .0 : 0.037324883
  - .1 : 512.3489

Now that looks very close to the dimension of our matrix, 512. And that's no coincidence! When you compute y, you sum 512 products of one element of a with one element of x. So what are the mean and the standard deviation of such a product? We can show mathematically that, as long as the elements in a and the elements in x are independent, the mean is 0 and the std is 1.
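
Writing E[.] for the expectation (the mean), this follows directly from independence:

$$\mathbb{E}[a_{i,k} x_{k}] = \mathbb{E}[a_{i,k}]\,\mathbb{E}[x_{k}] = 0, \qquad \mathbb{E}[(a_{i,k} x_{k})^{2}] = \mathbb{E}[a_{i,k}^{2}]\,\mathbb{E}[x_{k}^{2}] = 1 \times 1 = 1$$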

This can also be seen experimentally:


In [ ]:
var mean = Float()
var sqr = Float()
for i in 0..<10000 {
    let x = TF(randomNormal: [])
    let a = TF(randomNormal: [])
    let y = a * x
    mean += y.scalarized()
    sqr  += pow(y, 2).scalarized()
}
(mean/10000,sqrt(sqr/10000))


Out[ ]:
▿ 2 elements
  - .0 : -0.00084701786
  - .1 : 0.9972176

Then we sum 512 of those products, each with a mean of 0 and a variance of 1, so we get something with a mean of 0 and a variance of 512. Going to the standard deviation brings in a square root, hence sqrt(512) being our magic number.

If we scale the weights of the matrix a by dividing them by this sqrt(512), it will give us a y of scale 1, and we can then repeat the product as many times as we want without it exploding or vanishing.
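
In equation form: the 512 terms are independent, so their variances add,

$$\mathrm{Var}\Big(\sum_{k=0}^{511} a_{i,k} x_{k}\Big) = \sum_{k=0}^{511} \mathrm{Var}(a_{i,k} x_{k}) = 512 \qquad \Longrightarrow \qquad \sigma = \sqrt{512}$$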

Adding ReLU in the mix

We can reproduce the previous experiment with a ReLU, to see that this time the mean shifts and the variance becomes 0.5. This time the magic number is sqrt(2/512), to properly scale the weights of the matrix.
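
The factor of 2 comes from the ReLU zeroing out the negative half: for any quantity s that is symmetric around 0 (like our activations before the ReLU), removing the negative half removes half of the second moment,

$$\mathbb{E}\big[\mathrm{relu}(s)^{2}\big] = \tfrac{1}{2}\,\mathbb{E}\big[s^{2}\big]$$

Applied to a single product a*x (second moment 1) this gives 0.5, and applied to the full sum (second moment 512) it gives 512/2; compensating for it in the weights yields the sqrt(2/512) scale. (Strictly speaking the experiments below track the mean of y squared rather than the variance, but the scaling argument is the same.)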


In [ ]:
var mean = Float()
var sqr = Float()
for i in 0..<10000 {
    let x = TF(randomNormal: [])
    let a = TF(randomNormal: [])
    var y = (a*x).scalarized()
    y = y < 0 ? 0 : y
    mean += y
    sqr  += pow(y, 2)
}
(mean: mean/10000, sqr: sqr/10000)


Out[ ]:
▿ 2 elements
  - mean : 0.33342138
  - sqr : 0.537505

We can double-check by running the experiment on the whole matrix product. The variance becomes 512/2 this time:


In [ ]:
var mean = Float()
var sqr = Float()
for i in 0..<100 {
    let x = TF(randomNormal: [512, 1])
    let a = TF(randomNormal: [512, 512])
    var y = a • x
    y = max(y, TF(zeros: y.shape))
    mean += y.mean().scalarized()
    sqr  += pow(y, 2).mean().scalarized()
}
(mean: mean/100, sqr: sqr/100)


Out[ ]:
▿ 2 elements
  - mean : 8.947798
  - sqr : 252.94208

And we can check that scaling the coefficients of a by the magic number gives us activations with a scale of 1.


In [ ]:
var mean = Float()
var sqr = Float()
for i in 0..<100 {
    let x = TF(randomNormal: [512, 1])
    let a = TF(randomNormal: [512, 512]) * sqrt(2/512)
    var y = a • x
    y = max(y, TF(zeros: y.shape))
    mean += y.mean().scalarized()
    sqr  += pow(y, 2).mean().scalarized()
}
(mean: mean/100, sqr: sqr/100)


Out[ ]:
▿ 2 elements
  - mean : 0.5630086
  - sqr : 0.99951345

The math behind it is a tiny bit more complex, and you can find all the details in the Kaiming paper and the Xavier paper, but this gives the intuition behind those results.


In [ ]: