In [ ]:
import torch
To understand why initialization is important in a neural net, we'll focus on the basic operation you have there: matrix multiplications. So let's just take a vector x and a matrix a initialized randomly, then multiply them 100 times (as if we had 100 layers).
In [ ]:
x = torch.randn(512)
a = torch.randn(512,512)
In [ ]:
for i in range(100): x = a @ x
In [ ]:
x.mean(),x.std()
Out[ ]:
The problem you'll get with that is activation explosion: very soon, your activations will go to nan. We can even ask the loop to break when that first happens:
In [ ]:
x = torch.randn(512)
a = torch.randn(512,512)
In [ ]:
for i in range(100):
    x = a @ x
    if x.std() != x.std(): break  # nan != nan, so this stops at the first nan
In [ ]:
i
Out[ ]:
It only takes 27 multiplications! On the other hand, if you initialize your weights with a scale that is too low, then you'll get another problem:
In [ ]:
x = torch.randn(512)
a = torch.randn(512,512) * 0.01
In [ ]:
for i in range(100): x = a @ x
In [ ]:
x.mean(),x.std()
Out[ ]:
Here, every activation vanished to 0. So to avoid that problem, people have come up with several strategies to initialize their weight matrices, such as:
- use a standard deviation that will make sure x and a @ x have exactly the same scale
- use an orthogonal matrix to initialize the weights (orthogonal matrices have the special property that they preserve the L2 norm, so x and a @ x would have the same sum of squares in that case)
- use spectral normalization on the matrix A (the spectral norm of A is the least possible number M such that torch.norm(A@x) <= M*torch.norm(x), so dividing A by this M ensures you don't overflow; you can still vanish with this)

Here we will focus on the first one, which is the Xavier initialization. It tells us that we should use a scale equal to 1/math.sqrt(n_in), where n_in is the number of inputs of our matrix.
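Before moving on, the spectral-normalization option can be sketched in a couple of lines; this is my own illustration (using torch.linalg.svdvals to get the largest singular value M), not something the rest of the notebook relies on:
In [ ]:
a = torch.randn(512,512)
M = torch.linalg.svdvals(a)[0]  # spectral norm of a: its largest singular value
a_sn = a / M                    # now torch.norm(a_sn @ x) <= torch.norm(x) for any x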
In [ ]:
import math
In [ ]:
x = torch.randn(512)
a = torch.randn(512,512) / math.sqrt(512)
In [ ]:
for i in range(100): x = a @ x
In [ ]:
x.mean(),x.std()
Out[ ]:
And indeed it works. Note that this magic number isn't very far from the 0.01 we had earlier.
In [ ]:
1/ math.sqrt(512)
Out[ ]:
But where does it come from? It's not that mysterious if you remember the definition of the matrix multiplication. When we do y = a @ x, the coefficients of y are defined by
$$y_{i} = a_{i,0} x_{0} + a_{i,1} x_{1} + \cdots + a_{i,n-1} x_{n-1} = \sum_{k=0}^{n-1} a_{i,k} x_{k}$$
or in code:
y[i] = sum([c*d for c,d in zip(a[i], x)])
Now at the very beginning, our x vector has a mean of roughly 0. and a standard deviation of roughly 1. (since we picked it that way).
In [ ]:
x = torch.randn(512)
x.mean(), x.std()
Out[ ]:
NB: This is why it's extremely important to normalize your inputs in Deep Learning: the initialization rules have been designed for inputs that have a mean of 0. and a standard deviation of 1.
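In practice that normalization is just subtracting the mean and dividing by the standard deviation; here is a minimal sketch on a made-up batch of inputs (the numbers 10 and 3 are arbitrary):
In [ ]:
inputs = 10 + 3*torch.randn(100, 512)             # made-up inputs with mean ~10 and std ~3
inputs = (inputs - inputs.mean()) / inputs.std()  # now roughly mean 0., std 1.
inputs.mean(), inputs.std()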
If you need a refresher from your statistics course, the mean is the sum of all the elements divided by the number of elements (a basic average). The standard deviation measures whether the data stays close to the mean or, on the contrary, takes values that are far away from it. It's computed by the following formula:
$$\sigma = \sqrt{\frac{1}{n}\left[(x_{0}-m)^{2} + (x_{1}-m)^{2} + \cdots + (x_{n-1}-m)^{2}\right]}$$where m is the mean and $\sigma$ (the greek letter sigma) is the standard deviation. Here we have a mean of 0, so it's just the square root of the mean of x squared.
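Just to tie this formula to code, here is a quick check that writing it by hand matches PyTorch's built-in result (we pass unbiased=False so that torch also divides by n rather than n-1):
In [ ]:
x = torch.randn(512)
m = x.mean()
sigma = ((x - m)**2).mean().sqrt()  # the formula above, written in code
sigma, x.std(unbiased=False)        # the two values should match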
If we go back to y = a @ x and assume that we chose weights for a that also have a mean of 0, we can compute the standard deviation of y quite easily. Since it's random and we might just get unlucky with one draw, we repeat the operation 100 times.
In [ ]:
mean,sqr = 0.,0.
for i in range(100):
    x = torch.randn(512)
    a = torch.randn(512, 512)
    y = a @ x
    mean += y.mean().item()
    sqr += y.pow(2).mean().item()
mean/100,sqr/100
Out[ ]:
Now that looks very close to the dimension of our matrix, 512. And that's no coincidence! When you compute y, you sum 512 products of one element of a with one element of x. So what are the mean and the standard deviation of such a product? We can show mathematically that, as long as the elements in a and the elements in x are independent, the mean is 0 and the std is 1. This can also be seen experimentally:
In [ ]:
mean,sqr = 0.,0.
for i in range(10000):
    x = torch.randn(1)
    a = torch.randn(1)
    y = a*x
    mean += y.item()
    sqr += y.pow(2).item()
mean/10000,sqr/10000
Out[ ]:
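For the record, here is the short derivation behind that claim (assuming, as is the case here, that a and x are independent, each with mean 0 and variance 1):
$$E[a x] = E[a]\,E[x] = 0 \qquad E[(a x)^{2}] = E[a^{2}]\,E[x^{2}] = 1 \times 1 = 1$$
so the product indeed has a mean of 0 and a mean of squares of 1.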
Then we sum 512 of those things that have a mean of zero and a mean of squares of 1, so we get something that has a mean of 0 and a mean of squares of 512, hence math.sqrt(512) being our magic number. If we scale the weights of the matrix a and divide them by this math.sqrt(512), it will give us a y of scale 1, and repeating the product as many times as we want won't overflow or vanish.
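As a quick sanity check (the same experiment as before, just with the weights divided by math.sqrt(512)), the mean of squares of y should now come out close to 1:
In [ ]:
mean,sqr = 0.,0.
for i in range(100):
    x = torch.randn(512)
    a = torch.randn(512, 512) / math.sqrt(512)  # Xavier-scaled weights
    y = a @ x
    mean += y.mean().item()
    sqr += y.pow(2).mean().item()
mean/100,sqr/100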
We can reproduce the small scalar experiment above with a ReLU, to see that this time the mean shifts and the mean of squares becomes 0.5. This time the magic number will be math.sqrt(2/512) to properly scale the weights of the matrix.
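A quick aside on where that factor 2 comes from: the product z = a*x is symmetric around 0, and the ReLU zeroes out the negative half, so $$E[\mathrm{relu}(z)^{2}] = \frac{1}{2}E[z^{2}]$$ and we compensate by multiplying the weights by $\sqrt{2}$, which is exactly the math.sqrt(2/512) scale (the Kaiming/He initialization).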
In [ ]:
mean,sqr = 0.,0.
for i in range(10000):
    x = torch.randn(1)
    a = torch.randn(1)
    y = a*x
    y = 0 if y < 0 else y.item()
    mean += y
    sqr += y ** 2
mean/10000,sqr/10000
Out[ ]:
We can double check by running the experiment on the whole matrix product.
In [ ]:
mean,sqr = 0.,0.
for i in range(100):
    x = torch.randn(512)
    a = torch.randn(512, 512)
    y = a @ x
    y = y.clamp(min=0)
    mean += y.mean().item()
    sqr += y.pow(2).mean().item()
mean/100,sqr/100
Out[ ]:
Or check that scaling the coefficients with the magic number gives us a scale of 1.
In [ ]:
mean,sqr = 0.,0.
for i in range(100):
    x = torch.randn(512)
    a = torch.randn(512, 512) * math.sqrt(2/512)
    y = a @ x
    y = y.clamp(min=0)
    mean += y.mean().item()
    sqr += y.pow(2).mean().item()
mean/100,sqr/100
Out[ ]:
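To close the loop with the 100-multiplication experiment we started from, here is a sketch of my own stacking 100 layers of matrix multiplication each followed by a ReLU, drawing a fresh math.sqrt(2/512)-scaled weight matrix for each layer (as an actual 100-layer network would have); the activations should stay at a reasonable scale instead of exploding to nan or vanishing to 0:
In [ ]:
x = torch.randn(512)
for i in range(100):
    a = torch.randn(512,512) * math.sqrt(2/512)  # fresh Kaiming-scaled weights for this layer
    x = (a @ x).clamp(min=0)                     # linear layer followed by a ReLU
x.mean(),x.std()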