Torch has a package for neural networks called 'nn'. It can be loaded into Torch using:
In [1]:
nn = require 'nn';
Many types of neural networks have been designed over the years. Most of them can be represented as directed acyclic graphs. (As a side note, we'll later see a class of networks, called recurrent neural networks, that can't be represented this way.) The simplest type of neural network is the sequential feed-forward network; multilayer perceptrons and convolutional neural networks fall into that category.
To create a sequential network, we use a container module called 'nn.Sequential'. Other modules can be added to this container to build up a sequential neural network.
A variant of the sequential container is the parallel container, which runs multiple neural network pipelines side by side. And if you want even more power, for example to design recurrent neural networks, you would use another package called 'nngraph'. But we are getting ahead of ourselves.
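Sticking with the sequential container for a moment, here is a minimal sketch of how a small multilayer perceptron could be assembled by stacking modules (the layer sizes and the Tanh non-linearity are arbitrary choices for illustration, not something used later in this notebook):

-- a hypothetical two-layer perceptron: 10 inputs -> 5 hidden units -> 2 outputs
mlp = nn.Sequential()
mlp:add(nn.Linear(10, 5))  -- affine map from 10 to 5 dimensions
mlp:add(nn.Tanh())         -- element-wise non-linearity
mlp:add(nn.Linear(5, 2))   -- affine map from 5 to 2 dimensions

In this notebook, though, we'll build a model with a single Linear module and study it in detail.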
In [2]:
model = nn.Sequential()
In [3]:
model:add(nn.Linear(3,4))
Out[3]:
Ouch, that printed a lot of things. Let's ignore most of it for now and concentrate on the 3 and the 4: the module takes a vector of size 3 and embeds it into a 4-dimensional vector space.
The Linear module is basically just: $$ Y = WX + B$$ where X, Y and B are vectors and W is a matrix of size (output dimension) × (input dimension), i.e. 4 × 3 here.
If you look at the printed internals of the module, you'll see variables for W (called weight) and B (called bias).
Let's verify that it actually works like that. Consider an initialization for X:
In [4]:
X = torch.randn(3)
In [5]:
X
Out[5]:
In [6]:
Y = model:forward(X)
In [7]:
Y
Out[7]:
Now, let's replicate that as a math calculation.
First let's access the weight of that model.
In [8]:
W = model:get(1).weight
In [9]:
W
Out[9]:
The model:get(1) command accesses the first module of model, which is the Linear module we just added. We then access that module's weight variable by chaining '.weight'.
Similarly, we can get the bias of that linear module like this:
In [43]:
B = model:get(1).bias
In [44]:
B
Out[44]:
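Both tensors have exactly the shapes implied by nn.Linear(3,4). As a quick sketch (linearLayer is just a new, hypothetical name for the module we already accessed above):

-- keep a handle to the first module and inspect the tensor shapes
linearLayer = model:get(1)
print(linearLayer.weight:size())  -- 4x3, i.e. outputSize x inputSize
print(linearLayer.bias:size())    -- 4, one bias per output unit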
Now we can actually do some math. The Linear module can be expressed with Torch's tensor operations as follows:
In [45]:
-- The * operator is overloaded to perform matrix multiplication
W*X+B
Out[45]:
Compare this with the output that the module itself stored during the forward pass.
In [13]:
model:get(1).output
Out[13]:
Yes, modules store their most recent output in the output variable. This comes in handy when calculating gradients later on.
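As a quick sketch of what that means, the stored output is exactly the value we just computed with :forward(), so their difference should be a tensor of zeros:

-- the module caches its most recent output; comparing it with Y
print(Y - model:get(1).output)  -- should print all zeros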
Just like the forward pass gave us the output, a backward pass through a module will give us its backpropagated gradient.
Before doing anything, we should clear our gradient parameters. What gradient parameters?
Let's print out the internals of model again:
In [14]:
model
Out[14]:
Do you notice the grad variables? Here's the mapping between these variables and their mathematical expressions: $$\mathrm{gradBias} = \frac{\partial E}{\partial B}, \qquad \mathrm{gradWeight} = \frac{\partial E}{\partial W}, \qquad \mathrm{gradInput} = \frac{\partial E}{\partial X}$$ where E denotes the final energy/error of the model (more on E below).
Torch's :backward() function allows you to calculate the above values easily. However, we need to perform one operation before doing that, which is:
In [15]:
model:zeroGradParameters()
Out[15]:
This is essential because the gradient parameters accumulate across backward passes, so leftover values from an earlier pass would spoil your gradients. Now, let's check out the values of these parameters.
In [16]:
model:get(1).gradBias
Out[16]:
In [17]:
model:get(1).gradWeight
Out[17]:
In [18]:
model:get(1).gradInput
Out[18]:
Now, we perform the backward pass:
In [19]:
gradOutput = torch.Tensor{1,2,3,4}
In [20]:
-- format --> model:backward(input_passed, gradOutput_from_the_layers_above)
model:backward(X, gradOutput)
Out[20]:
The second parameter of backward (gradOutput) contains the error gradient passed down from the layers above this one. Since we don't have any modules above the Linear module, we simply pass an arbitrary vector (the {1, 2, 3, 4} above) for demonstration. The mapping between this and a mathematical expression is: $$gradOutput = \frac{\partial E}{\partial Y}$$ if E is the final Energy/Error of your model.
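In a complete model, gradOutput would normally come from a loss function (a criterion) rather than being typed in by hand. A minimal sketch, assuming a mean-squared-error criterion and a made-up target vector (both hypothetical, not part of the model above):

criterion = nn.MSECriterion()
target = torch.randn(4)                       -- hypothetical target, same size as Y
err = criterion:forward(Y, target)            -- the scalar error E
gradFromLoss = criterion:backward(Y, target)  -- dE/dY, what we'd pass to model:backward

For the checks below we stick with the hand-written gradOutput so the numbers are easy to follow.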
Now, let's confirm that the values are being calculated as per their respective mathematical expressions.
First let's write down the expression for Y. $$Y = \begin{bmatrix} w_{11} x_1 + w_{12} x_2 + w_{13} x_3 + b_1\\ w_{21} x_1 + w_{22} x_2 + w_{23} x_3 + b_2\\ w_{31} x_1 + w_{32} x_2 + w_{33} x_3 + b_3\\ w_{41} x_1 + w_{42} x_2 + w_{43} x_3 + b_4 \end{bmatrix} $$
According to the analytical expression for $\frac{\partial E}{\partial B}$: $$ \frac{\partial E}{\partial B} = \frac{\partial E}{\partial Y} \frac{\partial Y}{\partial B}$$ Now, $\frac{\partial Y}{\partial B}$ is the Jacobian of the vector Y w.r.t. the vector B, which is simply the identity matrix I. Therefore, $$ \frac{\partial E}{\partial B} = \frac{\partial E}{\partial Y} I = \frac{\partial E}{\partial Y}$$
In [21]:
model:get(1).gradBias
Out[21]:
According to the analytical derivation, $$\frac{\partial E}{\partial W} = \frac{\partial E}{\partial Y} \frac{\partial Y}{\partial W} = \frac{\partial E}{\partial Y} X^{\top} = \begin{bmatrix} x_1 \frac{\partial E}{\partial y_1} & x_2 \frac{\partial E}{\partial y_1} & x_3 \frac{\partial E}{\partial y_1} \\ x_1 \frac{\partial E}{\partial y_2} & x_2 \frac{\partial E}{\partial y_2} & x_3 \frac{\partial E}{\partial y_2} \\ x_1 \frac{\partial E}{\partial y_3} & x_2 \frac{\partial E}{\partial y_3} & x_3 \frac{\partial E}{\partial y_3} \\ x_1 \frac{\partial E}{\partial y_4} & x_2 \frac{\partial E}{\partial y_4} & x_3 \frac{\partial E}{\partial y_4} \end{bmatrix}$$ i.e. the outer product of $\frac{\partial E}{\partial Y}$ (as a column vector) and X. We compute this outer product below.
In [46]:
gradOutput:reshape(4,1)*X:reshape(1,3)
Out[46]:
Match this with:
In [25]:
model:get(1).gradWeight
Out[25]:
The analytical expression for $\frac{\partial E}{\partial X}$ is: $$\frac{\partial E}{\partial X} = \frac{\partial E}{\partial Y}\frac{\partial Y}{\partial X} = \begin{bmatrix} \frac{\partial E}{\partial y_1} & \frac{\partial E}{\partial y_2} & \frac{\partial E}{\partial y_3} & \frac{\partial E}{\partial y_4} \end{bmatrix} \times \begin{bmatrix} w_{11} & w_{12} & w_{13} \\ w_{21} & w_{22} & w_{23} \\ w_{31} & w_{32} & w_{33} \\ w_{41} & w_{42} & w_{43} \end{bmatrix} $$
To confirm the above expression, here's the weight matrix:
In [29]:
W
Out[29]:
If we perform the calculation mentioned above (written here in column form, i.e. $W^{\top}\frac{\partial E}{\partial Y}$), we get:
In [40]:
W:transpose(1,2)*gradOutput:reshape(4,1)
Out[40]:
Compare this with:
In [41]:
model:get(1).gradInput
Out[41]:
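Putting it all together, each stored gradient should match its analytical expression exactly. A final sketch, reusing the tensors defined above; every difference printed here should be a tensor of zeros:

-- each line compares a stored gradient with its analytical expression
print(model:get(1).gradBias   - gradOutput)
print(model:get(1).gradWeight - gradOutput:reshape(4,1) * X:reshape(1,3))
print(model:get(1).gradInput  - (W:transpose(1,2) * gradOutput:reshape(4,1)):reshape(3))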