Torch has a package for neural networks called 'nn'. It can be loaded into Torch using:
In [1]:
nn = require 'nn';
Many types of neural networks have been designed over the years. Most of them can be represented as directed acyclic graphs. (As a side note, we'll later see a class of networks, called recurrent neural networks, that can't be represented this way.) The simplest type of neural network is the sequential feed-forward network; multilayer perceptrons and convolutional neural networks fall into that category.
To create a sequential network, we use a container module called 'nn.Sequential'. Other modules can be added to this container to build up a sequential neural network.
A variant of the sequential container is the parallel container, which runs multiple neural network pipelines side by side. And if you want even more power, for example to design recurrent neural networks, you would use another package called 'nngraph'. But we are getting ahead of ourselves.
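Sticking with the sequential container for a moment, here is a minimal sketch of how a small multilayer perceptron could be assembled by stacking modules (the layer sizes and the Tanh non-linearity are arbitrary choices for illustration, not something used later in this notebook):

-- a hypothetical two-layer perceptron: 10 inputs -> 5 hidden units -> 2 outputs
mlp = nn.Sequential()
mlp:add(nn.Linear(10, 5))  -- affine map from 10 to 5 dimensions
mlp:add(nn.Tanh())         -- element-wise non-linearity
mlp:add(nn.Linear(5, 2))   -- affine map from 5 to 2 dimensions

In this notebook, though, we'll build a model with a single Linear module and study it in detail.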
In [2]:
model = nn.Sequential()
In [3]:
model:add(nn.Linear(3,4))
Out[3]:
Ouch, that printed a lot of things. Let's ignore most of it for now and concentrate on the 3 and the 4: the module takes a vector of size 3 and embeds it into a 4-dimensional vector space.
The Linear module is basically just: $$ Y = WX + B$$ where X, Y and B are vectors and W is a matrix of size (output dimension) × (input dimension), i.e. 4 × 3 here.
If you look at the printed internals of the module, you'll see variables for W (called weight) and B (called bias).
Let's verify that it actually works like that. Consider an initialization for X:
In [4]:
X = torch.randn(3)
In [5]:
X
Out[5]:
In [6]:
Y = model:forward(X)
In [7]:
Y
Out[7]:
Now, let's replicate that as a math calculation.
First let's access the weight of that model.
In [8]:
W = model:get(1).weight
In [9]:
W
Out[9]:
The model:get(1) command accesses the first module of model, which is the Linear module we just added. We then access that module's weight variable by chaining '.weight'.
Similarly, we can get the bias of that linear module like this:
In [43]:
B = model:get(1).bias
In [44]:
B
Out[44]:
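Both tensors have exactly the shapes implied by nn.Linear(3,4). As a quick sketch (linearLayer is just a new, hypothetical name for the module we already accessed above):

-- keep a handle to the first module and inspect the tensor shapes
linearLayer = model:get(1)
print(linearLayer.weight:size())  -- 4x3, i.e. outputSize x inputSize
print(linearLayer.bias:size())    -- 4, one bias per output unit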
Now we can actually do some math. The Linear module can be expressed with Torch's tensor operations as follows:
In [45]:
-- The * operator is overloaded to perform matrix multiplication
W*X+B
Out[45]:
Compare this with the output that the module itself stored during the forward pass.
In [13]:
model:get(1).output
Out[13]:
Yes, modules store their most recent output in the output variable. This comes in handy when calculating gradients later on.
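As a quick sketch of what that means, the stored output is exactly the value we just computed with :forward(), so their difference should be a tensor of zeros:

-- the module caches its most recent output; comparing it with Y
print(Y - model:get(1).output)  -- should print all zeros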
Just like the forward pass gave us the output, a backward pass through a module will give us its backpropagated gradient.
Before doing anything, we should clear our gradient parameters. What gradient parameters?
Let's print out the internals of model again:
In [14]:
model
Out[14]:
Do you notice the grad variables? Here's the mapping between these variables and their mathematical expressions: $$\mathrm{gradBias} = \frac{\partial E}{\partial B}, \qquad \mathrm{gradWeight} = \frac{\partial E}{\partial W}, \qquad \mathrm{gradInput} = \frac{\partial E}{\partial X}$$ where E denotes the final energy/error of the model (more on E below).
Torch's :backward() function allows you to calculate the above values easily. However, we need to perform one operation before doing that, which is:
In [15]:
model:zeroGradParameters()
Out[15]:
This is essential because the gradient parameters accumulate across backward passes, so leftover values from an earlier pass would spoil your gradients. Now, let's check out the values of these parameters.
In [16]:
model:get(1).gradBias
Out[16]:
In [17]:
model:get(1).gradWeight
Out[17]:
In [18]:
model:get(1).gradInput
Out[18]:
Now, we perform the backward pass:
In [19]:
gradOutput = torch.Tensor{1,2,3,4}
In [20]:
-- format --> model:backward(input_passed, gradOutput_from_the_layers_above)
model:backward(X, gradOutput)
Out[20]:
The second parameter of backward (gradOutput) contains the error gradient passed down from the layers above this one. Since we don't have any modules above the Linear module, we simply pass an arbitrary vector (the {1, 2, 3, 4} above) for demonstration. The mapping between this and a mathematical expression is: $$gradOutput = \frac{\partial E}{\partial Y}$$ if E is the final Energy/Error of your model.
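In a complete model, gradOutput would normally come from a loss function (a criterion) rather than being typed in by hand. A minimal sketch, assuming a mean-squared-error criterion and a made-up target vector (both hypothetical, not part of the model above):

criterion = nn.MSECriterion()
target = torch.randn(4)                       -- hypothetical target, same size as Y
err = criterion:forward(Y, target)            -- the scalar error E
gradFromLoss = criterion:backward(Y, target)  -- dE/dY, what we'd pass to model:backward

For the checks below we stick with the hand-written gradOutput so the numbers are easy to follow.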
Now, let's confirm that the values are being calculated as per their respective mathematical expressions.
First let's write down the expression for Y. $$Y = \begin{bmatrix} w_{11} x_1 + w_{12} x_2 + w_{13} x_3 + b_1\\ w_{21} x_1 + w_{22} x_2 + w_{23} x_3 + b_2\\ w_{31} x_1 + w_{32} x_2 + w_{33} x_3 + b_3\\ w_{41} x_1 + w_{42} x_2 + w_{43} x_3 + b_4 \end{bmatrix} $$
According to the analytical expression for $\frac{\partial E}{\partial B}$: $$ \frac{\partial E}{\partial B} = \frac{\partial E}{\partial Y} \frac{\partial Y}{\partial B}$$ Now, $\frac{\partial Y}{\partial B}$ is the Jacobian of the vector Y w.r.t. the vector B, which is simply the identity matrix I. Therefore, $$ \frac{\partial E}{\partial B} = \frac{\partial E}{\partial Y} I = \frac{\partial E}{\partial Y}$$
In [21]:
model:get(1).gradBias
Out[21]:
According to the analytical derivation, $$\frac{\partial E}{\partial W} = \frac{\partial E}{\partial Y} \frac{\partial Y}{\partial W} = \frac{\partial E}{\partial Y} X^{\top} = \begin{bmatrix} x_1 \frac{\partial E}{\partial y_1} & x_2 \frac{\partial E}{\partial y_1} & x_3 \frac{\partial E}{\partial y_1} \\ x_1 \frac{\partial E}{\partial y_2} & x_2 \frac{\partial E}{\partial y_2} & x_3 \frac{\partial E}{\partial y_2} \\ x_1 \frac{\partial E}{\partial y_3} & x_2 \frac{\partial E}{\partial y_3} & x_3 \frac{\partial E}{\partial y_3} \\ x_1 \frac{\partial E}{\partial y_4} & x_2 \frac{\partial E}{\partial y_4} & x_3 \frac{\partial E}{\partial y_4} \end{bmatrix}$$ i.e. the outer product of $\frac{\partial E}{\partial Y}$ (as a column vector) and X. We compute this outer product below.
In [46]:
gradOutput:reshape(4,1)*X:reshape(1,3)
Out[46]:
Match this with:
In [25]:
model:get(1).gradWeight
Out[25]:
The analytical expression for $\frac{\partial E}{\partial X}$ is: $$\frac{\partial E}{\partial X} = \frac{\partial E}{\partial Y}\frac{\partial Y}{\partial X} = \begin{bmatrix} \frac{\partial E}{\partial y_1} & \frac{\partial E}{\partial y_2} & \frac{\partial E}{\partial y_3} & \frac{\partial E}{\partial y_4} \end{bmatrix} \times \begin{bmatrix} w_{11} & w_{12} & w_{13} \\ w_{21} & w_{22} & w_{23} \\ w_{31} & w_{32} & w_{33} \\ w_{41} & w_{42} & w_{43} \end{bmatrix} $$
To confirm the above expression, here's the weight matrix:
In [29]:
W
Out[29]:
If we perform the calculation mentioned above (written here in column form, i.e. $W^{\top}\frac{\partial E}{\partial Y}$), we get:
In [40]:
W:transpose(1,2)*gradOutput:reshape(4,1)
Out[40]:
Compare this with:
In [41]:
model:get(1).gradInput
Out[41]:
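Putting it all together, each stored gradient should match its analytical expression exactly. A final sketch, reusing the tensors defined above; every difference printed here should be a tensor of zeros:

-- each line compares a stored gradient with its analytical expression
print(model:get(1).gradBias   - gradOutput)
print(model:get(1).gradWeight - gradOutput:reshape(4,1) * X:reshape(1,3))
print(model:get(1).gradInput  - (W:transpose(1,2) * gradOutput:reshape(4,1)):reshape(3))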