Back Propagation

We do backprop in the vectorized mode (i.e., for a mini-batch).

Import the package and set device


In [56]:
import torch

device = torch.device('cpu')
# device = torch.device('cuda') # Uncomment this to run on GPU

We do not use GPU here. If you want to try it, you need to install pytorch with CUDA first. Please refer to this link for more information.

A linear example

The loss and the network are:

$$ l = agg(Y, T) \quad\text{, where } agg(A, B) = \sum_{i, j} (A_{ij} - B_{ij})^2 $$$$ Y = Z$$$$ Z = X \, W $$

We describe the shapes of the variables below:

  • $l$: (scalar)
  • $Y$: (matrix) $n \times C$; $C$ is the number of output classes
  • $Z$: (matrix) $n \times C$
  • $X$: (matrix) $n \times d$
  • $W$: (matrix) $d \times C$

You should double check if the shapes all fit together.

Set up a tiny instance


In [57]:
n, d, C = 4, 3, 2
DOMAIN = 10

Autograd version


In [58]:
torch.manual_seed(42) # makes debugging and understanding a bit easier. can be commented out later. 

T = torch.randint(DOMAIN, (n, C), device=device, requires_grad=False) # no grad on ground truth (random ints in [0, 10])
X = torch.randint(int(DOMAIN/2.9), (n, d), device=device, requires_grad=False) # no grad on data
W = torch.randint(int(DOMAIN/2.9), (d, C), device=device, requires_grad=True)
print(X)
print(W, '\n')
print(X.mm(W))
print(T)


tensor([[ 1.,  2.,  1.],
        [ 1.,  2.,  0.],
        [ 0.,  1.,  2.],
        [ 1.,  0.,  1.]])
tensor([[ 1.,  2.],
        [ 1.,  2.],
        [ 2.,  1.]]) 

tensor([[ 5.,  7.],
        [ 3.,  6.],
        [ 5.,  4.],
        [ 3.,  3.]])
tensor([[ 2.,  7.],
        [ 6.,  4.],
        [ 6.,  5.],
        [ 0.,  4.]])

Note that if we set requires_grad=True for a Tensor, then any PyTorch operations on that Tensor will cause a computational graph to be constructed, allowing us to later perform backpropagation. If x is a Tensor with requires_grad=True, then after backpropagation x.grad will be another Tensor holding the gradient of x with respect to some scalar value.


In [59]:
# our network (and its output Y for input X)
Z = X.mm(W)
Y = Z


#loss1 = torch.nn.MSELoss(size_average=False)(Y, T) # non-averaged version
loss1 = torch.nn.MSELoss()(Y, T) # averaged version
loss2 = (Y-T).pow(2).sum()/2.0 # just sum all

# the two loss values should differ by a factor of n. 
print(loss1)
print(loss2)


tensor(4.2500)
tensor(17.)

In [60]:
loss2.backward()
#print(Z.grad)
print(W.grad)
print(X.grad)


tensor([[ 3.,  1.],
        [-1.,  3.],
        [ 4., -3.]])
None

Now compare with the manually computed gradient:

$$ \frac{\partial l}{\partial W} = \frac{\partial l}{\partial Y} \frac{\partial Y}{\partial W} = X^{\top} (Y-T) $$

In [61]:
torch.t(X).mm(Y-T)


Out[61]:
tensor([[ 3.,  1.],
        [-1.,  3.],
        [ 4., -3.]])

A non-linear example

The loss and the network are:

$$ l = agg(Y, T) \quad\text{, where } agg(A, B) = \sum_{i, j} (A_{ij} - B_{ij})^2 $$$$ Y = \sigma(Z) $$$$ Z = X \, W $$

$\sigma(\cdot)$ is the usual sigmoid function.

We describe the shapes of the variables below:

  • $l$: (scalar)
  • $Y$: (matrix) $n \times C$; $C$ is the number of output classes
  • $Z$: (matrix) $n \times C$
  • $X$: (matrix) $n \times d$
  • $W$: (matrix) $d \times C$

You should double check if the shapes all fit together. If you do this, you will see that

  • The non-linear transformation, $\sigma$, is applied element-wise to the input matrix.


In [62]:
#reset the gradients
W.grad.data.zero_()

sigma = torch.sigmoid

# our network (and its output Y for input X)
Z = X.mm(W)
Y = sigma(Z)

loss1 = torch.nn.MSELoss()(Y, T) # averaged version
loss2 = (Y-T).pow(2).sum()/2.0 # just sum all

# the two loss values should differ by a factor of n. 
print(loss1)
print(loss2)


tensor(15.3651)
tensor(61.4605)

In [63]:
loss2.backward()
#print(Z.grad)
print(W.grad)
print(X.grad)


tensor([[-0.1917, -0.1505],
        [-0.5027, -0.0967],
        [-0.0302, -0.2851]])
None

Now compare with the manually computed gradient:

$$ \frac{\partial l}{\partial W} = \frac{\partial l}{\partial Y} \frac{\partial Y}{\partial Z} \frac{\partial Y}{\partial W} = X^{\top} (Y \circ (1-Y) \circ (Y-T)) $$

where $\circ$ is the Hadamard product (https://en.wikipedia.org/wiki/Hadamard_product_(matrices))


In [64]:
gradient_w = torch.t(X).mm(Y * (1-Y) * (Y-T))
gradient_w


Out[64]:
tensor([[-0.1917, -0.1505],
        [-0.5027, -0.0967],
        [-0.0302, -0.2851]])

In [ ]:

Exercise: Gradient checking

You can read more about Gradient Checking. You need to implement it and check against the gradient_w computed above.

If you use EPSILON = 1e-4, you should get sth close to the following, where the three columns are:

  • gradient_w[i]
  • numerically computed gradient_w[i]
  • difference of the two
-0.19160029292106628    -0.19073486328125   -0.0008654296398162842
-0.15006516873836517    -0.171661376953125  0.021596208214759827
-0.5024182796478271 -0.514984130859375  0.012565851211547852
-0.09660282731056213    -0.095367431640625  -0.0012353956699371338
-0.03004339337348938    -0.03814697265625   0.00810357928276062
-0.28447985649108887    -0.286102294921875  0.0016224384307861328

In [65]:
DEBUG = 1
epsilon = 1e-4


# input W, output the loss2
def f(WW): 
    Z = X.mm(WW)
    Y = sigma(Z)
    return (Y-T).pow(2).sum()/2.0 # just sum all
    

# gradient wrt W
numerical_gradient = []
with torch.no_grad(): # prevent accumulating the gradient. 
    
    for i in range(d * C): 
        # create perturbation in one axis
        perturbation = torch.zeros(d, C) # same size as W
        perturbation.view(d * C)[i] = 1.0
        perturbation = perturbation * epsilon 
        # 
        if DEBUG: 
            print(perturbation)

        # two loss2 values
        loss2_left  = f(W - perturbation)
        loss2_right = f(W + perturbation)

        # numeric gradient in one axis:
        val = (loss2_right.item() - loss2_left.item() ) / (2 * epsilon)
        numerical_gradient.append(val)
    # end for
    
    # print out
    grad_w = torch.t(X).mm(Y * (1-Y) * (Y-T)).reshape(d*C) # redo it to make the code self-contained.
    grad_w_numpy = grad_w.data.numpy()
    for i in range(d * C):
        print('{:>10}\t{:>10}\t{:>10}'.format(grad_w_numpy[i], 
                                              numerical_gradient[i], 
                                              grad_w_numpy[i]-numerical_gradient[i]))


tensor(1.00000e-05 *
       [[10.0000,  0.0000],
        [ 0.0000,  0.0000],
        [ 0.0000,  0.0000]])
tensor(1.00000e-05 *
       [[ 0.0000, 10.0000],
        [ 0.0000,  0.0000],
        [ 0.0000,  0.0000]])
tensor(1.00000e-05 *
       [[ 0.0000,  0.0000],
        [10.0000,  0.0000],
        [ 0.0000,  0.0000]])
tensor(1.00000e-05 *
       [[ 0.0000,  0.0000],
        [ 0.0000, 10.0000],
        [ 0.0000,  0.0000]])
tensor(1.00000e-05 *
       [[ 0.0000,  0.0000],
        [ 0.0000,  0.0000],
        [10.0000,  0.0000]])
tensor(1.00000e-05 *
       [[ 0.0000,  0.0000],
        [ 0.0000,  0.0000],
        [ 0.0000, 10.0000]])
-0.19168421626091003	-0.19073486328125	-0.0009493529796600342
-0.1505398154258728	-0.171661376953125	0.021121561527252197
-0.5027213096618652	-0.514984130859375	0.012262821197509766
-0.09670328348875046	-0.095367431640625	-0.0013358518481254578
-0.03022773563861847	-0.03814697265625	0.00791923701763153
-0.28507155179977417	-0.286102294921875	0.00103074312210083

In [ ]:


In [ ]: