We do backprop in vectorized mode (i.e., for a whole mini-batch at once).
In [56]:
import torch
device = torch.device('cpu')
# device = torch.device('cuda') # Uncomment this to run on GPU
We do not use a GPU here. If you want to try it, you need to install PyTorch with CUDA support first; please refer to the PyTorch installation documentation for more information.
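If you do want to try the GPU later, a common pattern (not used in this notebook) is to fall back to the CPU automatically:
In [ ]:
# hypothetical alternative: pick the GPU only when CUDA is actually available
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
print(device)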
The loss and the network are:
$$ l = agg(Y, T), \quad\text{where } agg(A, B) = \frac{1}{2}\sum_{i, j} (A_{ij} - B_{ij})^2 $$
$$ Y = Z $$
$$ Z = X \, W $$
The shapes of the variables are: $X$ is $n \times d$, $W$ is $d \times C$, and $Z$, $Y$, $T$ are all $n \times C$.
You should double-check that the shapes all fit together.
In [57]:
n, d, C = 4, 3, 2
DOMAIN = 10
Autograd version
In [58]:
torch.manual_seed(42) # makes debugging and understanding a bit easier; can be commented out later.
T = torch.randint(DOMAIN, (n, C), device=device, requires_grad=False).float()          # no grad on ground truth (random ints in [0, DOMAIN))
X = torch.randint(int(DOMAIN/2.9), (n, d), device=device, requires_grad=False).float() # no grad on data
W = torch.randint(int(DOMAIN/2.9), (d, C), device=device).float().requires_grad_(True) # gradients are tracked for the weights (float dtype is required)
print(X)
print(W, '\n')
print(X.mm(W))
print(T)
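To double-check the shapes listed above, you can print them directly (a quick sanity check, not part of the original flow):
In [ ]:
# X: (n, d), W: (d, C), so X @ W and T must both be (n, C)
print(X.shape, W.shape, X.mm(W).shape, T.shape)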
Note that if we set requires_grad=True for a Tensor, then any PyTorch operation on that Tensor will cause a computational graph to be constructed, allowing us to later perform backpropagation through it. If x is a Tensor with requires_grad=True, then after backpropagation x.grad will be another Tensor holding the gradient of some scalar value (here, the loss) with respect to x.
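As a tiny, self-contained illustration of this behaviour (the names x and out below are only for this example):
In [ ]:
# x requires grad, so the operations on it are recorded in the autograd graph
x = torch.ones(2, 2, requires_grad=True)
out = (3 * x).sum()   # a scalar built from x
out.backward()        # populates x.grad
print(x.grad)         # d(out)/dx = 3 for every entry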
In [59]:
# our network (and its output Y for input X)
Z = X.mm(W)
Y = Z
#loss1 = torch.nn.MSELoss(reduction='sum')(Y, T) # summed (non-averaged) version
loss1 = torch.nn.MSELoss()(Y, T) # averaged version (mean over all n*C entries)
loss2 = (Y-T).pow(2).sum()/2.0 # sum of squared errors, halved
# the two loss values differ by a factor of n*C/2 (here 4*2/2 = 4).
print(loss1)
print(loss2)
In [60]:
loss2.backward()
#print(Z.grad)  # Z is not a leaf tensor, so its .grad is not retained
print(W.grad)
print(X.grad)   # prints None: X was created with requires_grad=False
Now compare with the manually computed gradient:
$$ \frac{\partial l}{\partial W} = \frac{\partial l}{\partial Y} \frac{\partial Y}{\partial W} = X^{\top} (Y-T) $$
In [61]:
torch.t(X).mm(Y-T)
Out[61]:
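As a quick sanity check (this cell is not part of the original exercise), the autograd gradient and the manual formula can be compared directly:
In [ ]:
# should print True: autograd and the manual formula agree up to floating-point error
print(torch.allclose(W.grad, torch.t(X).mm(Y - T)))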
The loss and the network are:
$$ l = agg(Y, T), \quad\text{where } agg(A, B) = \frac{1}{2}\sum_{i, j} (A_{ij} - B_{ij})^2 $$
$$ Y = \sigma(Z) $$
$$ Z = X \, W $$
$\sigma(\cdot)$ is the usual sigmoid function, applied elementwise.
The shapes of the variables are the same as before: $X$ is $n \times d$, $W$ is $d \times C$, and $Z$, $Y$, $T$ are all $n \times C$.
You should double-check that the shapes all fit together. If you do this, you will see that the elementwise sigmoid does not change the shape of $Z$.
In [62]:
# reset the gradient accumulated by the previous backward pass
W.grad.zero_()
sigma = torch.sigmoid
# our network (and its output Y for input X)
Z = X.mm(W)
Y = sigma(Z)
loss1 = torch.nn.MSELoss()(Y, T) # averaged version (mean over all n*C entries)
loss2 = (Y-T).pow(2).sum()/2.0 # sum of squared errors, halved
# the two loss values differ by a factor of n*C/2 (here 4*2/2 = 4).
print(loss1)
print(loss2)
In [63]:
loss2.backward()
#print(Z.grad)
print(W.grad)
print(X.grad)
Now compare with the manually computed gradient:
$$ \frac{\partial l}{\partial W} = \frac{\partial l}{\partial Y} \frac{\partial Y}{\partial Z} \frac{\partial Z}{\partial W} = X^{\top} \bigl(Y \circ (1-Y) \circ (Y-T)\bigr) $$
where $\circ$ is the Hadamard product (https://en.wikipedia.org/wiki/Hadamard_product_(matrices)).
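The $Y \circ (1-Y)$ factor comes from the derivative of the sigmoid, which can be written in terms of its own output:
$$ \sigma'(z) = \sigma(z)\,(1-\sigma(z)) \quad\Longrightarrow\quad \frac{\partial Y}{\partial Z} = Y \circ (1-Y) \text{ (elementwise)}. $$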
In [64]:
gradient_w = torch.t(X).mm(Y * (1-Y) * (Y-T))
gradient_w
Out[64]:
You can read more about Gradient Checking. You need to implement it and check against the gradient_w computed above. If you use EPSILON = 1e-4, you should get something close to the following, where the three columns are the analytic gradient, the numerical gradient, and their difference:
-0.19160029292106628 -0.19073486328125 -0.0008654296398162842
-0.15006516873836517 -0.171661376953125 0.021596208214759827
-0.5024182796478271 -0.514984130859375 0.012565851211547852
-0.09660282731056213 -0.095367431640625 -0.0012353956699371338
-0.03004339337348938 -0.03814697265625 0.00810357928276062
-0.28447985649108887 -0.286102294921875 0.0016224384307861328
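The loop below approximates each entry of the gradient with a central difference:
$$ \left[\frac{\partial l}{\partial W}\right]_i \approx \frac{f(W + \epsilon\, e_i) - f(W - \epsilon\, e_i)}{2\epsilon}, $$
where $e_i$ is the one-hot perturbation built inside the loop and $f$ maps $W$ to loss2.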
In [65]:
DEBUG = 1
epsilon = 1e-4

# input W, output the loss2
def f(WW):
    Z = X.mm(WW)
    Y = sigma(Z)
    return (Y-T).pow(2).sum()/2.0  # sum of squared errors, halved

# numerical gradient wrt W
numerical_gradient = []
with torch.no_grad():  # do not record these operations in the autograd graph
    for i in range(d * C):
        # create a perturbation along one axis
        perturbation = torch.zeros(d, C)  # same size as W
        perturbation.view(d * C)[i] = 1.0
        perturbation = perturbation * epsilon
        #
        if DEBUG:
            print(perturbation)
        # the two perturbed loss2 values
        loss2_left = f(W - perturbation)
        loss2_right = f(W + perturbation)
        # numerical gradient along this axis (central difference):
        val = (loss2_right.item() - loss2_left.item()) / (2 * epsilon)
        numerical_gradient.append(val)
    # end for

# print out
grad_w = torch.t(X).mm(Y * (1-Y) * (Y-T)).reshape(d*C)  # redo it to make the code self-contained.
grad_w_numpy = grad_w.data.numpy()
for i in range(d * C):
    print('{:>10}\t{:>10}\t{:>10}'.format(grad_w_numpy[i],
                                          numerical_gradient[i],
                                          grad_w_numpy[i] - numerical_gradient[i]))
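PyTorch also ships a built-in numerical checker, torch.autograd.gradcheck. It is not used in this exercise, but a minimal sketch of how it could verify the same gradient looks like this (gradcheck expects double-precision inputs, so we cast first; the names W64, X64, T64 and loss_fn are just for this sketch):
In [ ]:
# minimal sketch using torch.autograd.gradcheck (not part of the original exercise)
W64 = W.detach().double().requires_grad_(True)
X64, T64 = X.detach().double(), T.detach().double()

def loss_fn(WW):
    # same network and loss as above, just in double precision
    return (torch.sigmoid(X64.mm(WW)) - T64).pow(2).sum() / 2.0

print(torch.autograd.gradcheck(loss_fn, (W64,), eps=1e-6, atol=1e-4))  # True if the gradients match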