Update for PyTorch 0.4: Earlier versions used Variable to wrap tensors with different properties. Since version 0.4, Variable has been merged with Tensor; in other words, Variable is no longer needed. The requires_grad flag can be set directly on a tensor. This post has been updated accordingly.
Having heard the announcement about Theano from the Bengio lab, as a Theano user I am both happy and sad to see the old hero fade, pushed aside by many rising stars. Sad to see that it is too old to compete with its industrial competitors, and happy to have so many excellent deep learning frameworks to choose from. Recently I started translating some of my old code to PyTorch and have been really impressed by its dynamic nature and clarity. But at the very beginning, I was very confused by the backward() function when reading the tutorials and documentation. This motivated me to write this post to make things a bit easier for other PyTorch beginners. I'll assume that you already know the autograd module and what a tensor (a Variable before 0.4) is, but are a little confused by the definition of backward().
First let's recall how gradients are defined mathematically. For an independent variable $x$ (scalar or vector), an arbitrary operation on $x$ gives $y = f(x)$. Then the gradient of $y$ w.r.t. the components $x_i$ is $$\nabla y=\begin{bmatrix} \frac{\partial y}{\partial x_1}\\ \frac{\partial y}{\partial x_2}\\ \vdots \end{bmatrix}.$$ For a specific point $x=[X_1, X_2, \dots]$, we get the gradient of $y$ at that point as a vector. With these notions in mind, the following things are a bit confusing at the beginning:
1. Mathematically, we would say "the gradient of a function w.r.t. the independent variables", whereas .grad is attached to the leaf tensors. In Theano and TensorFlow, the computed gradients are stored separately in a variable. But after a moment of adjustment, this is fairly easy to accept. In PyTorch it is also possible to get the gradient of intermediate tensors with the help of the register_hook function.
2. The grad_variables parameter of the function torch.autograd.backward(variables, grad_tensors=None, retain_graph=None, create_graph=None, retain_variables=None, grad_variables=None) is not self-explanatory. Note that grad_variables is deprecated; use grad_tensors instead.
3. What is retain_graph doing?
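To make the first point concrete, here is a minimal sketch of reading the gradient of an intermediate tensor through register_hook. The fixed input value and the `captured` dictionary are my own choices for illustration, not part of the original notebook.

```python
import torch

x = torch.tensor([[1.5]], requires_grad=True)  # leaf tensor (value chosen for illustration)
y = 2 * x                                      # intermediate (non-leaf) tensor
z = y ** 3

captured = {}  # hypothetical container that stores what the hook receives

# The hook is called with dz/dy when the backward pass reaches y.
# dict.update returns None, so the gradient itself is left unchanged.
handle = y.register_hook(lambda grad: captured.update(y=grad))

z.backward()
handle.remove()  # detach the hook once it is no longer needed

print('dz/dy seen by the hook:', captured['y'])  # equals 3 * y**2
print('dz/dx stored on the leaf:', x.grad)       # equals 24 * x**2
```

The hook only observes the gradient flowing through y; the value on the leaf is still read from x.grad as usual.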
In [1]:
import torch as T
import torch.autograd
import numpy as np
In [2]:
'''
Define a scalar variable, set requires_grad to be true to add it to backward path for computing gradients
It is actually very simple to use backward()
first define the computation graph, then call backward()
'''
x = T.randn(1, 1, requires_grad=True) #x is a leaf created by user, thus grad_fn is none
print('x', x)
#define an operation on x
y = 2 * x
print('y', y)
#define one more operation to check the chain rule
z = y ** 3
print('z', z)
These simple operations define a forward path $z=(2x)^3$. $z$ is the final output tensor whose gradient we would like to compute, $dz=24x^2dx$, and it is the tensor that gets passed as the tensors parameter of the backward() function.
In [3]:
#yes, it is just as simple as this to compute gradients:
z.backward()
In [9]:
print('z gradient:', z.grad)
print('y gradient:', y.grad)
print('x gradient:', x.grad, 'Requires gradient?', x.grad.requires_grad) # note that x.grad is also a tensor
The gradients of both $y$ and $z$ are None, since backward() only stores the gradient on the leaves, which is $x$ in this case. At the very beginning, I was expecting something like this:
x gradient: None
y gradient: None
z gradient: tensor([11.6105])
since the gradient is calculated for the final output $z$.
With a moment of thought, we can see that this would be practically chaos if $x$ were a multi-dimensional vector. x.grad should be interpreted as the gradient of $z$ at $x$, which is why it has the same shape as $x$.
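As a quick sanity check of this interpretation, the value stored in x.grad can be compared with the derivative worked out by hand, $24x^2$. This is only a small sketch; the name `analytic` is mine.

```python
import torch

x = torch.randn(1, 1, requires_grad=True)  # leaf tensor
z = (2 * x) ** 3                           # same forward path as above
z.backward()

# dz/dx derived by hand: 3 * (2x)^2 * 2 = 24 * x^2
analytic = 24 * x.detach() ** 2

print('autograd :', x.grad)
print('by hand  :', analytic)
print('match    :', torch.allclose(x.grad, analytic))
```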
Keep the same forward path, then call backward() with only retain_graph set to True.
In [48]:
x = T.randn(1, 1, requires_grad=True) #x is a leaf created by user, thus grad_fn is none
print('x', x)
#define an operation on x
y = 2 * x
#define one more operation to check the chain rule
z = y ** 3
z.backward(retain_graph=True)
print('Keeping the default value of grad_tensors gives')
print('z gradient:', z.grad)
print('y gradient:', y.grad)
print('x gradient:', x.grad)
Now test the explicit default value, which should give the same result. When reusing a retained graph, DO NOT forget to zero the gradient before recomputing it, because gradients are accumulated into .grad.
In [49]:
x.grad.data.zero_()
z.backward(T.Tensor([[1]]), retain_graph=True)
print('Set grad_tensors to 1 gives')
print('z gradient:', z.grad)
print('y gradient:', y.grad)
print('x gradient:', x.grad)
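The reason for the zeroing step is that each backward() call adds to x.grad instead of overwriting it. A small sketch of that behaviour follows, with a fixed input so the numbers are easy to follow; the names `first`, `second` and `third` are only for illustration.

```python
import torch

x = torch.tensor([[1.0]], requires_grad=True)
z = (2 * x) ** 3  # dz/dx = 24 * x**2, i.e. 24 at x = 1

z.backward(retain_graph=True)
first = x.grad.clone()

z.backward(retain_graph=True)  # no zeroing in between: the new gradient is added
second = x.grad.clone()

x.grad.zero_()                 # clear the accumulated value
z.backward(retain_graph=True)
third = x.grad.clone()

print(first, second, third)
```

With $x=1$ the three values are 24, 48 and 24.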
Then what about other values? Let's try 0.1 and 0.5.
In [54]:
x.grad.data.zero_()
z.backward(T.Tensor([[0.1]]), retain_graph=True)
print('Set grad_tensors to 0.1 gives')
print('z gradient:', z.grad)
print('y gradient:', y.grad)
print('x gradient:', x.grad)
In [55]:
x.grad.data.zero_()
z.backward(T.FloatTensor([[0.5]]), retain_graph=True)
print('Set grad_tensors to 0.5 gives')
print('z gradient', z.grad)
print('y gradient', y.grad)
print('x gradient', x.grad)
It looks like the elements of grad_tensors act as scaling factors. Now let's set $x$ to be a $2\times 2$ matrix. Note that $z$ will also be a matrix. (Always use the latest version; backward() has been improved a lot since earlier versions and has become much easier to understand.)
In [67]:
x = T.randn(2, 2, requires_grad=True) #x is a leaf created by user, thus grad_fn is none
print('x', x)
#define an operation on x
y = 2 * x
#define one more operation to check the chain rule
z = y ** 3
print('z shape:', z.size())
z.backward(T.FloatTensor([[1, 1], [1, 1]]), retain_graph=True)
print('x gradient for its all elements:\n', x.grad)
print()
x.grad.data.zero_() #the gradient for x will be accumulated, it needs to be cleared.
z.backward(T.FloatTensor([[0, 1], [0, 1]]), retain_graph=True)
print('x gradient for the second column:\n', x.grad)
print()
x.grad.data.zero_()
z.backward(T.FloatTensor([[1, 1], [0, 0]]), retain_graph=True)
print('x gradient for the first row:\n', x.grad)
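We can clearly see that the gradients of $z$ are computed w.r.t. each element of $x$ separately, because the operations are all element-wise, and that each entry of grad_tensors scales (or, when it is 0, masks out) the corresponding entry of the gradient.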
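One way to read this, as far as I understand it, is that z.backward(v) gives the same result as first reducing to the scalar sum(v * z) and then calling backward() on that scalar. The sketch below compares the two routes; the weight tensor `v` and the variable names are my own choices.

```python
import torch

x = torch.randn(2, 2, requires_grad=True)
z = (2 * x) ** 3

# Hypothetical weights: keep only the second column, as in the cell above.
v = torch.tensor([[0.0, 1.0], [0.0, 1.0]])

# Route 1: pass v as the gradient argument of backward().
z.backward(v, retain_graph=True)
grad_via_argument = x.grad.clone()

# Route 2: reduce to a scalar with the same weights, then call backward().
x.grad.zero_()
(z * v).sum().backward()
grad_via_sum = x.grad.clone()

# Hand-derived value: each element gets v_ij * 24 * x_ij**2.
expected = v * 24 * x.detach() ** 2

print(grad_via_argument)
print(grad_via_sum)
print(expected)
```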
Then what if we make the output a scalar while $x$ is two-dimensional? This is a much-simplified version of what happens in a neural network, where a scalar loss is computed from many inputs. $$f(x)=\frac{1}{n}\sum_{i=1}^n(2x_i)^3$$ $$\frac{\partial f}{\partial x_i}=\frac{1}{n}\,24x_i^2$$
In [77]:
x = T.randn(2, 2, requires_grad=True) #x is a leaf created by user, thus grad_fn is none
print('x', x)
#define an operation on x
y = 2 * x
#print('y', y)
#define one more operation to check the chain rule
z = y ** 3
out = z.mean()
print('out', out)
out.backward(retain_graph=True)
print('x gradient:\n', x.grad)
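The printed gradient can be checked against the hand-derived value: for the mean over $n$ elements, each element of $x$ should receive $24x_i^2/n$, which is $6x_i^2$ for a $2\times 2$ input. A minimal sketch, with `n` and `expected` as illustrative names:

```python
import torch

x = torch.randn(2, 2, requires_grad=True)
out = ((2 * x) ** 3).mean()  # scalar output
out.backward()               # no gradient argument needed for a scalar

n = x.numel()                          # n = 4 here
expected = 24 * x.detach() ** 2 / n    # derivative of the mean, element by element

print('autograd:\n', x.grad)
print('by hand:\n', expected)
```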
We will get a complaint if a grad_tensors argument whose shape does not match the scalar output is passed.
In [78]:
x.grad.data.zero_()
out.backward(T.FloatTensor([[1, 1], [1, 1]]), retain_graph=True)
print('x gradient', x.grad)
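To look at the complaint without stopping the notebook, the call can be wrapped in try/except. This sketch assumes a recent PyTorch version, where the shape mismatch is reported as a RuntimeError; the exact message may differ between versions. It also shows that a gradient argument with the same (scalar) shape as the output is accepted and again acts as a scaling factor.

```python
import torch

x = torch.randn(2, 2, requires_grad=True)
out = ((2 * x) ** 3).mean()  # scalar output

try:
    out.backward(torch.ones(2, 2))  # shape (2, 2) does not match the scalar out
    mismatch_raised = False
except RuntimeError as err:
    mismatch_raised = True
    print('RuntimeError:', err)

# A gradient with the same shape as out (a 0-dim tensor) works.
out.backward(torch.tensor(2.0))
print(x.grad)  # 2 * 24 * x**2 / 4 = 12 * x**2
```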
In [82]:
x = T.randn(2, 2, requires_grad=True) #x is a leaf created by user, thus grad_fn is none
print('x', x)
#define an operation on x
y = 2 * x
#print('y', y)
#define one more operation to check the chain rule
z = y ** 3
out = z.mean()
print('out', out)
out.backward() #without retain_graph=True, the first call to backward() works fine
print('x gradient', x.grad)
x.grad.data.zero_()
out.backward() #the second call raises a RuntimeError: the graph buffers were freed after the first backward()
print('x gradient', x.grad)
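So retain_graph decides whether the graph is kept after a backward pass. The following sketch puts the two cases side by side: without retain_graph the second call fails, with it the second call succeeds and its gradient is added to the first one. The flag name `second_call_failed` is only for illustration.

```python
import torch

x = torch.randn(2, 2, requires_grad=True)
out = ((2 * x) ** 3).mean()

out.backward()  # the graph buffers are freed after this call
try:
    out.backward()  # nothing left to trace back through
    second_call_failed = False
except RuntimeError as err:
    second_call_failed = True
    print('RuntimeError:', err)

# Rebuild the graph and keep it alive for a second pass.
x.grad.zero_()
out = ((2 * x) ** 3).mean()
out.backward(retain_graph=True)
out.backward()  # works; the gradient is accumulated a second time

print(x.grad)   # twice the single-pass gradient: 2 * 6 * x**2
```

In a training loop the forward pass is normally rebuilt at every step, so retain_graph=True is only needed when the same graph really has to be traversed more than once.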