In machine learning, we train models to get better and better as a function of experience.
Usually, getting better means minimizing a loss function, i.e. a score that answers the question: how bad is our model?
With neural networks, we choose loss functions to be differentiable with respect to our parameters.
Put simply, this means that for each of the model's parameters, we can determine how much increasing or decreasing that parameter would affect the loss.
While the calculation is straightforward, working it out by hand for a complex model can be a pain.
MXNet's autograd package expedites this work by automatically calculating derivatives.
And while most other libraries require that we compile a symbolic graph to take automatic derivatives, mxnet.autograd, like PyTorch, lets us take derivatives while writing ordinary imperative code.
Every time you make a pass through your model, autograd builds a graph on the fly, through which it can immediately backpropagate gradients.
Let's go through it step by step.
For this tutorial, we'll only need to import mxnet.ndarray and mxnet.autograd.


In [1]:
import mxnet as mx
from mxnet import nd, autograd
mx.random.seed(1)

Attaching gradients

As a toy example, let's say that we are interested in differentiating a function f = 2 * (x ** 2) with respect to parameter x.
We can start by assigning an initial value to x.


In [2]:
x = nd.array([[1, 2], [3, 4]])
x


Out[2]:
[[1. 2.]
 [3. 4.]]
<NDArray 2x2 @cpu(0)>

Once we compute the gradient of f with respect to x, we'll need a place to store it.
In MXNet, we can tell an NDArray that we plan to store a gradient by invoking its attach_grad() method.


In [3]:
# attach_grad() returns None; it just allocates storage for x's gradient
x.attach_grad()
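As a side note, the gradient buffer is allocated right away, so you can inspect x.grad even before any backward pass; on the MXNet versions we've used it starts out as zeros (treat this as an assumption worth checking on your own install):

# x.grad exists as soon as attach_grad() is called; expect an all-zero 2x2 array here
print(x.grad)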

Now we're going to compute f, building it up in two steps as y = x * 2 and z = y * x, and MXNet will generate a computation graph on the fly.
It's as if MXNet turned on a recording device and captured the exact path by which each variable was generated.
Because building the computation graph requires a nontrivial amount of computation, MXNet will only build the graph when explicitly told to do so.
We can instruct MXNet to start recording by placing code inside a with autograd.record(): block.
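If you ever want to confirm whether recording is active, autograd exposes an is_recording() helper (present in the MXNet versions we've worked with; worth double-checking on yours):

print(autograd.is_recording())      # False: nothing outside the block is recorded
with autograd.record():
    print(autograd.is_recording())  # True: operations here are added to the graph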


In [4]:
with autograd.record():
    y = x * 2
    z = y * x
    
print(x)
print(y)
print(z)


[[1. 2.]
 [3. 4.]]
<NDArray 2x2 @cpu(0)>

[[2. 4.]
 [6. 8.]]
<NDArray 2x2 @cpu(0)>

[[ 2.  8.]
 [18. 32.]]
<NDArray 2x2 @cpu(0)>

Backpropagation time.
Note that when z has more than one entry, z.backward() is equivalent to mx.nd.sum(z).backward(); we'll verify that below.


In [5]:
# backward() returns None; the gradient is written into x.grad
z.backward()

Now let's work out what to expect.
Remember that y = x * 2 and z = x * y, so z equals 2 * x * x.
Differentiating, dy/dx = 2 and dz/dx = 4 * x.
After backpropagating with z.backward(), x.grad should hold dz/dx, i.e. an NDArray with the values [[4, 8], [12, 16]].


In [6]:
print(x.grad)


[[ 4.  8.]
 [12. 16.]]
<NDArray 2x2 @cpu(0)>
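To verify the earlier claim that z.backward() on a multi-element NDArray behaves like mx.nd.sum(z).backward(), we can re-record the computation and backpropagate through an explicit sum; this check is our own addition, not part of the original example:

with autograd.record():
    y = x * 2
    z = y * x
# Summing first and then backpropagating should produce the same gradient
nd.sum(z).backward()
print(x.grad)   # expect [[4, 8], [12, 16]] again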

Sometimes when we call the backward method on an NDArray, e.g. y.backward(), where y is a function of x, we are just interested in the derivative of y with respect to x.
Mathematicians write this as $\frac{dy(x)}{dx}$.
At other times, we may be interested in the gradient of z with respect to x, where z is a function of y, which in turn, is a function of x.
That is, we are interested in $\frac{d}{dx} z(y(x))$.
Knowing how to differentiate composite functions will come in handy here.
Recall that by the chain rule $\frac{d}{dx} z(y(x)) = \frac{dz(y)}{dy} \frac{dy(x)}{dx}$.
So, when y is part of a larger function z, and we want x.grad to store $\frac{dz}{dx}$, we can pass in the head gradient $\frac{dz}{dy}$ as an input to backward().
The default argument is nd.ones_like(y).
See Wikipedia and Khan Academy for more details.


In [7]:
with autograd.record():
    y = x * 2
    z = y * x
    
head_gradient = nd.array([[10, 1.], [.1, .01]])
z.backward(head_gradient)
print(x.grad)


[[40.    8.  ]
 [ 1.2   0.16]]
<NDArray 2x2 @cpu(0)>
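To see the chain rule at work more directly, here's a small sketch in which we pretend y feeds into a larger function z = y ** 2 that we never actually record; the composite is our own hypothetical choice, not part of the original example. We compute the head gradient dz/dy = 2 * y by hand and pass it to y.backward():

with autograd.record():
    y = x * 2
dz_dy = 2 * y          # head gradient for the hypothetical z = y ** 2, computed by hand
y.backward(dz_dy)
print(x.grad)          # dz/dx = (dz/dy) * (dy/dx) = (4 * x) * 2 = 8 * x, i.e. [[8, 16], [24, 32]]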

Now that we know the basics, we can do some wild things with autograd, including building differentiable functions using Pythonic control flow.


In [8]:
a = nd.random_normal(shape=3)
a.attach_grad()

In [9]:
with autograd.record():
    b = a * 2
    # Keep doubling b until its norm reaches 1000
    while (nd.norm(b) < 1000).asscalar():
        b = b * 2

    # Pick a branch based on the sign of the sum of b's entries
    if (nd.sum(b) > 0).asscalar():
        c = b
    else:
        c = 100 * b

In [10]:
head_gradient = nd.array([0.01, 0.1, 1.0])
c.backward(head_gradient)

In [11]:
print(a.grad)


[  1024.  10240. 102400.]
<NDArray 3 @cpu(0)>
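One way to sanity-check this result: despite the loop and the branch, c is just a data-dependent constant times a (elementwise), so dc/da equals c / a, and a.grad should match head_gradient * c / a. This check is our own addition:

# c is linear in a, so the analytic gradient is head_gradient * c / a
print(head_gradient * c / a)   # should match a.grad printed above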
