In [1]:
from __future__ import print_function
import torch as T
import torch.autograd
from torch.autograd import Variable
import numpy as np
One of the cores of implementing a neural-network algorithm is back-propagating the derivatives of the cost function. Theano and TensorFlow both define symbolic differentiation functions; likewise, automatic differentiation (autograd) plays a central role in PyTorch as a deep-learning platform. The difference is that PyTorch's dynamic graph makes it more flexible (define by run): the attributes of a PyTorch Variable can be changed even on a per-iteration basis, letting it join or leave the backward graph on the fly. This is particularly useful in some applications; for example, in the later stage of training we may only need to update the parameters of the later layers, in which case we simply set the Variables of the earlier layers to not require gradients, as sketched below.
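As a minimal sketch of that freezing idea (the layer sizes, learning rate and the nn.Sequential model here are made up purely for illustration), setting requires_grad = False on the front layer's parameters removes them from the backward graph, and only the remaining parameters are handed to the optimizer:
In [ ]:
import torch.nn as nn
import torch.optim as optim

# A toy two-layer network; the sizes 20 -> 50 -> 10 are arbitrary.
model = nn.Sequential(nn.Linear(20, 50), nn.ReLU(), nn.Linear(50, 10))

# Freeze the first Linear layer: its parameters leave the backward graph.
for p in model[0].parameters():
    p.requires_grad = False

# Only optimize the parameters that still require gradients.
optimizer = optim.SGD([p for p in model.parameters() if p.requires_grad], lr=0.01)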
autograd.Variable is one of the core design concepts of PyTorch. It wraps a tensor in a Variable, supports most tensor operations on it, and gives it two crucial flags: requires_grad and volatile. A Variable also carries three attributes: .data stores the Variable's values; .grad is itself a Variable that stores the gradient; and .grad_fn (formerly creator) is the function that produced the Variable, which is None for user-created Variables. See the Variable source code for details.
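As a quick illustration of these attributes (the tiny graph below exists only for demonstration; it reuses the torch as T and Variable imports from the first cell):
In [ ]:
v = Variable(T.ones(3), requires_grad=True)
w = v * 2
print(v.data)     # the underlying tensor holding the values
print(v.grad)     # not populated until a backward pass has run
print(v.grad_fn)  # None, because v was created by the user
print(w.grad_fn)  # the backward function (a multiplication) that produced w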
In [13]:
x = Variable(T.ones(2,2), requires_grad=True)
print(x)
In [14]:
y = T.exp(x + 2)
yy = T.exp(-x-2)
print(y)
In [15]:
z = (y + yy)/2
out = z.mean()
print(z, out)
In [16]:
# make_dot is assumed to come from an external graph-visualization helper (e.g. the torchviz package); it is not defined in this notebook.
make_dot(out)
Out[16]:
In [29]:
out.backward(T.ones(1), retain_graph=True)  # seed the backward pass with a gradient of 1
In [30]:
x.grad
Out[30]:
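For reference, $z = \big(e^{x+2}+e^{-x-2}\big)/2 = \cosh(x+2)$ and out is the mean over the four entries of $z$, so each entry of x.grad should be $\sinh(x+2)/4 = \sinh(3)/4 \approx 2.504$ for $x=1$.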
In [31]:
T.randn(1,1)
Out[31]:
In [44]:
xx = Variable(T.randn(1, 1), requires_grad=True)
print(xx)
yy = 3*xx
zz = yy**2
#yy.register_hook(print)
zz.backward(T.FloatTensor([[0.1]]))  # seed gradient of 0.1, matching zz's 1x1 shape
print(xx.grad)
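For reference, zz $= (3\,\mathrm{xx})^2 = 9\,\mathrm{xx}^2$, so seeding the backward pass with 0.1 gives xx.grad $= 0.1 \times 18\,\mathrm{xx} = 1.8\,\mathrm{xx}$; uncommenting the register_hook line would print the intermediate gradient flowing into yy, which is $0.1 \times 2\,\mathrm{yy} = 0.6\,\mathrm{xx}$. A minimal standalone sketch of the hook (the aa/bb/cc names and the lambda are chosen here purely for illustration):
In [ ]:
aa = Variable(T.ones(1, 1), requires_grad=True)
bb = 3 * aa
cc = bb ** 2
# The hook prints the gradient flowing into bb during backward.
bb.register_hook(lambda grad: print('grad wrt bb:', grad))
cc.backward(T.FloatTensor([[0.1]]))
print(aa.grad)  # 1.8 * aa = 1.8 here, since aa = 1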
A simple numpy implementation of a one-hidden-layer neural network.
In this implementation, for each update of $w_i$, both the forward and the backward pass have to be computed by hand.
In [4]:
# y_pred = relu(x @ w1) @ w2
# loss = 0.5 * sum((y_pred - y)^2)
import numpy as np
N, D_in, D_hidden, D_out = 50, 40, 100, 10
x = np.random.randn(N, D_in)
y = np.random.randn(N, D_out)
w1 = np.random.randn(D_in, D_hidden)
w2 = np.random.randn(D_hidden, D_out)
learning_rate = 0.0001
for t in range(100):
    ### forward pass
    h = x.dot(w1)              # 50x40 times 40x100 produces 50x100
    h_relu = np.maximum(h, 0)  # np.maximum takes the element-wise max against 0, 50x100
    y_pred = h_relu.dot(w2)    # 50x100 times 100x10 produces 50x10
    # print(y_pred.shape)

    ### loss function
    loss = 0.5 * np.sum(np.square(y_pred - y))

    ### backward pass
    grad_y_pred = y_pred - y             # 50x10
    grad_w2 = h_relu.T.dot(grad_y_pred)  # 100x50 times 50x10 produces 100x10, so transpose h_relu
    grad_h_relu = grad_y_pred.dot(w2.T)  # 50x10 times 10x100 produces 50x100, so transpose w2
    grad_h = grad_h_relu.copy()          # make a copy before masking
    grad_h[h < 0] = 0                    # ReLU backward: zero the gradient where the pre-activation is negative
    grad_w1 = x.T.dot(grad_h)            # 40x50 times 50x100 produces 40x100
    w1 = w1 - learning_rate * grad_w1
    w2 = w2 - learning_rate * grad_w2
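The manual gradients above are just the chain rule: writing the forward pass as $h = x w_1$, $h_{\mathrm{relu}} = \max(h, 0)$, $\hat{y} = h_{\mathrm{relu}} w_2$ and $L = \tfrac{1}{2}\sum(\hat{y}-y)^2$, we get
$$\frac{\partial L}{\partial \hat{y}} = \hat{y} - y,\qquad \frac{\partial L}{\partial w_2} = h_{\mathrm{relu}}^{\top}(\hat{y}-y),\qquad \frac{\partial L}{\partial w_1} = x^{\top}\Big[\big((\hat{y}-y)\,w_2^{\top}\big)\odot \mathbf{1}[h>0]\Big],$$
which is exactly what grad_w2, grad_h_relu, grad_h and grad_w1 compute.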
With very slight modifications, we end up with an implementation of the same algorithm using PyTorch tensors.
In [7]:
import torch
N, D_in, D_hidden, D_out = 50, 40, 100, 10
x = torch.randn(N, D_in)
y = torch.randn(N, D_out)
w1 = torch.randn(D_in, D_hidden)
w2 = torch.randn(D_hidden, D_out)
learning_rate = 0.0001
for t in range(100):
    h = x.mm(w1)             # 50x40 times 40x100 produces 50x100
    # h = x.matmul(w1)       # matmul also works here, kept for comparison
    h_relu = h.clamp(min=0)  # clamp(min=0) is the tensor counterpart of np.maximum(h, 0), 50x100
    y_pred = h_relu.mm(w2)   # 50x100 times 100x10 produces 50x10
    # print(y_pred.size())

    loss = 0.5 * (y_pred - y).pow(2).sum()

    grad_y_pred = y_pred - y              # 50x10
    grad_w2 = h_relu.t().mm(grad_y_pred)  # 100x50 times 50x10 produces 100x10, so transpose h_relu
    grad_h_relu = grad_y_pred.mm(w2.t())  # 50x10 times 10x100 produces 50x100, so transpose w2
    grad_h = grad_h_relu.clone()          # make a copy before masking
    grad_h[h < 0] = 0                     # ReLU backward: zero the gradient where the pre-activation is negative
    grad_w1 = x.t().mm(grad_h)            # 40x50 times 50x100 produces 40x100
    w1 = w1 - learning_rate * grad_w1
    w2 = w2 - learning_rate * grad_w2
Now, with the autograd functionality in PyTorch, we can see how easy backpropagation becomes: computing the gradients of a two-layer network by hand is not a big deal, but it gets much more complicated as the number of layers grows. A sketch of the autograd version follows.
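A minimal sketch of the same training loop using autograd (the xv/yv/w1v/w2v names are new here to avoid clobbering the tensors above, and the loop mirrors the style of the official PyTorch examples; only loss.backward() is needed to obtain the gradients):
In [ ]:
N, D_in, D_hidden, D_out = 50, 40, 100, 10
xv = Variable(torch.randn(N, D_in))
yv = Variable(torch.randn(N, D_out))
w1v = Variable(torch.randn(D_in, D_hidden), requires_grad=True)
w2v = Variable(torch.randn(D_hidden, D_out), requires_grad=True)
learning_rate = 0.0001
for t in range(100):
    # Forward pass, written exactly as in the manual version.
    y_pred = xv.mm(w1v).clamp(min=0).mm(w2v)
    loss = 0.5 * (y_pred - yv).pow(2).sum()
    # Backward pass: autograd fills in w1v.grad and w2v.grad.
    loss.backward()
    # Gradient step on the underlying tensors, then reset the accumulated gradients.
    w1v.data -= learning_rate * w1v.grad.data
    w2v.data -= learning_rate * w2v.grad.data
    w1v.grad.data.zero_()
    w2v.grad.data.zero_()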
PyTorch provides torch.dot(), torch.mm(), torch.matmul() and * for basic matrix multiplication. It is worth noting the differences among them: torch.dot(a, b) gives the inner product of 1-D vectors $a$ and $b$; torch.mm(a, b) gives the matrix product of 2-D matrices; and torch.matmul() operates on tensors of general shape, so torch.matmul() can replace both torch.dot() and torch.mm(), but not vice versa. Finally, * simply computes the elementwise product, i.e. the Hadamard product. A small sketch contrasting them follows.
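A minimal sketch contrasting the four operations (the shapes are arbitrary):
In [ ]:
a = torch.randn(3)
b = torch.randn(3)
A = torch.randn(2, 3)
B = torch.randn(3, 4)
print(torch.dot(a, b))     # inner product of two 1-D vectors
print(torch.mm(A, B))      # 2x4 matrix product of two 2-D matrices
print(torch.matmul(a, b))  # same result as torch.dot here
print(torch.matmul(A, B))  # same result as torch.mm here
print(a * b)               # elementwise (Hadamard) product, shape 3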
torch.bmm(A, B): batch matrix multiplication for 3-D tensors; $A_{b\times n\times p}$ and $B_{b\times p\times m}$ produce a 3-D tensor of shape $b\times n\times m$
torch.baddbmm(M, A, B): batch matrix multiplication of $A$ and $B$ with $M$ added to each product, i.e. $M + A_i B_i$ per batch element, giving shape $b\times n\times m$
torch.addbmm(M, A, B): batch matrix multiplication summed over the batch dimension and added to $M$, i.e. $M + \sum_i A_i B_i$, giving shape $n\times m$
torch.addmm(M, A, B): ordinary 2-D matrix multiplication added to $M$, i.e. $M + AB$
A small sketch of the batched variants follows.
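A minimal sketch of the batched variants (the shapes are arbitrary; the optional beta/alpha scaling arguments are left at their defaults):
In [ ]:
A = torch.randn(4, 2, 3)    # batch of 4 matrices, each 2x3
B = torch.randn(4, 3, 5)    # batch of 4 matrices, each 3x5
M = torch.randn(2, 5)
Mb = torch.randn(4, 2, 5)
print(torch.bmm(A, B).size())             # 4x2x5: one matrix product per batch element
print(torch.baddbmm(Mb, A, B).size())     # 4x2x5: Mb + A[i] @ B[i] for each i
print(torch.addbmm(M, A, B).size())       # 2x5: M + sum_i A[i] @ B[i]
print(torch.addmm(M, A[0], B[0]).size())  # 2x5: M + A[0] @ B[0]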