In [1]:
from __future__ import print_function
import torch as T
import torch.autograd
from torch.autograd import Variable
import numpy as np

One of the cores of implementing a neural network is backpropagating the derivatives of the cost function. Theano and TensorFlow both define symbolic differentiation; likewise, automatic differentiation (autograd) plays a central role in PyTorch. The difference is that PyTorch's dynamic graph (define-by-run) makes it more flexible: the attributes of a PyTorch Variable can be changed even at every iteration, adding it to or removing it from the backward graph. This is particularly useful in some applications; for example, in the later stage of training, when only the parameters of the later layers need updating, it suffices to mark the Variables of the earlier layers as not requiring gradients.

Variables

autograd.Variable is one of the core design ideas of PyTorch. It wraps a tensor into a Variable and supports the vast majority of tensor operations, while adding two crucial flags: requires_grad and volatile. A Variable also exposes three members: .data stores the Variable's value; .grad, itself a Variable, stores its gradient; and .grad_fn (formerly creator) is the function that produced the Variable, which is None for user-created Variables. See the Variable source code for details.
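
A brief sketch of these attributes (not an original cell of this notebook), showing .grad_fn on a leaf versus a derived Variable, and how requires_grad=False keeps a Variable out of the backward graph:

In [ ]:
a = Variable(T.ones(2, 2), requires_grad=True)    # leaf Variable: .grad_fn is None
b = Variable(T.ones(2, 2), requires_grad=False)   # excluded from the backward graph
c = (a * b).sum()                                  # derived Variable: .grad_fn is set

print(a.grad_fn, c.grad_fn)   # None and a backward-function object
c.backward()                  # c is a scalar, so no gradient argument is needed
print(a.grad)                 # a 2x2 Variable of ones
print(b.grad)                 # None: no gradient was computed for b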


In [13]:
x = Variable(T.ones(2,2), requires_grad=True)
print(x)


Variable containing:
 1  1
 1  1
[torch.FloatTensor of size 2x2]


In [14]:
y = T.exp(x + 2)
yy = T.exp(-x-2)
print(y)


Variable containing:
 20.0855  20.0855
 20.0855  20.0855
[torch.FloatTensor of size 2x2]


In [15]:
z = (y + yy)/2
out = z.mean()
print(z, out)


Variable containing:
 10.0677  10.0677
 10.0677  10.0677
[torch.FloatTensor of size 2x2]
 Variable containing:
 10.0677
[torch.FloatTensor of size 1]


In [16]:
make_dot(out)  # draw the autograd graph; make_dot is assumed to come from a visualization helper such as torchviz


Out[16]:
[graphviz rendering of the backward graph; its final node is MeanBackward]

In [29]:
# Note: T.FloatTensor(1) is an *uninitialized* 1-element tensor, so the gradients
# below are scaled by an arbitrary value; retain_graph=True keeps the graph so
# backward can be called again.
out.backward(T.FloatTensor(1), retain_graph=True)

In [30]:
x.grad


Out[30]:
Variable containing:
-1.2072e+21 -1.2072e+21
-1.2072e+21 -1.2072e+21
[torch.FloatTensor of size 2x2]
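
For reference, a sketch (not an original cell) of the same call with an explicit, initialized gradient of 1; the analytic gradient of out = mean(cosh(x+2)) is sinh(x+2)/4 per element:

In [ ]:
x.grad.data.zero_()                 # clear the previously accumulated gradient
out.backward(T.FloatTensor([1.0]))  # the graph was retained above, so backward can run again
print(x.grad)                       # each entry is sinh(3)/4, roughly 2.5045, at x = 1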

In [31]:
T.randn(1,1)


Out[31]:
-0.7466
[torch.FloatTensor of size 1x1]

In [44]:
xx = Variable(T.randn(1, 1), requires_grad=True)
print(xx)
yy = 3*xx
zz = yy**2

#yy.register_hook(print)
zz.backward(T.FloatTensor([0.1]))
print(xx.grad)


Variable containing:
 1.1988
[torch.FloatTensor of size 1x1]

Variable containing:
 2.1578
[torch.FloatTensor of size 1x1]
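
The commented-out yy.register_hook(print) line hints at how to inspect the gradient of an intermediate Variable, whose .grad is not populated by backward. A minimal sketch (reusing Variable and T from the earlier cells):

In [ ]:
xx = Variable(T.randn(1, 1), requires_grad=True)
yy = 3 * xx
zz = yy ** 2

h = yy.register_hook(print)        # the hook receives d(zz)/d(yy), scaled by the gradient passed to backward
zz.backward(T.FloatTensor([0.1]))  # prints 0.1 * 2 * yy, i.e. 0.6 * xx
h.remove()                         # remove the hook once it is no longer needed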

A simple NumPy implementation of a one-hidden-layer neural network.

In this implementation, both the forward and the backward pass have to be written out by hand for every update of $w_i$.


In [4]:
# y_pred = w2*(relu(w1*x))
# loss = 0.5*sum (y_pred - y)^2
import numpy as np

N, D_in, D_hidden, D_out = 50, 40, 100, 10

x = np.random.randn(N, D_in)
y = np.random.randn(N, D_out)

w1 = np.random.randn(D_in, D_hidden)
w2 = np.random.randn(D_hidden, D_out)

learning_rate = 0.0001
for t in range(100):
    ### 前向通道
    h = x.dot(w1) #50x40 and 40x100 produce 50x100
    h_relu = np.maximum(h, 0)  # np.maximum takes two arrays and does an element-wise max (ReLU), 50x100
    y_pred = h_relu.dot(w2) #50x100 and 100x10 produce 50x10
    #print y_pred.shape
    
    ### 误差函数
    loss = 0.5 * np.sum(np.square(y_pred - y))
    
    
    ### 反向通道
    grad_y_pred = y_pred - y #50x10
    grad_w2 = h_relu.T.dot(grad_y_pred) #50x100 and 50x10 should produce 100x10, so transpose h_relu
    grad_h_relu = grad_y_pred.dot(w2.T) #50x10 and 100x10 should produce 50x100, so transpose w2
    grad_h = grad_h_relu.copy()   # copy before masking
    grad_h[h < 0] = 0             # ReLU backward: zero the gradient where the pre-activation h was negative
    grad_w1 = x.T.dot(grad_h)     # 40x50 and 50x100 produce 40x100, so transpose x
    
    w1 = w1 - learning_rate * grad_w1
    w2 = w2 - learning_rate * grad_w2

With very slight modifications, we end up with an implementation of the same algorithm in PyTorch, still computing the gradients by hand.


In [7]:
import torch

N, D_in, D_hidden, D_out = 50, 40, 100, 10

x = torch.randn(N, D_in)
y = torch.randn(N, D_out)

w1 = torch.randn(D_in, D_hidden)
w2 = torch.randn(D_hidden, D_out)

learning_rate = 0.0001
for t in range(100):
    h = x.mm(w1) #50x40 and 40x100 produce 50x100
    #h = x.matmul(w1) #50x40 and 40x100 produce 50x100, matmul for checking
    h_relu = h.clamp(min=0)  # clamp(min=0) is the element-wise ReLU, 50x100
    y_pred = h_relu.mm(w2) #50x100 and 100x10 produce 50x10
    #print y_pred.shape
    
    loss = 0.5 * (y_pred - y).pow(2).sum()
    
    grad_y_pred = y_pred - y #50x10
    grad_w2 = h_relu.t().mm(grad_y_pred) #50x100 and 50x10 should produce 100x10, so transpose h_relu
    grad_h_relu = grad_y_pred.mm(w2.t()) #50x10 and 100x10 should produce 50x100, so transpose w2
    grad_h = grad_h_relu.clone() # copy before masking
    grad_h[h < 0] = 0            # ReLU backward: zero the gradient where the pre-activation h was negative
    grad_w1 = x.t().mm(grad_h)   # 40x50 and 50x100 produce 40x100, so transpose x
    
    w1 = w1 - learning_rate * grad_w1
    w2 = w2 - learning_rate * grad_w2

Now, with the autograd functionality in PyTorch, we can see how much easier backpropagation becomes: computing the gradients of a two-layer network by hand is not a big deal, but it gets much more complicated as the number of layers grows.
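
As a sketch of how the same network looks with autograd (this cell was not in the original notebook), the weights are wrapped in Variables with requires_grad=True, loss.backward() fills w1.grad and w2.grad, and the hand-written backward pass disappears:

In [ ]:
import torch
from torch.autograd import Variable

N, D_in, D_hidden, D_out = 50, 40, 100, 10

x = Variable(torch.randn(N, D_in))
y = Variable(torch.randn(N, D_out))

w1 = Variable(torch.randn(D_in, D_hidden), requires_grad=True)
w2 = Variable(torch.randn(D_hidden, D_out), requires_grad=True)

learning_rate = 0.0001
for t in range(100):
    y_pred = x.mm(w1).clamp(min=0).mm(w2)    # forward pass only
    loss = 0.5 * (y_pred - y).pow(2).sum()

    loss.backward()                          # autograd computes w1.grad and w2.grad

    w1.data -= learning_rate * w1.grad.data  # update the underlying tensors in place
    w2.data -= learning_rate * w2.grad.data

    w1.grad.data.zero_()                     # .grad accumulates, so reset it each iteration
    w2.grad.data.zero_()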


Basic matrix multiplication in PyTorch

PyTorch provides torch.dot(), torch.mm(), torch.matmul(), and * for basic multiplication. It is worth noting the differences among them: torch.dot(a, b) gives the inner product of the 1-D vectors $a$ and $b$; torch.mm(a, b) gives the matrix product of two 2-D matrices; and torch.matmul() operates on tensors more generally, so it can replace both torch.dot() and torch.mm(), but not vice versa. Finally, * simply computes the element-wise product, i.e. the Hadamard product.
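
A short illustrative cell (not from the original notebook) to make the distinctions concrete:

In [ ]:
a = torch.randn(3)
b = torch.randn(3)
A = torch.randn(2, 3)
B = torch.randn(3, 4)

print(torch.dot(a, b))      # scalar: inner product of two 1-D vectors
print(torch.mm(A, B))       # 2x4 matrix product of two 2-D matrices
print(torch.matmul(a, b))   # same result as torch.dot for 1-D inputs
print(torch.matmul(A, B))   # same result as torch.mm for 2-D inputs
print(a * b)                # element-wise (Hadamard) product, still 1-D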

Advanced matrix multiplication

torch.bmm(A, B): batch matrix multiplication for 3-D tensors; $A_{b\times n\times p}$ and $B_{b\times p\times m}$ produce a 3-D tensor of shape $b\times n\times m$.

torch.baddbmm(M, A, B): batched add-and-multiply; with $M_{b\times n\times m}$, $A_{b\times n\times p}$ and $B_{b\times p\times m}$ it computes $M_i + A_i B_i$ for every batch index $i$, giving a $b\times n\times m$ tensor.

torch.addbmm(M, A, B): like baddbmm, but the batch of products is summed into a single 2-D result $M_{n\times m} + \sum_i A_i B_i$.

torch.addmm(M, A, B): add-and-multiply for 2-D matrices, $M_{n\times m} + A_{n\times p} B_{p\times m}$. (The three add* variants also accept optional beta and alpha scaling factors.)
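
A shape-checking sketch of these four operations (again not an original cell; the tensor names here are made up for illustration):

In [ ]:
A  = torch.randn(5, 4, 3)   # batch of 5 matrices, each 4x3
B  = torch.randn(5, 3, 6)   # batch of 5 matrices, each 3x6
M2 = torch.randn(4, 6)      # 2-D matrix to add
M3 = torch.randn(5, 4, 6)   # batched matrix to add

print(torch.bmm(A, B).size())              # (5, 4, 6): A[i] @ B[i] for each batch i
print(torch.baddbmm(M3, A, B).size())      # (5, 4, 6): M3[i] + A[i] @ B[i]
print(torch.addbmm(M2, A, B).size())       # (4, 6):    M2 + sum_i A[i] @ B[i]
print(torch.addmm(M2, A[0], B[0]).size())  # (4, 6):    M2 + A[0] @ B[0]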

