起手式，導入 numpy, matplotlib



In [1]:

    
from PIL import Image
import numpy as np
%matplotlib inline
import matplotlib
import matplotlib.pyplot as plt
matplotlib.style.use('bmh')
matplotlib.rcParams['figure.figsize']=(8,5)

使用之前下載的 mnist 資料，載入訓練資料 train_set 和測試資料 test_set



In [2]:

    
import gzip
import pickle
with gzip.open('../Week02/mnist.pkl.gz', 'rb') as f:
    train_set, validation_set, test_set = pickle.load(f, encoding='latin1')
    
train_X, train_y = train_set
validation_X, validation_y = validation_set
test_X, test_y = test_set

之前的看圖片函數



In [3]:

    
from IPython.display import display
def showX(X):
    int_X = (X*255).clip(0,255).astype('uint8')
    # N*784 -> N*28*28 -> 28*N*28 -> 28 * 28N
    int_X_reshape = int_X.reshape(-1,28,28).swapaxes(0,1).reshape(28,-1)
    display(Image.fromarray(int_X_reshape))
# 訓練資料， X 的前 20 筆
showX(train_X[:20])

train_set 是用來訓練我們的模型用的

我們的模型是很簡單的 logistic regression 模型，用到的參數只有一個 784x10 的矩陣 W 和一個長度 10 的向量 b。

我們先用均勻隨機亂數來設定 W 和 b 。



In [4]:

    
W = np.random.uniform(low=-1, high=1, size=(28*28,10))
b = np.random.uniform(low=-1, high=1, size=10)

完整的模型如下將圖片看成是長度 784 的向量 x

計算 $Wx+b$，然後再取 $exp$。最後得到的十個數值。將這些數值除以他們的總和。我們希望出來的數字會符合這張圖片是這個數字的機率。

$ \Pr(Y=i|x, W, b) = \frac {e^{W_i x + b_i}} {\sum_j e^{W_j x + b_j}}$

先拿第一筆資料試試看， x 是輸入。 y 是這張圖片對應到的數字(以這個例子來說 y=5)。



In [5]:

    
x = train_X[0]
y = train_y[0]
showX(x)
y

先計算 $e^{Wx+b} $



In [6]:

    
Pr = np.exp(x @ W + b)
Pr.shape









    Out[6]:





(10,)

然後 normalize，讓總和變成 1 （符合機率的意義）



In [7]:

    
Pr = Pr/Pr.sum()
Pr









    Out[7]:





array([  1.11201478e-03,   2.32129668e-06,   3.47186834e-03,
         3.64416088e-03,   9.89922844e-01,   3.83462031e-04,
         4.67890738e-09,   3.02581069e-04,   1.11720864e-07,
         1.16063080e-03])

由於 $W$ 和 $b$ 都是隨機設定的，所以上面我們算出的機率也是隨機的。

正確解是 $y=5$，運氣好有可能猜中

為了要評斷我們的預測的品質，要設計一個評斷誤差的方式，我們用的方法如下（不是常見的方差，而是用熵的方式來算，好處是容易微分，效果好）

$ loss = - \log(\Pr(Y=y|x, W,b)) $

上述的誤差評分方式，常常稱作 error 或者 loss，數學式可能有點費解。實際計算其實很簡單，就是下面的式子



In [8]:

    
loss = -np.log(Pr[y])
loss









    Out[8]:





7.8662699474374271

想辦法改進。

我們用一種被稱作是 gradient descent 的方式來改善我們的誤差。

因為我們知道 gradient 是讓函數上升最快的方向。所以我們如果朝 gradient 的反方向走一點點（也就是下降最快的方向），那麼得到的函數值應該會小一點。

記得我們的變數是 $W$ 和 $b$ (裡面總共有 28*20+10 個變數)，所以我們要把 $loss$ 對 $W$ 和 $b$ 裡面的每一個參數來偏微分。

還好這個偏微分是可以用手算出他的形式，而最後偏微分的式子也不會很複雜。

$loss$ 展開後可以寫成 $loss = \log(\sum_j e^{W_j x + b_j}) - W_i x - b_i$

對 $k \neq i$ 時, $loss$ 對 $b_k$ 的偏微分是 $$ \frac{e^{W_k x + b_k}}{\sum_j e^{W_j x + b_j}} = \Pr(Y=k | x, W, b)$$ 對 $k = i$ 時, $loss$ 對 $b_k$ 的偏微分是 $$ \Pr(Y=k | x, W, b) - 1$$



In [9]:

    
gradb = Pr.copy()
gradb[y] -= 1
print(gradb)









    



[  1.11201478e-03   2.32129668e-06   3.47186834e-03   3.64416088e-03
   9.89922844e-01  -9.99616538e-01   4.67890738e-09   3.02581069e-04
   1.11720864e-07   1.16063080e-03]

對 $W$ 的偏微分也不難

對 $k \neq i$ 時, $loss$ 對 $W_{k,t}$ 的偏微分是 $$ \frac{e^{W_k x + b_k} W_{k,t} x_t}{\sum_j e^{W_j x + b_j}} = \Pr(Y=k | x, W, b) x_t$$ 對 $k = i$ 時, $loss$ 對 $W_{k,t}$ 的偏微分是 $$ \Pr(Y=k | x, W, b) x_t - x_t$$



In [10]:

    
print(Pr.shape, x.shape, W.shape)
gradW = x.reshape(784,1) @ Pr.reshape(1,10)
gradW[:, y] -= x









    



(10,) (784,) (784, 10)

算好 gradient 後，讓 W 和 b 分別往 gradient 反方向走一點點，得到新的 W 和 b



In [11]:

    
W -= 0.1 * gradW
b -= 0.1 * gradb

再一次計算 $\Pr$ 以及 $loss$



In [12]:

    
Pr = np.exp(x @ W + b)
Pr = Pr/Pr.sum()
loss = -np.log(Pr[y])
loss









    Out[12]:





0.0026163360947788227

Q

看看 Pr ，然後找出機率最大者， predict y 值
再跑一遍上面程序，看看誤差是否變小？
拿其他的測試資料來看看，我們的 W, b 學到了什麼？

我們將同樣的方式輪流對五萬筆訓練資料來做，看看情形會如何



In [13]:

    
W = np.random.uniform(low=-1, high=1, size=(28*28,10))
b = np.random.uniform(low=-1, high=1, size=10)
score = 0
N=50000*20
d = 0.001
learning_rate = 1e-2
for i in range(N):
    if i%50000==0:
        print(i, "%5.3f%%"%(score*100))
    x = train_X[i%50000]
    y = train_y[i%50000]
    Pr = np.exp( x @ W +b)
    Pr = Pr/Pr.sum()
    loss = -np.log(Pr[y])
    score *=(1-d)
    if Pr.argmax() == y:
        score += d
    gradb = Pr.copy()
    gradb[y] -= 1
    gradW = x.reshape(784,1) @ Pr.reshape(1,10)
    gradW[:, y] -= x
    W -= learning_rate * gradW
    b -= learning_rate * gradb









    



0 0.000%
50000 87.490%
100000 89.497%
150000 90.022%
200000 90.377%
250000 90.599%
300000 91.002%
350000 91.298%
400000 91.551%
450000 91.613%
500000 91.678%
550000 91.785%
600000 91.792%
650000 91.889%
700000 91.918%
750000 91.946%
800000 91.885%
850000 91.955%
900000 91.954%
950000 92.044%

結果發現正確率大約是 92%，但這是對訓練資料而不是對測試資料

而且，一筆一筆的訓練資也有點慢，線性代數的特點就是能夠向量運算。如果把很多筆 $x$ 當成列向量組合成一個矩陣（然後叫做 $X$），由於矩陣乘法的原理，我們還是一樣計算 $WX+b$ ，就可以同時得到多筆結果。

下面的函數，可以一次輸入多筆 $x$，同時一次計算多筆 $x$ 的結果和準確率。



In [14]:

    
def compute_Pr(X):
    Pr = np.exp(X @ W + b)
    return Pr/Pr.sum(axis=1, keepdims=True)
def compute_accuracy(Pr, y):
    return (Pr.argmax(axis=1)==y).mean()

下面是更新過得訓練過程，當 i%100000 時，順便計算一下 test accuracy 和 valid accuracy。



In [15]:

    
%%timeit -r 1 -n 1
def compute_Pr(X):
    Pr = np.exp(X @ W + b)
    return Pr/Pr.sum(axis=1, keepdims=True)
def compute_accuracy(Pr, y):
    return (Pr.argmax(axis=1)==y).mean()

W = np.random.uniform(low=-1, high=1, size=(28*28,10))
b = np.random.uniform(low=-1, high=1, size=10)
score = 0
N=20000
batch_size = 128
learning_rate = 0.5
for i in range(0, N):
    if (i+1)%2000==0:        
        test_score = compute_accuracy(compute_Pr(test_X), test_y)*100        
        train_score = compute_accuracy(compute_Pr(train_X), train_y)*100
        print(i+1, "%5.2f%%"%test_score, "%5.2f%%"%train_score)
    # 隨機選出一些訓練資料出來
    rndidx = np.random.choice(train_X.shape[0], batch_size, replace=False)
    X, y  = train_X[rndidx], train_y[rndidx]
    # 一次計算所有的 Pr
    Pr = compute_Pr(X)
    # 計算平均 gradient 
    Pr_one_y = Pr-np.eye(10)[y]
    gradb = Pr_one_y.mean(axis=0)
    gradW = X.T @ (Pr_one_y) / batch_size
    # 更新 W 和 ba
    W -= learning_rate * gradW
    b -= learning_rate * gradb









    



2000 90.50% 90.47%
4000 91.17% 91.56%
6000 91.72% 92.03%
8000 91.86% 92.25%
10000 92.03% 92.52%
12000 92.14% 92.88%
14000 92.34% 92.81%
16000 92.29% 92.99%
18000 92.18% 93.13%
20000 92.06% 93.12%
1 loop, best of 1: 1min 8s per loop

最後得到的準確率是 92%-93%

不算完美，不過畢竟這只有一個矩陣而已。

光看數據沒感覺，我們來看看前十筆測試資料跑起來的情形

可以看到前十筆只有錯一個



In [16]:

    
Pr = compute_Pr(test_X[:10])
pred_y =Pr.argmax(axis=1)
for i in range(10):
    print(pred_y[i], test_y[i])
    showX(test_X[i])

看看前一百筆資料中，是哪些情況算錯



In [17]:

    
Pr = compute_Pr(test_X[:100])
pred_y = Pr.argmax(axis=1)
for i in range(100):
    if pred_y[i] != test_y[i]:
        print(pred_y[i], test_y[i])
        showX(test_X[i])