version 0.1, May 2016
This notebook is licensed under a Creative Commons Attribution-ShareAlike 3.0 Unported License.
Based on the slides and presentation by Alec Radford (GitHub).
For this class you must install Theano:
pip install theano
In [1]:
import numpy as np
In [2]:
from load import mnist
X_train, X_test, y_train2, y_test2 = mnist(onehot=True)
In [3]:
y_train = np.argmax(y_train2, axis=1)
y_test = np.argmax(y_test2, axis=1)
In [4]:
X_train[1].reshape((28, 28)).round(2)[:, 4:9].tolist()
Out[4]:
In [7]:
from pylab import imshow, show, cm
import matplotlib.pylab as plt
%matplotlib inline
def view_image(image, label="", predicted='', size=4):
    """View a single image."""
    plt.figure(figsize=(size, size))
    plt.imshow(image.reshape((28, 28)), cmap=cm.gray)
    plt.tick_params(axis='x', which='both', bottom='off', top='off', labelbottom='off')
    plt.tick_params(axis='y', which='both', left='off', top='off', labelleft='off')
    show()
    if predicted == '':
        print("Label: %s" % label)
    else:
        print('Label: ', str(label), 'Predicted: ', str(predicted))
In [8]:
view_image(X_train[1], y_train[1])
In [9]:
view_image(X_train[40000], y_train[40000])
In [11]:
def similarity(image, images):
    """Inverse Euclidean distance between one image and each image in a set."""
    similarities = []
    image = image.reshape((28, 28))
    images = images.reshape((-1, 28, 28))
    for i in range(images.shape[0]):
        distance = np.sqrt(np.sum((image - images[i]) ** 2))
        sim = 1 / distance
        similarities.append(sim)
    return similarities
In [12]:
np.random.seed(52)
small_train = np.random.choice(X_train.shape[0], 100)
In [13]:
view_image(X_test[0])
In [14]:
similarities = similarity(X_test[0], X_train[small_train])
In [15]:
view_image(X_train[small_train[np.argmax(similarities)]])
Let's try another example.
In [16]:
view_image(X_test[200])
In [17]:
similarities = similarity(X_test[200], X_train[small_train])
view_image(X_train[small_train[np.argmax(similarities)]])
Logistic regression is a probabilistic, linear classifier. It is parametrized by a weight matrix $W$ and a bias vector $b$. Classification is done by projecting data points onto a set of hyperplanes, the distance to which is used to determine a class membership probability.
Mathematically, this can be written as:
$$ P(Y=i \vert x, W, b) = \mathrm{softmax}_i(W x + b) = \frac{e^{W_i x + b_i}}{\sum_j e^{W_j x + b_j}} $$
The model's prediction is then the argmax of the vector whose $i$-th element is $P(Y=i \vert x)$:
$$ y_{pred} = argmax_i P(Y=i|x,W,b) $$
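As a quick, concrete illustration of the formulas above, here is a minimal NumPy sketch (the values of W, b, and x below are toy placeholders, not part of the original notebook):
rng = np.random.RandomState(0)
W = rng.randn(10, 784) * 0.01      # one weight row per class
b = np.zeros(10)
x = rng.rand(784)                  # a fake flattened 28x28 image

scores = W.dot(x) + b
p = np.exp(scores - scores.max())  # softmax numerator (max subtracted for numerical stability)
p = p / p.sum()                    # P(Y=i | x, W, b)
y_pred = np.argmax(p)              # argmax_i P(Y=i | x, W, b)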
In [18]:
import theano
from theano import tensor as T
import numpy as np
import datetime as dt
In [19]:
theano.config.floatX = 'float32'
Theano is a Python library that lets you define, optimize, and evaluate mathematical expressions, especially ones with multi-dimensional arrays (numpy.ndarray). Using Theano it is possible to attain speeds rivaling hand-crafted C implementations for problems involving large amounts of data. It can also surpass C on a CPU by many orders of magnitude by taking advantage of recent GPUs.
Theano combines aspects of a computer algebra system (CAS) with aspects of an optimizing compiler. It can also generate customized C code for many mathematical operations. This combination of CAS with optimizing compilation is particularly useful for tasks in which complicated mathematical expressions are evaluated repeatedly and evaluation speed is critical. For situations where many different expressions are each evaluated once, Theano can minimize the amount of compilation/analysis overhead, but still provide symbolic features such as automatic differentiation.
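A minimal end-to-end example of the Theano workflow (a sketch, not from the original notebook): declare symbolic variables, build an expression, compile it into a callable function, then evaluate it on concrete data.
import theano
from theano import tensor as T

a = T.fvector('a')
b = T.fvector('b')
expr = (a * b).sum()                     # symbolic dot product
f = theano.function(inputs=[a, b], outputs=expr,
                    allow_input_downcast=True)
print(f([1, 2, 3], [4, 5, 6]))           # prints 32.0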
In [22]:
def floatX(X):
    # return np.asarray(X, dtype='float32')
    return np.asarray(X, dtype=theano.config.floatX)

def init_weights(shape):
    return theano.shared(floatX(np.random.randn(*shape) * 0.01))

def model(X, w):
    return T.nnet.softmax(T.dot(X, w))
In [23]:
X = T.fmatrix()
Y = T.fmatrix()
w = init_weights((784, 10))
In [24]:
w.get_value()
Out[24]:
initialize model
In [25]:
py_x = model(X, w)
y_pred = T.argmax(py_x, axis=1)
In [26]:
cost = T.mean(T.nnet.categorical_crossentropy(py_x, Y))
gradient = T.grad(cost=cost, wrt=w)
update = [[w, w - gradient * 0.05]]
In [27]:
train = theano.function(inputs=[X, Y], outputs=cost, updates=update, allow_input_downcast=True)
predict = theano.function(inputs=[X], outputs=y_pred, allow_input_downcast=True)
One epoch (a single pass over the training set)
In [28]:
for start, end in zip(range(0, X_train.shape[0], 128), range(128, X_train.shape[0], 128)):
    cost = train(X_train[start:end], y_train2[start:end])
In [29]:
errors = [(np.mean(y_train != predict(X_train)),
           np.mean(y_test != predict(X_test)))]
errors
Out[29]:
Now for 100 epochs
In [30]:
t0 = dt.datetime.now()
for i in range(100):
    for start, end in zip(range(0, X_train.shape[0], 128),
                          range(128, X_train.shape[0], 128)):
        cost = train(X_train[start:end], y_train2[start:end])
    errors.append((np.mean(y_train != predict(X_train)),
                   np.mean(y_test != predict(X_test))))
    print(i, errors[-1])
print('Total time: ', (dt.datetime.now()-t0).seconds / 60.)
In [31]:
res = np.array(errors)
plt.plot(np.arange(res.shape[0]), res[:, 0], label='train error')
plt.plot(np.arange(res.shape[0]), res[:, 1], label='test error')
plt.legend()
Out[31]:
In [32]:
y_pred = predict(X_test)
In [33]:
np.random.seed(2)
small_test = np.random.choice(X_test.shape[0], 10)
for i in small_test:
view_image(X_test[i], label=y_test[i], predicted=y_pred[i], size=1)
In [34]:
def sgd(cost, params, lr=0.05):
    grads = T.grad(cost=cost, wrt=params)
    updates = []
    for p, g in zip(params, grads):
        updates.append([p, p - g * lr])
    return updates

def model(X, w_h, w_o):
    h = T.nnet.sigmoid(T.dot(X, w_h))
    pyx = T.nnet.softmax(T.dot(h, w_o))
    return pyx
w_h = init_weights((784, 625))
w_o = init_weights((625, 10))
py_x = model(X, w_h, w_o)
y_x = T.argmax(py_x, axis=1)
cost = T.mean(T.nnet.categorical_crossentropy(py_x, Y))
params = [w_h, w_o]
updates = sgd(cost, params)
train = theano.function(inputs=[X, Y], outputs=cost, updates=updates, allow_input_downcast=True)
predict = theano.function(inputs=[X], outputs=y_x, allow_input_downcast=True)
In [30]:
t0 = dt.datetime.now()
errors = []
for i in range(100):
    for start, end in zip(range(0, X_train.shape[0], 128),
                          range(128, X_train.shape[0], 128)):
        cost = train(X_train[start:end], y_train2[start:end])
    errors.append((np.mean(y_train != predict(X_train)),
                   np.mean(y_test != predict(X_test))))
    print(i, errors[-1])
print('Total time: ', (dt.datetime.now()-t0).seconds / 60.)
In [31]:
res = np.array(errors)
plt.plot(np.arange(res.shape[0]), res[:, 0], label='train error')
plt.plot(np.arange(res.shape[0]), res[:, 1], label='test error')
plt.legend()
Out[31]:
In [35]:
from theano.sandbox.rng_mrg import MRG_RandomStreams as RandomStreams
srng = RandomStreams()
def rectify(X):
    return T.maximum(X, 0.)
In [36]:
def RMSprop(cost, params, lr=0.001, rho=0.9, epsilon=1e-6):
    grads = T.grad(cost=cost, wrt=params)
    updates = []
    for p, g in zip(params, grads):
        acc = theano.shared(p.get_value() * 0.)
        acc_new = rho * acc + (1 - rho) * g ** 2
        gradient_scaling = T.sqrt(acc_new + epsilon)
        g = g / gradient_scaling
        updates.append((acc, acc_new))
        updates.append((p, p - lr * g))
    return updates
RMSprop is an unpublished, adaptive learning rate method proposed by Geoff Hinton in Lecture 6e of his Coursera class.
RMSprop and Adadelta were both developed independently around the same time, stemming from the need to resolve Adagrad's radically diminishing learning rates. RMSprop is in fact identical to the first update vector of Adadelta:
$$ E[g^2]_t = 0.9\, E[g^2]_{t-1} + 0.1\, g^2_t $$
$$ \theta_{t+1} = \theta_{t} - \frac{\eta}{\sqrt{E[g^2]_t + \epsilon}}\, g_{t} $$
RMSprop thus divides the learning rate by an exponentially decaying average of squared gradients. Hinton suggests a decay rate $\gamma$ of 0.9 (the 0.9/0.1 pair in the first equation), while a good default value for the learning rate $\eta$ is 0.001.
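Here is a minimal NumPy sketch of a few RMSprop steps on a toy quadratic loss, mirroring the two equations above (all names and values are illustrative, not part of the notebook):
rho, lr, eps = 0.9, 0.001, 1e-6
theta = np.array([1.0, -2.0])               # parameters theta
acc = np.zeros_like(theta)                  # running average E[g^2]
for _ in range(5):
    g = theta                               # gradient of f(theta) = 0.5 * ||theta||^2
    acc = rho * acc + (1 - rho) * g ** 2    # E[g^2]_t
    theta = theta - lr * g / np.sqrt(acc + eps)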
In [37]:
def dropout(X, p=0.):
    if p > 0:
        retain_prob = 1 - p
        X *= srng.binomial(X.shape, p=retain_prob, dtype=theano.config.floatX)
        X /= retain_prob   # inverted dropout: rescale so the expected activation is unchanged
    return X

def model(X, w_h, w_h2, w_o, p_drop_input, p_drop_hidden):
    X = dropout(X, p_drop_input)
    h = rectify(T.dot(X, w_h))
    h = dropout(h, p_drop_hidden)
    h2 = rectify(T.dot(h, w_h2))
    h2 = dropout(h2, p_drop_hidden)
    py_x = softmax(T.dot(h2, w_o))
    return h, h2, py_x

def softmax(X):
    e_x = T.exp(X - X.max(axis=1).dimshuffle(0, 'x'))   # subtract the row max for numerical stability
    return e_x / e_x.sum(axis=1).dimshuffle(0, 'x')
In [35]:
w_h = init_weights((784, 625))
w_h2 = init_weights((625, 625))
w_o = init_weights((625, 10))
noise_h, noise_h2, noise_py_x = model(X, w_h, w_h2, w_o, 0.2, 0.5)
h, h2, py_x = model(X, w_h, w_h2, w_o, 0., 0.)
y_x = T.argmax(py_x, axis=1)
cost = T.mean(T.nnet.categorical_crossentropy(noise_py_x, Y))
params = [w_h, w_h2, w_o]
updates = RMSprop(cost, params, lr=0.001)
train = theano.function(inputs=[X, Y], outputs=cost, updates=updates, allow_input_downcast=True)
predict = theano.function(inputs=[X], outputs=y_x, allow_input_downcast=True)
In [36]:
t0 = dt.datetime.now()
errors = []
for i in range(100):
    for start, end in zip(range(0, X_train.shape[0], 128),
                          range(128, X_train.shape[0], 128)):
        cost = train(X_train[start:end], y_train2[start:end])
    errors.append((np.mean(y_train != predict(X_train)),
                   np.mean(y_test != predict(X_test))))
    print(i, errors[-1])
print('Total time: ', (dt.datetime.now()-t0).seconds / 60.)
In [37]:
res = np.array(errors)
plt.plot(np.arange(res.shape[0]), res[:, 0], label='train error')
plt.plot(np.arange(res.shape[0]), res[:, 1], label='test error')
plt.legend()
Out[37]:
In machine learning, a convolutional neural network (CNN, or ConvNet) is a type of feed-forward artificial neural network in which the connectivity pattern between its neurons is inspired by the organization of the animal visual cortex, whose individual neurons are arranged in such a way that they respond to overlapping regions tiling the visual field. Convolutional networks were inspired by biological processes and are variations of multilayer perceptrons designed to use minimal amounts of preprocessing. (Wikipedia)
Convolutional Neural Networks (CNN) are biologically-inspired variants of MLPs. From Hubel and Wiesel's early work on the cat's visual cortex, we know the visual cortex contains a complex arrangement of cells. These cells are sensitive to small sub-regions of the visual field, called a receptive field. The sub-regions are tiled to cover the entire visual field. These cells act as local filters over the input space and are well-suited to exploit the strong spatially local correlation present in natural images.
Additionally, two basic cell types have been identified: Simple cells respond maximally to specific edge-like patterns within their receptive field. Complex cells have larger receptive fields and are locally invariant to the exact position of the pattern.
The animal visual cortex being the most powerful visual processing system in existence, it seems natural to emulate its behavior. Hence, many neurally-inspired models can be found in the literature.
CNNs exploit spatially-local correlation by enforcing a local connectivity pattern between neurons of adjacent layers. In other words, the inputs of hidden units in layer m are from a subset of units in layer m-1, units that have spatially contiguous receptive fields. We can illustrate this graphically as follows:
Imagine that layer m-1 is the input retina. In the above figure, units in layer m have receptive fields of width 3 in the input retina and are thus only connected to 3 adjacent neurons in the retina layer. Units in layer m+1 have a similar connectivity with the layer below. We say that their receptive field with respect to the layer below is also 3, but their receptive field with respect to the input is larger (5). Each unit is unresponsive to variations outside of its receptive field with respect to the retina. The architecture thus ensures that the learnt "filters" produce the strongest response to a spatially local input pattern.
However, as shown above, stacking many such layers leads to (non-linear) "filters" that become increasingly "global" (i.e. responsive to a larger region of pixel space). For example, the unit in hidden layer m+1 can encode a non-linear feature of width 5 (in terms of pixel space).
In addition, in CNNs, each filter $h_i$ is replicated across the entire visual field. These replicated units share the same parameterization (weight vector and bias) and form a feature map.
In the above figure, we show 3 hidden units belonging to the same feature map. Weights of the same color are shared, that is, constrained to be identical. Gradient descent can still be used to learn such shared parameters, with only a small change to the original algorithm: the gradient of a shared weight is simply the sum of the gradients of the parameters being shared.
Replicating units in this way allows for features to be detected regardless of their position in the visual field. Additionally, weight sharing increases learning efficiency by greatly reducing the number of free parameters being learnt. The constraints on the model enable CNNs to achieve better generalization on vision problems.
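A rough, back-of-the-envelope comparison for this notebook's 28x28 MNIST images (the layer sizes here are chosen purely for illustration):
dense_params = 28 * 28 * 625     # fully-connected: every pixel to 625 hidden units
conv_params = 32 * 1 * 3 * 3     # 32 shared 3x3 filters over a single input channel
print(dense_params, conv_params) # 490000 vs. 288 learned weights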
A feature map is obtained by repeated application of a function across sub-regions of the entire image, in other words, by convolution of the input image with a linear filter, adding a bias term and then applying a non-linear function. If we denote the k-th feature map at a given layer as $h^k$, whose filters are determined by the weights $W^k$ and bias $b_k$, then the feature map $h^k$ is obtained as follows (for $tanh$ non-linearities):
$$ h^k_{ij} = \tanh\left( (W^k * x)_{ij} + b_k \right) $$
Note
Recall the following definition of convolution for a 1D signal:
$$ o[n] = f[n] * g[n] = \sum_{u=-\infty}^{\infty} f[u]\, g[n-u] = \sum_{u=-\infty}^{\infty} f[n-u]\, g[u]. $$
This can be extended to 2D as follows:
$$ o[m,n] = f[m,n] * g[m,n] = \sum_{u=-\infty}^{\infty} \sum_{v=-\infty}^{\infty} f[u,v]\, g[m-u, n-v]. $$
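As a quick sanity check of the 1D definition (toy signals, not part of the original notebook), the sum above matches what np.convolve computes:
f = np.array([1., 2., 3.])
g = np.array([0., 1., 0.5])
manual = np.array([sum(f[u] * g[n - u]
                       for u in range(len(f)) if 0 <= n - u < len(g))
                   for n in range(len(f) + len(g) - 1)])
print(manual)              # [ 0.   1.   2.5  4.   1.5]
print(np.convolve(f, g))   # identical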
To form a richer representation of the data, each hidden layer is composed of multiple feature maps, $\{h^{(k)}, k=0..K\}$. The weights $W$ of a hidden layer can be represented in a 4D tensor containing elements for every combination of destination feature map, source feature map, source vertical position, and source horizontal position. The biases $b$ can be represented as a vector containing one element for every destination feature map. We illustrate this graphically as follows:
Figure 1: example of a convolutional layer
The figure shows two layers of a CNN. Layer m-1 contains four feature maps. Hidden layer m contains two feature maps ($h^0$ and $h^1$). Pixels (neuron outputs) in $h^0$ and $h^1$ (outlined as blue and red squares) are computed from pixels of layer (m-1) which fall within their 2x2 receptive field in the layer below (shown as colored rectangles). Notice how the receptive field spans all four input feature maps. The weights $W^0$ and $W^1$ of $h^0$ and $h^1$ are thus 3D weight tensors. The leading dimension indexes the input feature maps, while the other two refer to the pixel coordinates.
Putting it all together, $W^{kl}_{ij}$ denotes the weight connecting each pixel of the k-th feature map at layer m, with the pixel at coordinates (i,j) of the l-th feature map of layer (m-1).
ConvOp is the main workhorse for implementing a convolutional layer in Theano. It is used by theano.tensor.nnet.conv2d (the variant imported later in this notebook), which takes two symbolic inputs (a shape check follows the list below):
a 4D tensor corresponding to a mini-batch of input images. The shape of the tensor is as follows: [mini-batch size, number of input feature maps, image height, image width].
a 4D tensor corresponding to the weight matrix $W$. The shape of the tensor is: [number of feature maps at layer m, number of feature maps at layer m-1, filter height, filter width]
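A hedged shape check with toy data (the sizes are chosen to match this notebook's first convolutional layer): a mini-batch of two single-channel 28x28 images convolved with 32 filters of size 3x3.
from theano.tensor.nnet import conv2d
X_sym = T.ftensor4()
W_sym = T.ftensor4()
shape_fn = theano.function([X_sym, W_sym], conv2d(X_sym, W_sym).shape,
                           allow_input_downcast=True)
images = np.random.rand(2, 1, 28, 28)    # [batch, input maps, height, width]
filters = np.random.rand(32, 1, 3, 3)    # [output maps, input maps, filter h, filter w]
print(shape_fn(images, filters))         # -> [ 2 32 26 26]  (default 'valid' mode)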
Another important concept of CNNs is max-pooling, which is a form of non-linear down-sampling. Max-pooling partitions the input image into a set of non-overlapping rectangles and, for each such sub-region, outputs the maximum value.
Max-pooling is useful in vision for two reasons:
By eliminating non-maximal values, it reduces computation for upper layers.
It provides a form of translation invariance. Imagine cascading a max-pooling layer with a convolutional layer. There are 8 directions in which one can translate the input image by a single pixel. If max-pooling is done over a 2x2 region, 3 out of these 8 possible configurations will produce exactly the same output at the convolutional layer. For max-pooling over a 3x3 window, this jumps to 5/8.
Since it provides additional robustness to position, max-pooling is a "smart" way of reducing the dimensionality of intermediate representations.
Max-pooling is done in Theano by way of theano.tensor.signal.downsample.max_pool_2d. This function takes as input an N-dimensional tensor (where N >= 2) and a downscaling factor, and performs max-pooling over the 2 trailing dimensions of the tensor.
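A small illustration (toy input, not part of the original notebook) of 2x2 max-pooling on a single 4x4 image: each non-overlapping 2x2 block is replaced by its maximum.
from theano.tensor.signal.downsample import max_pool_2d
x_sym = T.ftensor4()
pool_fn = theano.function([x_sym], max_pool_2d(x_sym, (2, 2), ignore_border=True),
                          allow_input_downcast=True)
img = np.arange(16).reshape(1, 1, 4, 4)   # one image, one channel
print(pool_fn(img)[0, 0])                 # [[  5.   7.]
                                          #  [ 13.  15.]]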
Sparse, convolutional layers and max-pooling are at the heart of the LeNet family of models. While the exact details of the model will vary greatly, the figure below shows a graphical depiction of a LeNet model.
The lower layers are composed of alternating convolution and max-pooling layers. The upper layers, however, are fully connected and correspond to a traditional MLP (hidden layer + logistic regression). The input to the first fully-connected layer is the set of all feature maps at the layer below.
From an implementation point of view, this means lower-layers operate on 4D tensors. These are then flattened to a 2D matrix of rasterized feature maps, to be compatible with our previous MLP implementation.
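To make that 4D-to-2D transition concrete, here is a rough shape walk-through for the network defined below (a sketch assuming 28x28 inputs, 3x3 filters, 2x2 pooling with the partial border kept, and border_mode='full' on the first layer only):
def conv_out(size, filt=3, mode='valid'):
    return size + filt - 1 if mode == 'full' else size - filt + 1

def pool_out(size, ds=2):
    return (size + ds - 1) // ds          # ignore_border=False keeps the partial edge

s = pool_out(conv_out(28, mode='full'))   # layer 1: 28 -> 30 -> 15
s = pool_out(conv_out(s))                 # layer 2: 15 -> 13 -> 7
s = pool_out(conv_out(s))                 # layer 3:  7 ->  5 -> 3
print(128 * s * s)                        # 1152, i.e. the 128 * 3 * 3 used for w4 below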
In [38]:
# from theano.tensor.nnet.conv import conv2d
from theano.tensor.nnet import conv2d
from theano.tensor.signal.downsample import max_pool_2d
Define the convolutional network model (the dropout, rectify, and softmax functions defined above are reused unchanged)
In [39]:
def model(X, w, w2, w3, w4, w_o, p_drop_conv, p_drop_hidden):
    l1a = rectify(conv2d(X, w, border_mode='full'))
    l1 = max_pool_2d(l1a, (2, 2))
    l1 = dropout(l1, p_drop_conv)

    l2a = rectify(conv2d(l1, w2))
    l2 = max_pool_2d(l2a, (2, 2))
    l2 = dropout(l2, p_drop_conv)

    l3a = rectify(conv2d(l2, w3))
    l3b = max_pool_2d(l3a, (2, 2))
    # convert from 4-tensor to a normal matrix
    l3 = T.flatten(l3b, outdim=2)
    l3 = dropout(l3, p_drop_conv)

    l4 = rectify(T.dot(l3, w4))
    l4 = dropout(l4, p_drop_hidden)

    pyx = softmax(T.dot(l4, w_o))
    return l1, l2, l3, l4, pyx
Reshape the images into the 4-tensor format that conv2d expects: (batch, channel, row, column)
In [40]:
X_train2 = X_train.reshape(-1, 1, 28, 28)
X_test2 = X_test.reshape(-1, 1, 28, 28)
In [41]:
# now 4tensor for conv instead of matrix
X = T.ftensor4()
Y = T.fmatrix()
In [42]:
w = init_weights((32, 1, 3, 3))
w2 = init_weights((64, 32, 3, 3))
w3 = init_weights((128, 64, 3, 3))
w4 = init_weights((128 * 3 * 3, 625))
w_o = init_weights((625, 10))
In [43]:
noise_l1, noise_l2, noise_l3, noise_l4, noise_py_x = model(X, w, w2, w3, w4, w_o, 0.2, 0.5)
l1, l2, l3, l4, py_x = model(X, w, w2, w3, w4, w_o, 0., 0.)
y_x = T.argmax(py_x, axis=1)
cost = T.mean(T.nnet.categorical_crossentropy(noise_py_x, Y))
params = [w, w2, w3, w4, w_o]
updates = RMSprop(cost, params, lr=0.001)
train = theano.function(inputs=[X, Y], outputs=cost, updates=updates, allow_input_downcast=True)
predict = theano.function(inputs=[X], outputs=y_x, allow_input_downcast=True)
In [45]:
t0 = dt.datetime.now()
errors = []
for i in range(100):
    t1 = dt.datetime.now()
    for start, end in zip(range(0, X_train.shape[0], 128),
                          range(128, X_train.shape[0], 128)):
        cost = train(X_train2[start:end], y_train2[start:end])
    errors.append((np.mean(y_train != predict(X_train2)),
                   np.mean(y_test != predict(X_test2))))
    print(i, errors[-1])
    print('Current iter time: ', (dt.datetime.now()-t1).seconds / 60.)
print('Total time: ', (dt.datetime.now()-t0).seconds / 60.)
In [46]:
print('Total time: ', (dt.datetime.now()-t0).seconds / 60.)
In [47]:
res = np.array(errors)
plt.plot(np.arange(res.shape[0]), res[:, 0], label='train error')
plt.plot(np.arange(res.shape[0]), res[:, 1], label='test error')
plt.legend()
Out[47]: