Sebastian Raschka, 2015

Python Machine Learning

Chapter 13 - Parallelizing Neural Network Training with Theano

Note that the optional watermark extension is a small IPython notebook plugin that I developed to make the code reproducible. You can just skip the following line(s).


In [1]:
%load_ext watermark
%watermark -a 'Sebastian Raschka' -u -d -v -p numpy,matplotlib,theano,keras


Sebastian Raschka 
Last updated: 08/27/2015 

CPython 3.4.3
IPython 4.0.0

numpy 1.9.2
matplotlib 1.4.3
theano 0.7.0
keras 0.1.2

In [2]:
# to install watermark just uncomment the following line:
#%install_ext https://raw.githubusercontent.com/rasbt/watermark/master/watermark.py



Building, compiling, and running expressions with Theano

Depending on your system setup, it is typically sufficient to install Theano via

pip install Theano

For more help with the installation, please see: http://deeplearning.net/software/theano/install.html



First steps with Theano

Introducing Theano's TensorType variables. For a complete list of the fully typed constructors, see http://deeplearning.net/software/theano/library/tensor/basic.html#all-fully-typed-constructors
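
A quick illustrative sketch (not tied to the book example) of a few of these constructors; the d and f prefixes fix the dtype to float64 and float32, respectively, while the unprefixed variants use theano.config.floatX:

import theano
import theano.tensor as T

a = T.dscalar('a')   # 0-d array (scalar), float64
b = T.fvector('b')   # 1-d array (vector), float32
C = T.dmatrix('C')   # 2-d array (matrix), float64
D = T.tensor3('D')   # 3-d array, dtype = theano.config.floatX
E = T.matrix('E')    # 2-d array, dtype = theano.config.floatX

print(a.type, b.type, C.type, D.type, E.type)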


In [3]:
import theano
from theano import tensor as T

In [4]:
# initialize
x1 = T.scalar()
w1 = T.scalar()
w0 = T.scalar()
z1 = w1 * x1 + w0

# compile
net_input = theano.function(inputs=[w1, x1, w0], outputs=z1)

# execute
net_input(2.0, 1.0, 0.5)


Out[4]:
array(2.5)



Configuring Theano


In [5]:
print(theano.config.floatX)


float64

In [6]:
theano.config.floatX = 'float32'

To change the float type globally, execute

export THEANO_FLAGS=floatX=float32 

in your bash shell, or execute a Python script as

THEANO_FLAGS=floatX=float32 python your_script.py

Running Theano on GPU(s). For prerequisites, please see: http://deeplearning.net/software/theano/tutorial/using_gpu.html

Note that float32 is recommended for GPUs; float64 on GPUs is currently still relatively slow.


In [7]:
print(theano.config.device)


cpu

You can run a Python script on CPU via:

THEANO_FLAGS=device=cpu,floatX=float64 python your_script.py

or GPU via

THEANO_FLAGS=device=gpu,floatX=float32 python your_script.py

It may also be convenient to create a .theanorc file in your home directory to make those configurations permanent. For example, to always use float32, execute

echo -e "\n[global]\nfloatX=float32\n" >> ~/.theanorc

Or, create a .theanorc file manually with the following contents

[global]
floatX = float32
device = gpu



Working with array structures


In [13]:
import numpy as np

# initialize
x = T.fmatrix(name='x')
x_sum = T.sum(x, axis=0)

# compile
calc_sum = theano.function(inputs=[x], outputs=x_sum)

# execute (Python list)
ary = [[1, 2, 3], [1, 2, 3]]
print('Column sum:', calc_sum(ary))

# execute (NumPy array)
ary = np.array([[1, 2, 3], [1, 2, 3]], dtype=theano.config.floatX)
print('Column sum:', calc_sum(ary))


Column sum: [ 2.  4.  6.]
Column sum: [ 2.  4.  6.]

Updating shared arrays. More info about memory management in Theano can be found here: http://deeplearning.net/software/theano/tutorial/aliasing.html


In [9]:
# initialize
x = T.fmatrix(name='x')
w = theano.shared(np.asarray([[0.0, 0.0, 0.0]], 
                             dtype=theano.config.floatX))
z = x.dot(w.T)
update = [[w, w + 1.0]]

# compile
net_input = theano.function(inputs=[x], 
                            updates=update, 
                            outputs=z)

# execute
data = np.array([[1, 2, 3]], dtype=theano.config.floatX)
for i in range(5):
    print('z%d:' % i, net_input(data))


z0: [[ 0.]]
z1: [[ 6.]]
z2: [[ 12.]]
z3: [[ 18.]]
z4: [[ 24.]]

We can use the givens parameter to insert values into the graph before compiling it. Using this approach, we can reduce the number of transfers from RAM (via the CPU) to the GPU and thus speed up learning with shared variables. If we use inputs, the dataset is transferred from the CPU to the GPU multiple times, for example, if we iterate over a dataset multiple times (epochs) during gradient descent. Via givens, we can keep the dataset on the GPU if it fits into memory (e.g., a mini-batch); a mini-batch variant is sketched right after the following example.


In [10]:
# initialize
data = np.array([[1, 2, 3]], 
                dtype=theano.config.floatX)
x = T.fmatrix(name='x')
w = theano.shared(np.asarray([[0.0, 0.0, 0.0]], 
                             dtype=theano.config.floatX))
z = x.dot(w.T)
update = [[w, w + 1.0]]

# compile
net_input = theano.function(inputs=[], 
                            updates=update, 
                            givens={x: data},
                            outputs=z)

# execute
for i in range(5):
    print('z:', net_input())


z: [[ 0.]]
z: [[ 6.]]
z: [[ 12.]]
z: [[ 18.]]
z: [[ 24.]]
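
The same givens mechanism is commonly combined with a symbolic index so that only the current mini-batch slice of a shared dataset is used at each call; the following is a minimal sketch of that pattern (the dataset, batch size, and variable names are illustrative and not part of the book example):

# minimal sketch: mini-batch access via a shared dataset and givens
batch_size = 2
data = theano.shared(np.asarray([[1, 2, 3],
                                 [4, 5, 6],
                                 [7, 8, 9],
                                 [10, 11, 12]],
                                dtype=theano.config.floatX))
idx = T.lscalar('idx')  # symbolic mini-batch index
x = T.fmatrix(name='x')
w = theano.shared(np.asarray([[1.0, 1.0, 1.0]],
                             dtype=theano.config.floatX))
z = x.dot(w.T)

# only the requested slice of the shared dataset is substituted for x
net_input_batch = theano.function(
    inputs=[idx],
    outputs=z,
    givens={x: data[idx * batch_size:(idx + 1) * batch_size]})

for i in range(2):
    print('batch %d:' % i, net_input_batch(i))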



Wrapping things up: A linear regression example

Creating some training data.


In [20]:
import numpy as np
X_train = np.asarray([[0.0], [1.0], [2.0], [3.0], [4.0],
                      [5.0], [6.0], [7.0], [8.0], [9.0]], 
                     dtype=theano.config.floatX)

y_train = np.asarray([1.0, 1.3, 3.1, 2.0, 5.0, 
                      6.3, 6.6, 7.4, 8.0, 9.0], 
                     dtype=theano.config.floatX)

Implementing the training function.


In [21]:
import theano
from theano import tensor as T
import numpy as np

def train_linreg(X_train, y_train, eta, epochs):

    costs = []
    # Initialize arrays
    eta0 = T.fscalar('eta0')
    y = T.fvector(name='y') 
    X = T.fmatrix(name='X')   
    w = theano.shared(np.zeros(
                        shape=(X_train.shape[1] + 1),
                        dtype=theano.config.floatX),
                      name='w')
    
    # calculate cost
    net_input = T.dot(X, w[1:]) + w[0]
    errors = y - net_input
    cost = T.sum(T.pow(errors, 2)) 

    # perform gradient update
    gradient = T.grad(cost, wrt=w)
    update = [(w, w - eta0 * gradient)]

    # compile model
    train = theano.function(inputs=[eta0],
                            outputs=cost,
                            updates=update,
                            givens={X: X_train,
                                    y: y_train,})      
    
    for _ in range(epochs):
        costs.append(train(eta))
    
    return costs, w

Plotting the sum of squared errors (SSE) cost versus the number of epochs.


In [22]:
%matplotlib inline
import matplotlib.pyplot as plt

costs, w = train_linreg(X_train, y_train, eta=0.001, epochs=10)
   
plt.plot(range(1, len(costs)+1), costs)

plt.xlabel('Epoch')
plt.ylabel('Cost')
plt.tight_layout()
# plt.savefig('./figures/cost_convergence.png', dpi=300)
plt.show()


Making predictions.


In [23]:
def predict_linreg(X, w):
    Xt = T.matrix(name='X')
    net_input = T.dot(Xt, w[1:]) + w[0]
    predict = theano.function(inputs=[Xt], givens={w: w}, outputs=net_input)
    return predict(X)

plt.scatter(X_train, y_train, marker='s', s=50)
plt.plot(range(X_train.shape[0]), 
         predict_linreg(X_train, w), 
         color='gray', 
         marker='o', 
         markersize=4, 
         linewidth=3)

plt.xlabel('x')
plt.ylabel('y')

plt.tight_layout()
# plt.savefig('./figures/linreg.png', dpi=300)
plt.show()




Choosing activation functions for feedforward neural networks



Logistic function recap

The logistic function, often simply called the "sigmoid function," is in fact just one special case of the broader class of sigmoid functions.

Net input $z$: $$z = w_1x_{1} + \dots + w_mx_{m} = \sum_{j=1}^{m} x_{j}w_{j} = \mathbf{w}^T\mathbf{x}$$

Logistic activation function:

$$\phi_{logistic}(z) = \frac{1}{1 + e^{-z}}$$

Output range: (0, 1)


In [29]:
# note that the first feature (X[0, 0] = 1) denotes the bias unit

X = np.array([[1, 1.4, 1.5]])
w = np.array([0.0, 0.2, 0.4])

def net_input(X, w):
    z = X.dot(w)
    return z

def logistic(z):
    return 1.0 / (1.0 + np.exp(-z))

def logistic_activation(X, w):
    z = net_input(X, w)
    return logistic(z)

print('P(y=1|x) = %.3f' % logistic_activation(X, w)[0])


P(y=1|x) = 0.707

Now, imagine an MLP with 3 hidden units plus 1 bias unit in the hidden layer. The output layer consists of 3 output units.


In [30]:
# W : array, shape = [n_output_units, n_hidden_units+1]
#          Weight matrix for hidden layer -> output layer.
# note that the first column (W[:, 0]) contains the bias weights
W = np.array([[1.1, 1.2, 1.3, 0.5],
              [0.1, 0.2, 0.4, 0.1],
              [0.2, 0.5, 2.1, 1.9]])

# A : array, shape = [n_hidden_units+1, n_samples]
#          Activation of hidden layer.
# note that the first element (A[0, 0] = 1) is the bias unit

A = np.array([[1.0], 
              [0.1], 
              [0.3], 
              [0.7]])

# Z : array, shape = [n_output_units, n_samples]
#          Net input of output layer.

Z = W.dot(A) 
y_probas = logistic(Z)
print('Probabilities:\n', y_probas)


Probabilities:
 [[ 0.87653295]
 [ 0.57688526]
 [ 0.90114393]]

In [31]:
y_class = np.argmax(Z, axis=0)
print('predicted class label: %d' % y_class[0])


predicted class label: 2



Estimating probabilities in multi-class classification via the softmax function

The softmax function is a generalization of the logistic function that allows us to compute meaningful class probabilities in multi-class settings (multinomial logistic regression). Its input is the result of K distinct linear functions, and the predicted probability of the j-th class given a sample vector x is:

$$P(y=j|z) = \phi_{softmax}(z) = \frac{e^{z_j}}{\sum_{k=1}^{K} e^{z_k}}$$

Output range: (0, 1)


In [32]:
def softmax(z): 
    return np.exp(z) / np.sum(np.exp(z))

def softmax_activation(X, w):
    z = net_input(X, w)
    return softmax(z)

In [33]:
y_probas = softmax(Z)
print('Probabilities:\n', y_probas)


Probabilities:
 [[ 0.40386493]
 [ 0.07756222]
 [ 0.51857284]]

In [34]:
y_probas.sum()


Out[34]:
1.0

In [35]:
y_class = np.argmax(Z, axis=0)
y_class


Out[35]:
array([2])
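
Since np.exp overflows for large net inputs, softmax is usually computed in a numerically more stable form by subtracting the maximum of z before exponentiating; the constant cancels between numerator and denominator, so the result is unchanged. A minimal sketch of this variant (not part of the book example):

def softmax_stable(z):
    # subtracting max(z) prevents overflow in np.exp;
    # the shift cancels between numerator and denominator
    e = np.exp(z - np.max(z))
    return e / np.sum(e)

print(softmax_stable(Z))                            # same probabilities as above
print(softmax_stable(np.array([1000.0, 1000.0])))   # the naive version would overflow here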



Broadening the output spectrum using a hyperbolic tangent

The hyperbolic tangent (tanh) is another sigmoid-shaped function; it can be interpreted as a rescaled version of the logistic function.

$$\phi_{tanh}(z) = \frac{e^{z}-e^{-z}}{e^{z}+e^{-z}}$$

Output range: (-1, 1)


In [36]:
def tanh(z):
    e_p = np.exp(z) 
    e_m = np.exp(-z)
    return (e_p - e_m) / (e_p + e_m)

In [37]:
import matplotlib.pyplot as plt
%matplotlib inline

z = np.arange(-5, 5, 0.005)
log_act = logistic(z)
tanh_act = tanh(z)

# alternatives:
# from scipy.special import expit
# log_act = expit(z)
# tanh_act = np.tanh(z)

plt.ylim([-1.5, 1.5])
plt.xlabel('net input $z$')
plt.ylabel('activation $\phi(z)$')
plt.axhline(1, color='black', linestyle='--')
plt.axhline(0.5, color='black', linestyle='--')
plt.axhline(0, color='black', linestyle='--')
plt.axhline(-1, color='black', linestyle='--')

plt.plot(z, tanh_act, 
         linewidth=2, 
         color='black', 
         label='tanh')
plt.plot(z, log_act, 
         linewidth=2, 
         color='lightgreen', 
         label='logistic')

plt.legend(loc='lower right')
plt.tight_layout()
# plt.savefig('./figures/activation.png', dpi=300)
plt.show()




Keras

Loading MNIST

1) Download the four MNIST dataset files from http://yann.lecun.com/exdb/mnist/

  • train-images-idx3-ubyte.gz: training set images (9912422 bytes)
  • train-labels-idx1-ubyte.gz: training set labels (28881 bytes)
  • t10k-images-idx3-ubyte.gz: test set images (1648877 bytes)
  • t10k-labels-idx1-ubyte.gz: test set labels (4542 bytes)

2) Unzip those files (a Python-based alternative is sketched after this list)

3) Copy the unzipped files to a directory ./mnist
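
If you prefer to do the unzipping from Python rather than on the command line, the following sketch (assuming the four .gz archives are in the current working directory) decompresses them straight into ./mnist using only the standard library:

import gzip
import os
import shutil

os.makedirs('mnist', exist_ok=True)
for name in ('train-images-idx3-ubyte', 'train-labels-idx1-ubyte',
             't10k-images-idx3-ubyte', 't10k-labels-idx1-ubyte'):
    with gzip.open(name + '.gz', 'rb') as f_in, \
         open(os.path.join('mnist', name), 'wb') as f_out:
        shutil.copyfileobj(f_in, f_out)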


In [33]:
import os
import struct
import numpy as np
 
def load_mnist(path, kind='train'):
    """Load MNIST data from `path`"""
    labels_path = os.path.join(path, 
                               '%s-labels-idx1-ubyte' 
                                % kind)
    images_path = os.path.join(path, 
                               '%s-images-idx3-ubyte' 
                               % kind)
        
    with open(labels_path, 'rb') as lbpath:
        magic, n = struct.unpack('>II', 
                                 lbpath.read(8))
        labels = np.fromfile(lbpath, 
                             dtype=np.uint8)

    with open(images_path, 'rb') as imgpath:
        magic, num, rows, cols = struct.unpack(">IIII", 
                                               imgpath.read(16))
        images = np.fromfile(imgpath, 
                             dtype=np.uint8).reshape(len(labels), 784)
 
    return images, labels

In [34]:
X_train, y_train = load_mnist('mnist', kind='train')
print('Rows: %d, columns: %d' % (X_train.shape[0], X_train.shape[1]))


Rows: 60000, columns: 784

In [35]:
X_test, y_test = load_mnist('mnist', kind='t10k')
print('Rows: %d, columns: %d' % (X_test.shape[0], X_test.shape[1]))


Rows: 10000, columns: 784
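
Each of the 784 columns is one pixel of a flattened 28x28 image. As a quick sanity check (illustrative; not part of the book code in this chapter), a row can be reshaped back into its image and displayed:

%matplotlib inline
import matplotlib.pyplot as plt

# reshape the first flattened 784-pixel row back into a 28x28 image
img = X_train[0].reshape(28, 28)
plt.imshow(img, cmap='Greys', interpolation='nearest')
plt.title('label: %d' % y_train[0])
plt.show()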

Multi-layer Perceptron in Keras

Once you have Theano installed, Keras can be installed via

pip install Keras

In order to run the following code via GPU, you can execute the Python script that was placed in this directory via

THEANO_FLAGS=mode=FAST_RUN,device=gpu,floatX=float32 python mnist_keras_mlp.py

In [43]:
import theano 

theano.config.floatX = 'float32'
X_train = X_train.astype(theano.config.floatX)
X_test = X_test.astype(theano.config.floatX)

One-hot encoding of the class variable:


In [38]:
from keras.utils import np_utils

print('First 3 labels: ', y_train[:3])

y_train_ohe = np_utils.to_categorical(y_train) 
print('\nFirst 3 labels (one-hot):\n', y_train_ohe[:3])


First 3 labels:  [5 0 4]

First 3 labels (one-hot):
 [[ 0.  0.  0.  0.  0.  1.  0.  0.  0.  0.]
 [ 1.  0.  0.  0.  0.  0.  0.  0.  0.  0.]
 [ 0.  0.  0.  0.  1.  0.  0.  0.  0.  0.]]
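
To make explicit what to_categorical computes, an equivalent one-hot encoding can be sketched directly in NumPy (illustrative only; the example itself uses the Keras helper):

# illustrative NumPy equivalent of the one-hot encoding above
n_classes = int(np.max(y_train)) + 1
ohe = np.zeros((y_train.shape[0], n_classes))
ohe[np.arange(y_train.shape[0]), y_train] = 1.0
print(ohe[:3])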

In [49]:
from keras.models import Sequential
from keras.layers.core import Dense
from keras.optimizers import SGD

np.random.seed(1) 

model = Sequential()
model.add(Dense(input_dim=X_train.shape[1], 
                output_dim=50, 
                init='uniform', 
                activation='tanh'))

model.add(Dense(input_dim=50, 
                output_dim=50, 
                init='uniform', 
                activation='tanh'))

model.add(Dense(input_dim=50, 
                output_dim=y_train_ohe.shape[1], 
                init='uniform', 
                activation='softmax'))

sgd = SGD(lr=0.001, decay=1e-7, momentum=.9)
model.compile(loss='categorical_crossentropy', optimizer=sgd)

model.fit(X_train, y_train_ohe, 
          nb_epoch=50, 
          batch_size=300, 
          verbose=1, 
          validation_split=0.1, 
          show_accuracy=True)


Train on 54000 samples, validate on 6000 samples
Epoch 0
54000/54000 [==============================] - 1s - loss: 2.2290 - acc: 0.3592 - val_loss: 2.1094 - val_acc: 0.5342
Epoch 1
54000/54000 [==============================] - 1s - loss: 1.8850 - acc: 0.5279 - val_loss: 1.6098 - val_acc: 0.5617
Epoch 2
54000/54000 [==============================] - 1s - loss: 1.3903 - acc: 0.5884 - val_loss: 1.1666 - val_acc: 0.6707
Epoch 3
54000/54000 [==============================] - 1s - loss: 1.0592 - acc: 0.6936 - val_loss: 0.8961 - val_acc: 0.7615
Epoch 4
54000/54000 [==============================] - 1s - loss: 0.8528 - acc: 0.7666 - val_loss: 0.7288 - val_acc: 0.8290
Epoch 5
54000/54000 [==============================] - 1s - loss: 0.7187 - acc: 0.8191 - val_loss: 0.6122 - val_acc: 0.8603
Epoch 6
54000/54000 [==============================] - 1s - loss: 0.6278 - acc: 0.8426 - val_loss: 0.5347 - val_acc: 0.8762
Epoch 7
54000/54000 [==============================] - 1s - loss: 0.5592 - acc: 0.8621 - val_loss: 0.4707 - val_acc: 0.8920
Epoch 8
54000/54000 [==============================] - 1s - loss: 0.4978 - acc: 0.8751 - val_loss: 0.4288 - val_acc: 0.9033
Epoch 9
54000/54000 [==============================] - 1s - loss: 0.4583 - acc: 0.8847 - val_loss: 0.3935 - val_acc: 0.9035
Epoch 10
54000/54000 [==============================] - 1s - loss: 0.4213 - acc: 0.8911 - val_loss: 0.3553 - val_acc: 0.9088
Epoch 11
54000/54000 [==============================] - 1s - loss: 0.3972 - acc: 0.8955 - val_loss: 0.3405 - val_acc: 0.9083
Epoch 12
54000/54000 [==============================] - 1s - loss: 0.3740 - acc: 0.9022 - val_loss: 0.3251 - val_acc: 0.9170
Epoch 13
54000/54000 [==============================] - 1s - loss: 0.3611 - acc: 0.9030 - val_loss: 0.3032 - val_acc: 0.9183
Epoch 14
54000/54000 [==============================] - 1s - loss: 0.3479 - acc: 0.9064 - val_loss: 0.2972 - val_acc: 0.9248
Epoch 15
54000/54000 [==============================] - 1s - loss: 0.3309 - acc: 0.9099 - val_loss: 0.2778 - val_acc: 0.9250
Epoch 16
54000/54000 [==============================] - 1s - loss: 0.3264 - acc: 0.9103 - val_loss: 0.2838 - val_acc: 0.9208
Epoch 17
54000/54000 [==============================] - 1s - loss: 0.3136 - acc: 0.9136 - val_loss: 0.2689 - val_acc: 0.9223
Epoch 18
54000/54000 [==============================] - 1s - loss: 0.3031 - acc: 0.9156 - val_loss: 0.2634 - val_acc: 0.9313
Epoch 19
54000/54000 [==============================] - 1s - loss: 0.2988 - acc: 0.9169 - val_loss: 0.2579 - val_acc: 0.9288
Epoch 20
54000/54000 [==============================] - 1s - loss: 0.2909 - acc: 0.9180 - val_loss: 0.2494 - val_acc: 0.9310
Epoch 21
54000/54000 [==============================] - 1s - loss: 0.2848 - acc: 0.9202 - val_loss: 0.2478 - val_acc: 0.9307
Epoch 22
54000/54000 [==============================] - 1s - loss: 0.2804 - acc: 0.9194 - val_loss: 0.2423 - val_acc: 0.9343
Epoch 23
54000/54000 [==============================] - 1s - loss: 0.2728 - acc: 0.9235 - val_loss: 0.2387 - val_acc: 0.9327
Epoch 24
54000/54000 [==============================] - 1s - loss: 0.2673 - acc: 0.9241 - val_loss: 0.2265 - val_acc: 0.9385
Epoch 25
54000/54000 [==============================] - 1s - loss: 0.2611 - acc: 0.9253 - val_loss: 0.2270 - val_acc: 0.9347
Epoch 26
54000/54000 [==============================] - 1s - loss: 0.2676 - acc: 0.9225 - val_loss: 0.2210 - val_acc: 0.9367
Epoch 27
54000/54000 [==============================] - 1s - loss: 0.2528 - acc: 0.9261 - val_loss: 0.2241 - val_acc: 0.9373
Epoch 28
54000/54000 [==============================] - 1s - loss: 0.2511 - acc: 0.9264 - val_loss: 0.2170 - val_acc: 0.9403
Epoch 29
54000/54000 [==============================] - 1s - loss: 0.2433 - acc: 0.9293 - val_loss: 0.2165 - val_acc: 0.9412
Epoch 30
54000/54000 [==============================] - 1s - loss: 0.2465 - acc: 0.9279 - val_loss: 0.2135 - val_acc: 0.9367
Epoch 31
54000/54000 [==============================] - 1s - loss: 0.2383 - acc: 0.9306 - val_loss: 0.2138 - val_acc: 0.9427
Epoch 32
54000/54000 [==============================] - 1s - loss: 0.2349 - acc: 0.9310 - val_loss: 0.2066 - val_acc: 0.9423
Epoch 33
54000/54000 [==============================] - 1s - loss: 0.2301 - acc: 0.9334 - val_loss: 0.2054 - val_acc: 0.9440
Epoch 34
54000/54000 [==============================] - 1s - loss: 0.2371 - acc: 0.9317 - val_loss: 0.1991 - val_acc: 0.9480
Epoch 35
54000/54000 [==============================] - 1s - loss: 0.2256 - acc: 0.9352 - val_loss: 0.1982 - val_acc: 0.9450
Epoch 36
54000/54000 [==============================] - 1s - loss: 0.2313 - acc: 0.9323 - val_loss: 0.2092 - val_acc: 0.9403
Epoch 37
54000/54000 [==============================] - 1s - loss: 0.2230 - acc: 0.9341 - val_loss: 0.1993 - val_acc: 0.9445
Epoch 38
54000/54000 [==============================] - 1s - loss: 0.2261 - acc: 0.9336 - val_loss: 0.1891 - val_acc: 0.9463
Epoch 39
54000/54000 [==============================] - 1s - loss: 0.2166 - acc: 0.9369 - val_loss: 0.1943 - val_acc: 0.9452
Epoch 40
54000/54000 [==============================] - 1s - loss: 0.2128 - acc: 0.9370 - val_loss: 0.1952 - val_acc: 0.9435
Epoch 41
54000/54000 [==============================] - 1s - loss: 0.2200 - acc: 0.9351 - val_loss: 0.1918 - val_acc: 0.9468
Epoch 42
54000/54000 [==============================] - 2s - loss: 0.2107 - acc: 0.9383 - val_loss: 0.1831 - val_acc: 0.9483
Epoch 43
54000/54000 [==============================] - 1s - loss: 0.2020 - acc: 0.9411 - val_loss: 0.1906 - val_acc: 0.9443
Epoch 44
54000/54000 [==============================] - 1s - loss: 0.2082 - acc: 0.9388 - val_loss: 0.1838 - val_acc: 0.9457
Epoch 45
54000/54000 [==============================] - 1s - loss: 0.2048 - acc: 0.9402 - val_loss: 0.1817 - val_acc: 0.9488
Epoch 46
54000/54000 [==============================] - 1s - loss: 0.2012 - acc: 0.9417 - val_loss: 0.1876 - val_acc: 0.9480
Epoch 47
54000/54000 [==============================] - 1s - loss: 0.1996 - acc: 0.9423 - val_loss: 0.1792 - val_acc: 0.9502
Epoch 48
54000/54000 [==============================] - 1s - loss: 0.1921 - acc: 0.9430 - val_loss: 0.1791 - val_acc: 0.9505
Epoch 49
54000/54000 [==============================] - 1s - loss: 0.1907 - acc: 0.9432 - val_loss: 0.1749 - val_acc: 0.9482
Out[49]:
<keras.callbacks.History at 0x10df582e8>

In [50]:
y_train_pred = model.predict_classes(X_train, verbose=0)
print('First 3 predictions: ', y_train_pred[:3])


First 3 predictions:  [5 0 4]

In [51]:
train_acc = np.sum(y_train == y_train_pred, axis=0) / X_train.shape[0]
print('Training accuracy: %.2f%%' % (train_acc * 100))


Training accuracy: 94.51%

In [53]:
y_test_pred = model.predict_classes(X_test, verbose=0)
test_acc = np.sum(y_test == y_test_pred, axis=0) / X_test.shape[0]
print('Test accuracy: %.2f%%' % (test_acc * 100))


Test accuracy: 94.39%
