Learners

In this section, we will introduce several pre-defined learners that learn from datasets by updating their weights to minimize a loss function. When using a learner on a machine learning problem, there are several standard steps, sketched in code after the list:

  • Learner initialization: Before training, the network usually needs to be initialized. There are several common choices for the initial weights: random initialization, setting the weights to zeros, or sampling them from a Gaussian distribution.

  • Optimizer specification: This means specifying the update rule for the learnable parameters of the network. The Adam optimizer is usually a good default choice.

  • Applying back-propagation: In neural networks, we commonly use back-propagation to propagate and compute the gradient information for each layer. Back-propagation needs to be integrated with the chosen optimizer in order to update the weights of the network properly in each epoch.

  • Iterations: Iterating over the forward and back-propagation process for the given number of epochs. Sometimes the iteration has to be stopped early (early stopping) to avoid overfitting.
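
To make these steps concrete, here is a minimal sketch of the training loop they describe, written with plain numpy rather than the learners in this repository; the toy data, loss, and learning rate are illustrative assumptions only.


In [ ]:
import numpy as np

# toy data: 100 examples with 4 features and a binary target (illustrative only)
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 4))
y = (X[:, 0] > 0).astype(float).reshape(-1, 1)

# 1. initialization: small random weights
w = rng.normal(scale=0.1, size=(4, 1))
# 2. optimizer: plain gradient descent with a fixed learning rate
learning_rate = 0.1

# 4. iterate over a fixed number of epochs
for epoch in range(100):
    y_hat = 1 / (1 + np.exp(-X @ w))      # forward pass with a sigmoid activation
    grad = X.T @ (y_hat - y) / len(X)     # 3. back-propagation: gradient of the loss w.r.t. w
    w -= learning_rate * grad             # weight update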

We will introduce several learners with different structures. Before that, we import all the necessary packages:


In [1]:
import os, sys
sys.path = [os.path.abspath("../../")] + sys.path
from deep_learning4e import *
from notebook4e import *
from learning4e import *


Using TensorFlow backend.

Perceptron Learner

Overview

The Perceptron is a linear classifier. It works the same way as a neural network with no hidden layers (just input and output). First, it trains its weights given a dataset and then it can classify a new item by running it through the network.

Its input layer consists of the item features, while the output layer consists of nodes (also called neurons). Each node in the output layer has n synapses, one for every item feature, each with its own weight. Each node computes the dot product of the item features and its synapse weights; these values then pass through an activation function (usually a sigmoid). Finally, we pick the largest of the output values and return its index.

Note that in classification problems each node represents a class. The final classification is the class/node with the max output value.

Below you can see a single node/neuron in the outer layer. With f we denote the item features, with w the synapse weights, then inside the node we have the dot product and the activation function, g.

[figure: a single perceptron node]
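
As a quick illustration of the diagram, a single output node just computes g(w · f). Here is a tiny numpy sketch with made-up feature and weight values, using a sigmoid as the activation g:


In [ ]:
import numpy as np

f = np.array([5.1, 3.5, 1.4, 0.2])    # item features (made-up values)
w = np.array([0.2, -0.1, 0.4, 0.3])   # synapse weights of one output node (made-up values)

def g(x):
    """Sigmoid activation function."""
    return 1 / (1 + np.exp(-x))

node_output = g(np.dot(w, f))          # dot product followed by the activation
print(node_output)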

Implementation

The perceptron learner is actually a neural network learner with no hidden layers, just an input layer connected directly to an output layer, which is pre-defined in the algorithm of perceptron_learner:


In [ ]:
raw_net = [InputLayer(input_size), DenseLayer(input_size, output_size)]

Here input_size and output_size are calculated from the dataset examples. In the perceptron learner, the gradient descent optimizer is used to update the weights of the network. We return a function predict which we will use later to classify a new item. The function computes the (algebraic) dot product of the item with the learned weights for each node in the output layer, then picks the greatest value and classifies the item in the corresponding class.
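
To make this concrete, the returned predict behaves roughly like the following sketch; the weight matrix here is a made-up stand-in, not the weights the learner actually computes:


In [ ]:
import numpy as np

def predict_sketch(example, weights):
    """Rough stand-in for the returned predict function:
    one weight vector per output node, a dot product each, then argmax."""
    scores = [np.dot(w, example) for w in weights]   # one score per class/node
    return int(np.argmax(scores))                    # index of the largest value = predicted class

# made-up weights for a 3-class problem with 4 features
weights = [[0.2, 0.1, -0.4, -0.3],
           [0.0, -0.1, 0.2, 0.1],
           [-0.2, -0.1, 0.5, 0.4]]
print(predict_sketch([5.1, 3.5, 1.4, 0.2], weights))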

Example

Let's try the perceptron learner on the iris dataset. First, let's convert the dataset's class names to numbers:


In [3]:
iris = DataSet(name="iris")
classes = ["setosa", "versicolor", "virginica"]
iris.classes_to_numbers(classes)

In [7]:
pl = perceptron_learner(iris, epochs=500, learning_rate=0.01, verbose=50)


epoch:50, total_loss:14.089098023560856
epoch:100, total_loss:12.439240091345326
epoch:150, total_loss:11.848151059704785
epoch:200, total_loss:11.283665595671044
epoch:250, total_loss:11.153290841913241
epoch:300, total_loss:11.00747536734494
epoch:350, total_loss:10.871093050365419
epoch:400, total_loss:10.838400319844233
epoch:450, total_loss:10.687417928867456
epoch:500, total_loss:10.650371951865573

We can see from the printed lines that the total loss converges to around 10.65. If we check the error ratio of the perceptron learner on the dataset after training, we will see it is much lower than random guessing (which would be about 0.67 for three balanced classes):


In [8]:
print(err_ratio(pl, iris))


0.046666666666666634

Let's test the trained learner with some test cases:


In [14]:
tests = [([5.0, 3.1, 0.9, 0.1], 0),
        ([5.1, 3.5, 1.0, 0.0], 0),
        ([4.9, 3.3, 1.1, 0.1], 0),
        ([6.0, 3.0, 4.0, 1.1], 1),
        ([6.1, 2.2, 3.5, 1.0], 1),
        ([5.9, 2.5, 3.3, 1.1], 1),
        ([7.5, 4.1, 6.2, 2.3], 2),
        ([7.3, 4.0, 6.1, 2.4], 2),
        ([7.0, 3.3, 6.1, 2.5], 2)]
print(grade_learner(pl, tests))


1

It seems the learner is correct on all the test examples.

Now let's try the perceptron learner on a more complicated dataset, MNIST, to see what the result will be. First, we load the data and append each label to its image vector to form the examples:


In [11]:
train_img, train_lbl, test_img, test_lbl = load_MNIST(path="../../aima-data/MNIST/Digits")
import numpy as np
import matplotlib.pyplot as plt
train_examples = [np.append(train_img[i], train_lbl[i]) for i in range(len(train_img))]
test_examples = [np.append(test_img[i], test_lbl[i]) for i in range(len(test_img))]
print("length of training dataset:", len(train_examples))
print("length of test dataset:", len(test_examples))


length of training dataset: 60000
length of test dataset: 10000

Now let's train the perceptron learner on the first 1000 examples of the dataset:


In [12]:
mnist = DataSet(examples=train_examples[:1000])
pl = perceptron_learner(mnist, epochs=10, verbose=1)


epoch:1, total_loss:423.8627535296463
epoch:2, total_loss:341.31697581698995
epoch:3, total_loss:328.98647291325443
epoch:4, total_loss:327.8999700915627
epoch:5, total_loss:310.081065570072
epoch:6, total_loss:268.5474616202945
epoch:7, total_loss:259.0999998773958
epoch:8, total_loss:259.09999987481393
epoch:9, total_loss:259.09999987211944
epoch:10, total_loss:259.0999998693056

In [21]:
print(err_ratio(pl, mnist))


0.893

It looks like we have an error ratio of nearly 90% on the training data after the network is trained on it. Next, we can investigate the model's performance on the test dataset, which it has never seen before:


In [33]:
test_mnist = DataSet(examples=test_examples[:100])
print(err_ratio(pl, test_mnist))


0.92

It seems a single-layer perceptron learner cannot capture the structure of the MNIST dataset. To improve accuracy, we may need not only more training epochs but also a more complicated network structure.

Neural Network Learner

Although there are many different types of neural networks, the dense neural network we implemented can be treated as stacked perceptron learners. Adding more layers to the perceptron network introduces non-linearity, so the model becomes more flexible when fitting complex data-target relations. However, this flexibility also increases the risk of overfitting.

By default, we use dense networks with two hidden layers. In our code, this is implemented as:


In [ ]:
# initialize the network
raw_net = [InputLayer(input_size)]
# add hidden layers
hidden_input_size = input_size
for h_size in hidden_layer_sizes:
    raw_net.append(DenseLayer(hidden_input_size, h_size))
    hidden_input_size = h_size
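# add the output layer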
raw_net.append(DenseLayer(hidden_input_size, output_size))

Here hidden_layer_sizes is a list of the sizes of the hidden layers, which can be specified by the user. The neural network learner uses gradient descent as its default optimizer, but the user can specify any optimizer when calling neural_net_learner. Another attribute that can be changed in neural_net_learner is batch_size, which controls the number of examples used in each round of updates. neural_net_learner also returns a predict function, which computes predictions by multiplying the inputs by the weights and applying the activation functions layer by layer.
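
For example, a call that sets these arguments explicitly might look like the sketch below; the layer sizes and batch size are arbitrary illustrative choices, and the keyword names are assumed to match the attributes described above:


In [ ]:
# hypothetical call showing the tunable arguments described above
nn_custom = neural_net_learner(iris, hidden_layer_sizes=[8, 8], learning_rate=0.1,
                               epochs=200, optimizer=gradient_descent, batch_size=10, verbose=50)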

Example

Let's also try neural_net_learner on the iris dataset:


In [4]:
nn = neural_net_learner(iris, epochs=100, learning_rate=0.15, optimizer=gradient_descent, verbose=10)


epoch:10, total_loss:15.931817841643683
epoch:20, total_loss:8.248422285412149
epoch:30, total_loss:6.102968668275
epoch:40, total_loss:5.463915043272969
epoch:50, total_loss:5.298986288420822
epoch:60, total_loss:4.032928400456889
epoch:70, total_loss:3.2628899927346855
epoch:80, total_loss:6.01336701367312
epoch:90, total_loss:5.412020420311795
epoch:100, total_loss:3.1044027319850773

Similarly, we check the model's performance on both the training set and the test cases:


In [6]:
print("error ration on training set:",err_ratio(nn, iris))


error ratio on training set: 0.033333333333333326

In [8]:
tests = [([5.0, 3.1, 0.9, 0.1], 0),
        ([5.1, 3.5, 1.0, 0.0], 0),
        ([4.9, 3.3, 1.1, 0.1], 0),
        ([6.0, 3.0, 4.0, 1.1], 1),
        ([6.1, 2.2, 3.5, 1.0], 1),
        ([5.9, 2.5, 3.3, 1.1], 1),
        ([7.5, 4.1, 6.2, 2.3], 2),
        ([7.3, 4.0, 6.1, 2.4], 2),
        ([7.0, 3.3, 6.1, 2.5], 2)]
print("accuracy on test set:",grade_learner(nn, tests))


accuracy on test set: 1

We can see that the error ratio on the training set is smaller than that of the perceptron learner. As the error ratio is relatively small, let's try the model on the MNIST dataset to see whether there will be a larger difference.


In [15]:
nn = neural_net_learner(mnist, epochs=100, verbose=10)


epoch:10, total_loss:89.0002153455983
epoch:20, total_loss:87.29675663038348
epoch:30, total_loss:86.29591779319225
epoch:40, total_loss:83.78091780128402
epoch:50, total_loss:82.17091581738829
epoch:60, total_loss:83.8434277386084
epoch:70, total_loss:83.55209905561495
epoch:80, total_loss:83.106898191118
epoch:90, total_loss:83.37041170165992
epoch:100, total_loss:82.57013813500876

In [16]:
print(err_ratio(nn, mnist))


0.784

Even after the model has converged, its error ratio on the training set is still high. We will introduce convolutional networks in the following chapters to see how they help improve accuracy on this dataset.