Notes on Chapter 2: Training Machine Learning Algorithms for Classification

Background

Two of the earliest algorithms developed for classification tasks were the perceptron and adaptive linear neuron algorithms. These algorithms are part of the early history of machine intelligence.

The idea of artificial neurons was (according to Raschka) first developed by Warren McCulloch and Walter Pitts in 1943 – hence the name McCulloch-Pitts (MCP) neuron. McCulloch and Pitts were inspired by developments in cognitive biology (early evidence of biology's status as an inspiration to ML -JC).

The intuition behind the MCP neuron is that a computational classifier could have a similar structure to a (vastly oversimplified) neuron:

Input -\                              1
Input --\                            /
Input -----> function --> ?threshold?
Input --/                            \ 
Input -/                              0

In 1957, Frank Rosenblatt extended this idea to what he called the perceptron, a neuron that would learn the optimal weights to apply to its input "features."

Rosenblatt's perceptron

In contrast to the MCP neuron, the architecture of Rosenblatt's perceptron looks something like this:

Input  -\  <------------------------- error: update weights         1
Weight --\                                               |         /
Input -----> net input function --> activation function -|-> output
Weight --/                                                         \ 
Input  -/                                                           -1

Here, the net input function combines inputs with weights and produces a net input, which feeds into the activation function for classification: the activation function outputs 1 (positive) if the net input is at or above a given threshold and -1 (negative) otherwise. The perceptron also has an update rule that measures the classification error and adjusts the weights accordingly.

Formally, Rosenblatt's perceptron defines a weight vector $w$ and an input vector $x$:

$$ w = \left( \begin{array}{c} w_{1} \\ w_{2} \\ \vdots \\ w_{m} \end{array} \right), \quad x = \left( \begin{array}{c} x_{1} \\ x_{2} \\ \vdots \\ x_{m} \end{array} \right) $$

With an activation function:

$$\phi(z) = \bigg\{ \begin{array}{rr} 1 & \text{if } z \ge \theta \\ -1 & \text{otherwise} \end{array}$$

Where $\theta$ is the threshold (sometimes referred to as the bias) and $z$ is a linear combination (dot product) of the weight vector and the input vector:

$$z \; = \; w \cdot x \; = \; w_{1}x_{1} + w_{2}x_{2} + \; ... \; + w_{m}x_{m}$$

Hence, we can move $\theta$ to the left-hand side:

$$z \ge \theta \quad \Leftrightarrow \quad -\theta + z \ge 0$$

Include it as part of the inputs:

$$ w = \left( \begin{array}{c} -\theta \\ w_{1} \\ w_{2} \\ \vdots \\ w_{m} \end{array} \right), \quad x = \left( \begin{array}{c} 1 \\ x_{1} \\ x_{2} \\ \vdots \\ x_{m} \end{array} \right) $$

And redefine the activation function in terms of $0$:

$$\phi(z) = \bigg\{ \begin{array}{rr} 1 & \text{if } z \ge 0 \\ -1 & \text{otherwise} \end{array}$$
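With $\theta$ folded in as $w_{0} = -\theta$ (and $x_{0} = 1$), the net input expands to:

$$z \; = \; w \cdot x \; = \; w_{0}x_{0} + w_{1}x_{1} + \; ... \; + w_{m}x_{m} \; = \; -\theta + w_{1}x_{1} + \; ... \; + w_{m}x_{m}$$

so checking $z \ge 0$ is the same test as before.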

Updating the perceptron weights

The algorithm for updating the Rosenblatt perceptron weights is quite simple:

  1. Set the weights to 0, or small random numbers

  2. For each training vector $x^{(i)}$ (and its true class label $y^{(i)}$):

    1. Compute the output class label $\hat{y^{(i)}}$

    2. Update the weights based on the error, $y^{(i)} - \hat{y^{(i)}}$

Easy! So how do we update the weights? For each weight $w_{j}$, we add an adjustment term $\Delta w_{j}$ to the current value:

$$w_{j} := w_{j} + \Delta w_{j}$$

We can compute $\Delta w_{j}$ like so:

$$\Delta w_{j} = \eta \; (y^{(i)} - \hat{y^{(i)}}) \; x^{(i)}_{j}$$

Where $\eta$ is a constant learning rate coefficient on the interval $[0, 1]$, $y^{(i)} - \hat{y^{(i)}}$ is the error term, and $x^{(i)}_{j}$ is the element of the input vector at index $j$. Note that the coefficient $\eta \, (y^{(i)} - \hat{y^{(i)}})$ of the update is the same for every element of the input vector $x$, but each weight's update is scaled by the value of $x$ at index $j$, making it proportional to that value. This has the really nice property of adjusting the weights for larger values of $x_{j}$ more dramatically than for smaller values.
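To make the loop concrete, here's a minimal sketch of the whole training procedure in NumPy. The function name and defaults are my own, not from the book; labels are assumed to be in $\{-1, 1\}$ and the bias $-\theta$ is stored in `w[0]`:

```python
import numpy as np

def perceptron_fit(X, y, eta=0.01, n_epochs=10):
    """Train a perceptron on X (n_samples, n_features) with labels y in {-1, 1}."""
    # w[0] plays the role of -theta (the bias); the corresponding x_0 is 1
    w = np.zeros(1 + X.shape[1])
    for _ in range(n_epochs):
        for x_i, target in zip(X, y):
            # predicted class label y_hat for this sample
            y_hat = np.where(np.dot(x_i, w[1:]) + w[0] >= 0.0, 1, -1)
            update = eta * (target - y_hat)   # learning_rate * error
            w[1:] += update * x_i             # each weight scaled by its x_j
            w[0] += update                    # bias update (x_0 = 1)
    return w
```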

Note that the perceptron algorithm assumes our classes are linearly separable – that is, we can find some linear combination that cleanly divides the classes in two.

If the classes aren't linearly separable, we have two options:

  1. Define a maximum number of passes over the training data (epochs) after which we'll stop updating the weights
  2. Set a threshold of acceptable misclassification

These approaches aren't mutually exclusive, of course!

What if we have multiple classes?

Rosenblatt's perceptron is a binary classifier, meaning that it can only return two possible classifications: either a sample $x^{(i)}$ is part of a certain class (when $\phi(z) = 1$, i.e. $z \ge 0$) or it isn't. The perceptron cannot select from more than two classes, or tell us which of several classes a sample belongs to – at least in this implementation.

If we want to, we can use some clever ideas from set theory to extend the perceptron to multiple classes. The technique is called One-vs-All, and the intuition goes something like this:

  1. For each class $A$ in the set of classes $\omega$:

    1. Bind $A$ to the positive output of the classifier (a success – an output of 1) and bind all other classes $A^{C}$ to the negative output (-1)

    2. Classify the sample as $A$ or $A^{C}$ ("anything but $A$")

    3. Compute the confidence of the classification and store it

  2. Select the positive class label with the highest confidence, and apply that label to the sample

This technique extends a binary classifier to multi-class classification. Raschka doesn't go into much detail about the implementation (in particular, how to compute the confidence for any given class label), but it seems like we'll learn more about it later.
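Here's a rough sketch of what the prediction step might look like, assuming one perceptron weight vector per class and – since Raschka doesn't specify it – using the net input $z = w \cdot x$ as a stand-in for confidence:

```python
import numpy as np

def one_vs_all_predict(classifiers, x):
    """Pick the class whose binary classifier is most 'confident' about x.

    `classifiers` is assumed to be a list of (class_label, w) pairs, one
    perceptron weight vector per class, each trained with that class as +1
    and every other class as -1 (w[0] holds the bias term).
    """
    # Use the net input z = w . x as a rough confidence score -- an
    # assumption on my part; the book doesn't specify how to compute it.
    scores = [(label, np.dot(x, w[1:]) + w[0]) for label, w in classifiers]
    best_label, _ = max(scores, key=lambda pair: pair[1])
    return best_label
```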

A note on the decision surface

The decision surface of a binary classifier is the "line" (or plane, or hyperplane, depending on the dimension) that divides the two classes. In Rosenblatt's perceptron, the decision surface corresponds to a linear equation of the form

$$z = 0 \quad \Leftrightarrow \quad w \cdot x = 0 \quad \Leftrightarrow \quad w_{1}x_{1} + w_{2}x_{2} + \; ... \; + w_{m}x_{m} = 0$$

where the weight vector $w$ is no longer a variable, but is instead the optimal weight coefficients learned by the algorithm. Hence, the input vector $x$ will be the only variable, and we can plot the surface to see the dividing line that the perceptron has learned from the data.
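For example, with two features and the bias folded in as $w_{0}$, the decision surface is the set of points where

$$w_{0} + w_{1}x_{1} + w_{2}x_{2} = 0 \quad \Leftrightarrow \quad x_{2} = -\frac{w_{0} + w_{1}x_{1}}{w_{2}} \quad (w_{2} \ne 0)$$

which is just a line in the $x_{1}$–$x_{2}$ plane.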

Adaptive Linear Neurons (Adaline)

The idea of the Adaptive Linear Neuron – or "Adaline" for short – was first proposed (according to Raschka) by Bernard Widrow and Ted Hoff in 1960 as an improvement on Rosenblatt's perceptron. Most importantly, the Adaline rule introduces the concept of a cost function, differentiable with respect to the weights, that we can use to mathematically minimize our classifier's error.

The Adaline rule also redefines the activation function to separate the process of calculating error from the classification itself. Formally, the Adaline rule replaces the thresholded step (or Heaviside) function with a linear activation function. In the basic case, the linear activation function is just the identity function:

$$\phi(z) \; = \; \phi(w \cdot x) \; = \; w \cdot x$$

which makes the Adaline rule very similar to linear regression! But since our goal is to classify data and not to fit a regression line, we introduce a step function, or quantizer, after the linear activation function that takes its output and converts it into a class label. The neural structure then looks more like this:

Input -\  <------------------------------- error: update weights             1
Input --\                                                      |            /
Input -----> net input function --> linear activation function |-> quantizer
Input --/                                                                   \ 
Input -/                                                                    -1

With a continuous linear activation function, Adaline has an important advantage over the perceptron: its error term is now differentiable. We'll see why this is so important once we introduce cost functions.

Cost functions

In addition to the new, continuous linear activation function, Adaline introduces the notion of a cost function, $J$, that represents the model's error. While many cost functions could be used, Widrow and Hoff used the sum of squared differences between the actual and predicted values, so the cost function can be defined as

$$ J(w) = \frac{1}{2} \sum_{i} ( y^{(i)} - \phi(z^{(i)}))^{2} $$

Where $w$ is the weight vector, $y^{(i)}$ is the actual value of the $i$th sample, and $\phi(z^{(i)})$ is the predicted value of the $i$th sample.

Three things to note in this cost function:

  • Thanks to the squared sum, the Adaline error term is convex. This means that we can differentiate it and find a global cost minimum for $w$ through a process called gradient descent.
  • The cost is calculated as a sum over all samples $i$.
  • The sum is multiplied by $\frac{1}{2}$ to make differentiation cleaner – this factor scales the cost but doesn't change where its minimum lies.
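As a concrete (if hypothetical) illustration, the cost function is only a couple of lines of NumPy, assuming the bias is stored in `w[0]`:

```python
import numpy as np

def adaline_cost(w, X, y):
    """Sum-of-squared-errors cost J(w), as a rough sketch.

    w : weights with w[0] as the bias term
    X : (n_samples, n_features) input matrix
    y : (n_samples,) true values
    """
    output = np.dot(X, w[1:]) + w[0]      # linear activation: phi(z) = z
    errors = y - output                   # residuals y - phi(z)
    return 0.5 * np.sum(errors ** 2)      # the 1/2 just simplifies the derivative
```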

Gradient descent

Since we now have a convex and differentiable error term, we can use an algorithmic technique called gradient descent to find a global cost minimum that optimizes the weights $w$.

Widrow and Hoff's algorithm for gradient descent is similar to Rosenblatt's weight update rule. A crucial difference is that instead of defining $\Delta w_{j}$ one weight (and one sample) at a time, we define $\Delta w$ for the entire weight vector at once:

$$\Delta w = -\eta \nabla J(w)$$

Where $\eta$ is (again) the learning rate, and $- \nabla J(w)$ is the negative gradient of the cost function. The gradient is essentially the slope of a multivariate function: the vector of partial derivatives of $J$ with respect to each weight $w_{j} \in w$, where

$$\frac{\partial J}{\partial w_{j}} = - \sum_{i} ( y^{(i)} - \phi(z^{(i)}) ) \, x^{(i)}_{j}$$

This new $\Delta w$ gives us a slightly modified weight update rule:

$$w := w + \Delta w$$

Look familiar? The only difference between this update and the perceptron update is that $\Delta w$ is computed from the entire training set (and applied to all the weights) at once, rather than one sample at a time. For this reason, the Adaline weight update is called batch updating (or batch gradient descent).

The update delta $\Delta w$ is particularly cool in gradient descent because it defines the "velocity" with which the update will occur in terms of the slope of the error curve. When the curve is steep, we're probably far from the global minimum, so we update the weights more dramatically; but as we approach a slope of 0 we're getting closer and closer, so we slow down the updates.
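A minimal sketch of one batch update in NumPy might look like this (again, the names and the bias-in-`w[0]` convention are my own assumptions):

```python
import numpy as np

def adaline_gd_step(w, X, y, eta=0.01):
    """One batch gradient-descent update: w := w + Delta w, with
    Delta w = -eta * grad J(w) computed over the whole training set."""
    output = np.dot(X, w[1:]) + w[0]      # phi(z) for every sample at once
    errors = y - output                   # vector of residuals
    w[1:] += eta * np.dot(X.T, errors)    # -eta * dJ/dw_j for each feature j
    w[0] += eta * errors.sum()            # bias update (x_0 = 1)
    return w
```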

A brief aside on learning rates and feature scaling

Two important design decisions that need to be made when implementing a perceptron or Adaline model are:

  1. Selecting a good learning rate
  2. Optionally scaling features on input

The learning rate is an important hyperparameter (i.e. a parameter that is set prior to model fitting, and not learned by the algorithm) because it has a strong effect on model convergence:

  • If the learning rate is too large, the algorithm can overshoot the global minimum and actually end up increasing error
  • If the learning rate is too small, the algorithm can converge too slowly

Sometimes, to optimize the efficiency of the cost convergence, we'll want to use an adaptive learning rate that changes with every iteration. The skeleton of an adaptive learning rate looks like this:

$$ \eta = \frac{c_{1}}{n + c_{2}} $$

Where $c_{1}$ and $c_{2}$ are preset constants, and $n$ is the current iteration. With an adaptive learning rate, $\eta$ will gradually decay as $n$ gets large, further slowing down the descent as we approach the minimum.
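As a tiny sketch (the constants here are arbitrary, chosen only for illustration):

```python
def adaptive_eta(n, c1=1.0, c2=10.0):
    """Learning rate that decays as the iteration number n grows."""
    return c1 / (n + c2)
```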

Feature scaling describes a broad category of techniques for transforming training data before fitting the model in order to give the data properties that make it easier to work with. For Adaline, Raschka uses standardization, an algorithm for transforming input features onto the same scale and reorienting them around their mean:

$$ x'_{j} = \frac{x_{j} - \mu_{j}}{\sigma_{j}} $$

Where $x'_{j}$ is the new input value for the $j$th feature (column), $\mu_{j}$ is the sample mean for the $j$th feature, and $\sigma_{j}$ is the sample standard deviation for the $j$th feature.

Note that the standardized value Raschka is using here is a standard score computed from sample statistics – we use the sample mean and sample standard deviation because we don't know the population parameters (the true values of $\mu$ and $\sigma$).
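In NumPy, standardizing every feature column at once is nearly a one-liner (a sketch, assuming X is an (n_samples, n_features) array):

```python
import numpy as np

def standardize(X):
    """Center each feature (column) on its sample mean and scale it by its
    sample standard deviation."""
    return (X - X.mean(axis=0)) / X.std(axis=0)
```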

Choosing a good learning rate and an appropriate feature scaling algorithm are two elements of model tuning, the process by which a human makes decisions about a learning algorithm to help optimize it. See? Humans are still useful after all!

Stochastic gradient descent

Sometimes batch gradient descent is just too dang computationally expensive. If the training data contains thousands of features or millions of samples, for instance, recalculating the cost function with respect to the entire dataset on every iteration – as we do in batch gradient descent – can take too much computing time to be useful.

Enter stochastic gradient descent (SGD), an algorithm for efficiently approximating batch gradient descent! In SGD, we calculate the cost function with respect to a single sample, such that we can update the weights one at a time for each sample. Mathematically:

$$ \Delta w = \eta ( y^{(i)} - \phi(z^{(i)}) ) x^{(i)} $$

The big change here from batch gradient descent is that we're using the error term (residual) for the $i$th sample, $y^{(i)} - \phi(z^{(i)})$, in place of the summed error term over all samples, $\sum_{i} ( y^{(i)} - \phi(z^{(i)}) )$.

A major advantage of SGD over batch gradient descent is that SGD typically converges much faster on the global minimum, since we update the weights immediately as we look at new data. A disadvantage is that the path toward the minimum will be much noisier, since each update is computed from a single sample rather than the entire dataset – but this also has a hidden advantage, in that we're likely to escape local minima much faster. (Raschka doesn't provide a proof of this, but I'd be curious to see one.)

Another advantage of SGD is that, since we're updating with respect to one sample instead of the entire dataset, we can efficiently stream in training data – a process called online learning. With online learning, we can update the weights on the fly as new samples come in without having to calculate the cost with respect to the entire dataset. In fact, we don't even have to store samples at all! Once we've updated the weights, we can remove the sample from memory.
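A sketch of one SGD pass over the data, with the conventional shuffle before each pass (the function name and conventions are my own):

```python
import numpy as np

def adaline_sgd_epoch(w, X, y, eta=0.01, seed=None):
    """One pass of stochastic gradient descent: shuffle the samples, then
    update the weights after every individual sample."""
    rng = np.random.default_rng(seed)
    for i in rng.permutation(len(y)):
        output = np.dot(X[i], w[1:]) + w[0]   # phi(z) for one sample
        error = y[i] - output
        w[1:] += eta * error * X[i]
        w[0] += eta * error
    return w
```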

A final note on SGD: an important variant halfway between batch and stochastic gradient descent is mini-batch gradient descent, which takes a cue from both algorithms. The intuition is pretty simple: instead of calculating cost with respect to a single sample, or with respect to the entire dataset, we'll define a size for subsets of the full dataset (say, where $n = 50$) and perform the gradient descent with respect to the batches:

$$ \Delta w = \eta \sum_{i=1}^{n} ( y^{(i)} - \phi(z^{(i)}) ) \; x^{(i)} $$

This approach combines the computational efficiency of SGD with the accuracy of batch descent. Nice!
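And a corresponding sketch of one mini-batch pass, with $n = 50$ samples per update (same hypothetical conventions as the earlier sketches):

```python
import numpy as np

def adaline_minibatch_epoch(w, X, y, eta=0.01, batch_size=50):
    """One pass of mini-batch gradient descent: one update per batch of
    `batch_size` samples instead of per sample or per full dataset."""
    for start in range(0, len(y), batch_size):
        Xb = X[start:start + batch_size]
        yb = y[start:start + batch_size]
        errors = yb - (np.dot(Xb, w[1:]) + w[0])
        w[1:] += eta * np.dot(Xb.T, errors)
        w[0] += eta * errors.sum()
    return w
```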

