Chapter 3: A Tour of Machine Learning Classifiers

General principles

Selecting a classifier is a difficult but critical task in machine learning! For any given problem, it's good practice to compare several models and test their performance, since no single classifier is universally the "best": each one comes with its own costs and benefits for a particular application.

Some common features of a dataset to consider when choosing models:

  • How noisy is the dataset?
  • Are the classes linearly separable (as is necessary for perceptron and Adaline, for example)?
  • How many dimensions are you working with? (How "big" is your data?)

Five steps to train any algorithm

  1. Select features (and optionally perform feature scaling)
  2. Choose a performance metric (cost function) so you know how well your model is doing
  3. Choose one or more classifiers (Adaline? Perceptron?) and optimization algorithms (batch descent? SGD?)
  4. Evaluate the performance of your model(s)
  5. Tune your hyperparameters (number of epochs, learning rate)

New classifiers

Logistic regression

Logistic regression is a classification algorithm very closely related to Adaline and the perceptron. It's also a linear model for binary classification, and it follows the same neural architecture:

Input  -\  <------------------------- error: update weights            1
Weight --\                                               |            /
Input -----> net input function --> activation function -|-> quantizer
Weight --/                                                            \ 
Input  -/                                                              -1

In a logistic regression model, however, the activation function and cost function are very different from the two learning algorithms we've studied! Changing these two functions, as we'll see, will have the nice effect of allowing us to make good classifications even when the data is not perfectly linearly separable.

Activation: the logistic (sigmoid) function

The activation function in logistic regression is (shockingly!) called the logistic function, or sometimes the "sigmoid function" in reference to its S-curved shape. The intuition behind the logistic function is that it takes any number on the real line and squishes it onto the interval $(0, 1)$. This has the really cool property of returning the confidence of a classification, since we can interpret the output of the function as the probability that the class label $\hat{y_{i}} = 1$.

To derive the logistic function, let's start by defining the odds function, which takes the probability, $p$, of an event and returns the ratio of how likely that event is to happen ($p$) versus not happen ($1-p$):

$$odds = \frac{p}{(1-p)}$$

where in our case, $p$ represents the probability that $\hat{y_{i}} = 1$.

One way of mapping this probability to the real line is the logit function, which is the natural log of the odds:

$$logit(p) = log(odds) = log(\frac{p}{1-p})$$

But for the purposes of classification, we're more interested in the inverse of the logit function, the logistic function, which will map numbers from the real line to probabilities:

$$logistic(z) = \phi(z) = \frac{1}{1 + e^{-z}}$$

Here, as in the previous learning algorithms we've studied, $z$ is a linear combination of the weight vector and the input vector, $w^{T}x$. As such, we can interpret the logistic function as taking in the net input and returning the probability that the sample has a class label of 1, given $x$ weighted by $w$:

$$\phi(z_{i}) = P(y_{i}=1 \mid x ; w)$$
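
To make this concrete, here's a minimal NumPy sketch of the logistic activation (the function name and example values are just illustrative):

```python
import numpy as np

def logistic(z):
    """Squash the net input z (any real number) onto the interval (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

# The output can be read as P(y=1 | x; w):
print(logistic(0.0))   # 0.5    -- right on the decision boundary
print(logistic(4.0))   # ~0.982 -- confident positive classification
print(logistic(-4.0))  # ~0.018 -- confident negative classification
```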

Quantizer: same as it ever was

On first glance, the quantizer function for logistic regression looks different from the quantizer functions we saw in chapter 2:

$$\hat{y} = \bigg\{ \begin{array}{ll} 1 & if \> \phi(z) \ge 0.5 \\ 0 & otherwise \end{array}$$

Immediately, two characteristics of this function jump out as different:

  1. The positive class label requires that $\phi(z)$ be greater than $0.5$, not $0$
  2. The possible class labels are $\{0, 1\}$, not $\{-1, 1\}$

Point #2 is indeed different, but makes little difference: we define the negative class to be $0$ because it better fits the properties of the logistic function, not for any deep purpose.

But on closer inspection, point #1 is revealed to be a red herring! Since $\phi(0) = 0.5$ and $\phi(z)$ is monotonically increasing, we can prove that

$$\hat{y} = \bigg\{ \begin{array}{ll} 1 & if \> z \ge 0 \\ 0 & otherwise \end{array}$$

Which is identical to the quantizer used in Adaline/perceptron learning, just with a new negative class label!

Cost function: brand new territory

Since logistic regression deals with probabilities, our goal in defining a cost function will be to maximize the likelihood of a correct classification. We'll achieve this by defining the likelihood of correct classification, taking the natural log of that likelihood (which is easier to differentiate), and then flipping the sign so that we can minimize the negative log-likelihood instead.

Let's start by defining the likelihood of a correct classification, $L(w)$:

$$L(w) = P(y \mid x ; w)$$

By assuming that our samples are independent, we can reason that this is equivalent to taking the cumulative product across samples:

$$L(w) = \prod_{i=1}^{n} P \left( y^{(i)} \mid x^{(i)} ; w \right)$$

And finally, we'll sub in our activation function to define the probabilities in terms of the net input $z$:

$$L(w) = \prod_{i=1}^{n} \left( \phi(z^{(i)}) \right)^{y^{(i)}} \left( 1 - \phi(z^{(i)}) \right)^{(1 - y^{(i)})}$$

Notice that the "incorrect" term cancels based on the value of the class label $y^{(i)}$. When the class label is positive ($1$), we return the positive probability $\phi(z^{(i)}$ for the sample, since the other term evaluates to $1$; when the class label is negative ($0$), we return the negative probability $(1 - \phi(z^{(i)}))$. Mathematically, we get the following stepwise function for the likelihood:

$$L(z) = \bigg\{ \begin{array}{ll} \phi(z) & if \> y = 1 \\ 1 - \phi(z) & otherwise \end{array}$$

It's hard to maximize over products, however, so instead of using the likelihood, we'll use the log-likelihood, $l(w) = log(L(w))$. (The log-likelihood also helps prevent numeric underflow when we have extremely small likelihoods.) Using properties of logs, we can simplify $log(L(w))$ like so:

$$l(w) = \sum_{i=1}^{n} \bigg [ y^{(i)}log(\phi(z^{(i)})) + (1 - y^{(i)})log( 1 - \phi(z^{(i)})) \bigg ]$$

But since minimization is easier, we'll actually want to minimize the negative log-likelihood – our final cost function $J$!

$$J(w) = -l(w) = \sum_{i=1}^{n} \bigg [ -y^{(i)}log(\phi(z^{(i)})) - (1 - y^{(i)})log( 1 - \phi(z^{(i)})) \bigg ]$$
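
As a sanity check, here's a small NumPy sketch of this cost function (assuming labels in $\{0, 1\}$ and activations $\phi(z)$ already computed):

```python
import numpy as np

def logistic_cost(y, phi_z):
    """Negative log-likelihood J(w) for labels y in {0, 1} and activations phi(z)."""
    y, phi_z = np.asarray(y, dtype=float), np.asarray(phi_z, dtype=float)
    return np.sum(-y * np.log(phi_z) - (1 - y) * np.log(1 - phi_z))

# A confident, correct prediction contributes almost nothing to the cost...
print(logistic_cost([1], [0.99]))  # ~0.01
# ...while a confident, wrong prediction is penalized heavily.
print(logistic_cost([1], [0.01]))  # ~4.6
```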

Overfitting and underfitting (bias and variance)

"Overfitting" is a term we can use to describe a model that fits the training data "too well," such that it doesn't generalize well to new samples. (The opposite of overfitting is underfitting, where a model is not complex enough – usually meaning that it is not a high enough degree polynomial – to capture the underlying process that produces the data.)

Another way of describing problems with a model due to overfitting/underfitting is to describe the model as high variance (overfit) or high bias (underfit). Both terms describe consistency:

  • A high variance model has inconsistent predictions, meaning that when we retrain the model on different subsets of the training data we get different predictions, since our model is fitting itself to the noise in the data and not the underlying phenomenon. (While "variance" is a summary statistic often used to describe a dataset, in this case we mean to describe the wide range of different predictions that the model produces when trained on different subsets.)
  • A high bias model has consistent residuals, meaning that when we retrain the model on different subsets of training data we get errors (or costs) that follow a pattern. As in the case of a "biased estimator," residuals following their own pattern are an indication that our errors are not due to the randomness of our training data, but rather due to our classifier's inability to model the true phenomenon underlying the data.

Bias and variance are often seen as sources of error that need to be balanced in the training of a model. Hence, we use the phrase bias/variance tradeoff to describe the way that decreasing bias can often increase variance, and vice versa. For an in-depth explanation of the bias/variance tradeoff (with good visual aids to boot), see Scott Fortmann-Roe's excellent breakdown of the topic.

L2 Regularization

When training classifiers, regularization can be a useful method for finding a good balance between bias and variance. Raschka covers L2 regularization, a specific technique that adds bias to a model in the form of a squared weight penalty, scaled by a regularization parameter, in order to penalize extreme weights during learning.

Under L2 Regularization, we'll tweak our cost function like so:

$$ J(w) := J(w) + \frac{\lambda}{2} \mid \mid w \mid \mid^{2} $$

Where $\mid \mid w \mid \mid$ is the length of the weight vector, $\sqrt{\sum_{j=1}^{m} w_{j}^2}$, and $\lambda$ is the regularization parameter that controls the strength of the regularization. The stronger the regularization (where $\lambda$ is large), the more drastic the update will be when the values of the weight vector $w$ are large. (Note that L2 regularization requires scaled features, since the weights need to all be on the same scale for the cost function to accurately model an increase in bias.)

By convention, we often use the regularization parameter $C$, the inverse of $\lambda$. The relationship between the two parameters is straightforward:

$$ C = \frac{1}{\lambda}$$

Multiplying through the above cost function by $C$ gives us a slightly altered version:

$$ J(w) := C \bigg [ J(w) \bigg ] + \frac{1}{2} \mid \mid w \mid \mid^{2}$$

which expands out to

$$ J(w) := C \bigg [ \sum_{i=1}^{n} -y^{(i)}log(\phi(z^{(i)})) - (1 - y^{(i)})log(1 - \phi(z^{(i)})) \bigg ] + \frac{1}{2} \mid \mid w \mid \mid^{2}$$

Since $C$ is the inverse of $\lambda$, large values of $C$ correspond to low strength regularization, while small values of $C$ correspond to high strength regularization.
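
As a quick, hedged illustration (assuming scikit-learn is available, which these notes haven't formally introduced): its LogisticRegression exposes exactly this inverse regularization strength as the C parameter.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=1)

# Remember: L2 regularization expects scaled features.
scaler = StandardScaler().fit(X_train)
X_train_std, X_test_std = scaler.transform(X_train), scaler.transform(X_test)

for C in (100.0, 1.0, 0.01):
    lr = LogisticRegression(C=C, penalty='l2')  # small C => large lambda => strong regularization
    lr.fit(X_train_std, y_train)
    print(C, lr.score(X_test_std, y_test))
```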

Support Vector Machines

The Support Vector Machine (SVM) algorithm is an extension of the perceptron that is very common in classification problems. The major difference is that where the perceptron minimizes misclassification errors, SVM maximizes the margin between the decision boundary and the support vectors, the samples closest to the decision boundary.

Raschka's section on SVM is pretty sparse, so the following notes are heavily supported by Chapter 9 of G. James et al.'s Introduction to Statistical Learning.

How do we define the margin mathematically? To start, recall that a decision surface is defined as a linear combination that fulfills the two requirements:

$$ \bigg\{ \begin{array}{ll} w^{T}x_{i} \gt 0 & if y_{i} = 1 \\ w^{T}x_{i} \lt 0 & if y_{i} = -1 \end{array} $$

Since $y_{i}$ can only be $1$ or $-1$, we can turn this piecewise definition into a singular one:

$$ y_{i} (w^{T}x_{i}) \gt 0 $$

With a margin, $M$, we can easily rewrite this definition in terms of $M$:

$$ y_{i} (w^{T}x_{i}) \ge M $$

Our goal will be to find $w$ that maximizes $M$ subject to the above constraint. According to Raschka, the optimization algorithm is too complex for this book, so we just have to trust that it can be done for now.

Slack variables

Slack variables are an optional addition to the SVM objective function that help improve performance for nonlinearly-separable data. With slack variables, we can allow for specific classification errors in samples that are really close to our decision boundary. We call SVM soft-margin classification when we use slack variables, since they allow us to occasionally "override" the margin, and hard-margin classification otherwise.

Recall the definition of the decision surface and its margin:

$$ y_{i} (w^{T}x_{i}) \ge M $$

Now, we'll add the slack variable vector $\epsilon$ to make the margin looser:

$$ y_{i} (w^{T}x_{i}) \ge M(1 - \epsilon_{i}) $$

Where $\epsilon$ is subject to the following constraints:

$$ \begin{array}{rr} \epsilon_{i} \ge 0, \\ \sum_{i}^{n} \epsilon_{i} \le C \end{array} $$

Where the parameter $C$ is the misclassification penalty (also known as the budget) that controls the width of the margin. Since the values of the vector $\epsilon$ are constrained by $C$, as $C$ approaches $0$ we approach hard-margin SVM, as the effect of the slack variables becomes negligible. $C$ affects the bias/variance tradeoff of an SVM model: when $C$ is large, we'll tend to see consistent errors due to a wide margin (high bias); when $C$ is small, we'll tend to see a very limited number of support vectors determining the margin (high variance).

As a quick aside, Raschka defines the following objective function as the one that SVM with slack variables must minimize:

$$ \frac{1}{2} \mid\mid w \mid\mid^{2} + C \bigg ( \sum_{i=1}^{n} \xi^{(i)} \bigg ) $$

But since we don't have any context for how to minimize this function, I don't see how it's much more help.
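
For what it's worth, here's a hedged scikit-learn sketch (my own addition, not something Raschka shows here). Note that SVC's C multiplies the slack penalty, as in the objective just above, so large C pushes toward a hard margin; that's the opposite direction of the ISL "budget" convention.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=1)
scaler = StandardScaler().fit(X_train)
X_train_std, X_test_std = scaler.transform(X_train), scaler.transform(X_test)

for C in (0.01, 1.0, 100.0):
    svm = SVC(kernel='linear', C=C)  # larger C => heavier slack penalty => narrower margin
    svm.fit(X_train_std, y_train)
    print(C, svm.score(X_test_std, y_test), len(svm.support_))  # support vector count shrinks as C grows
```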

What are the tradeoffs between SVM and logistic regression?

SVM is less sensitive to outliers than logistic regression, since it only considers errors in the points within the margin (whereas logistic regression calculates residuals across the entire dataset).

On the other hand, logistic regression is a simpler model, and it's easier to implement. It's also a lot easier to update, which makes it good for online learning (streaming data).
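
To illustrate the online-learning point (a hedged sketch, not from Raschka's chapter): scikit-learn's SGDClassifier with a logistic loss can be updated one mini-batch at a time via partial_fit. The loss is named 'log_loss' in recent versions ('log' in older ones), and the streaming data below is synthetic.

```python
import numpy as np
from sklearn.linear_model import SGDClassifier

clf = SGDClassifier(loss='log_loss', learning_rate='constant', eta0=0.01)
classes = np.array([0, 1])

rng = np.random.default_rng(0)
for _ in range(100):                     # simulate a stream of mini-batches
    X_batch = rng.normal(size=(32, 2))
    y_batch = (X_batch[:, 0] + X_batch[:, 1] > 0).astype(int)
    clf.partial_fit(X_batch, y_batch, classes=classes)  # incremental weight update

print(clf.coef_, clf.intercept_)
```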

Kernel methods

Kernel methods describe a set of techniques for separating nonlinearly separable classes while still using a fundamentally linear model. The kernel refers to a similarity function which returns scores that quantify how similar two samples are.

Intuition

There are three basic steps to any kernel-based learning algorithm:

  1. Project samples into a higher-dimensional space where they are linearly separable
  2. Find the hyperplane that separates the samples through a learning algorithm (SVM, logistic regression, etc.)
  3. Project the samples and hyperplane back into the original dimension

Math

In a kernelized algorithm, we'll move between dimensions via a mapping function $\phi(\cdot)$. Raschka gives an example of one possible mapping function that moves from two dimensions to three:

$$ \phi(x_{1}, x_{2}) = (z_{1}, z_{2}, z_{3}) = (x_{1}, x_{2}, x_{1}^{2} + x_{2}^{2}) $$

Then, we'll use the inverse function $\phi^{-1}(\cdot)$ to move back to two dimensions.
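
A tiny NumPy sketch of that example mapping (illustrative only): points on two concentric circles, which aren't linearly separable in 2-D, can be separated by a plane along the third coordinate after the projection.

```python
import numpy as np

def phi(x1, x2):
    """Project a 2-D sample into 3-D: (x1, x2) -> (x1, x2, x1^2 + x2^2)."""
    return np.array([x1, x2, x1 ** 2 + x2 ** 2])

print(phi(1.0, 0.0))   # [1. 0. 1.]  -- inner-circle point, small third coordinate
print(phi(3.0, 4.0))   # [3. 4. 25.] -- outer-circle point, large third coordinate
```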

Since projecting back and forth between dimensions can be expensive when updating weights across an entire dataset, we'll instead define a kernel function that measures the similarity between samples directly. Raschka defines the Radial Basis Function (RBF – also known as the Gaussian kernel):

$$ k(x^{(i)}, x^{(j)}) = exp \bigg( -\gamma \mid\mid x^{(i)} - x^{(j)} \mid\mid^{2} \bigg) $$

Where $x^{(i)}, x^{(j)}$ are two samples in our dataset, $exp$ maps the score onto the interval $(0, 1]$, $\mid\mid x^{(i)} - x^{(j)} \mid\mid^{2}$ is the squared distance between the two samples (which becomes a similarity measure when prepended with a negative sign), and $\gamma$ is the "cutoff parameter" that defines how "loose" the fit of the boundary is. When $\gamma$ is large, we'll get a "tight" fit and the hyperplane will closely hew to the training samples; when $\gamma$ is small, we'll get a "loose" fit that will instead define rough regions in which the training samples reside.
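
Here's a minimal NumPy sketch of the RBF kernel (names and values are illustrative); in scikit-learn the same idea is reached with SVC(kernel='rbf', gamma=...).

```python
import numpy as np

def rbf_kernel(x_i, x_j, gamma=1.0):
    """Similarity score in (0, 1]: 1.0 for identical samples, near 0 for distant ones."""
    sq_dist = np.sum((np.asarray(x_i) - np.asarray(x_j)) ** 2)
    return np.exp(-gamma * sq_dist)

print(rbf_kernel([1.0, 2.0], [1.0, 2.0]))  # 1.0     -- identical samples
print(rbf_kernel([1.0, 2.0], [3.0, 0.0]))  # ~0.0003 -- distant samples
```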

Decision trees

Up until this point, we've looked exclusively at learning algorithms that fit into the "neural" model established by the perceptron and Adaline algorithms. Decision trees, however, are a common family of classifiers that use a different learning architecture.

On an intuitive level, decision trees define a set of logical operations to perform in stepwise fashion, with the goal of dividing the sample space into rule-based classification regions. For example, when learning on the famous Iris dataset, a decision tree might "ask questions" (i.e. perform boolean functions) to procedurally partition the sample space, like:

  • Is the petal length $\ge$ 2.5cm?
  • Is the sepal length $\le$ 5cm?

Hence, the algorithm defines a "tree" of logic that acts as a function to return class labels.

Decision trees are particularly useful because they're easily interpretable! They learn simple rules for dividing the sample space that can be communicated clearly to humans.
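
As a quick sketch of that interpretability (assuming scikit-learn and its Iris loader, not code from the notes above): the rules of a fitted tree can be printed as plain if/else questions.

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_text

iris = load_iris()
tree = DecisionTreeClassifier(criterion='gini', max_depth=2, random_state=1)
tree.fit(iris.data, iris.target)

# Prints the learned partitioning rules, e.g. thresholds on petal length/width.
print(export_text(tree, feature_names=list(iris.feature_names)))
```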

Maximizing information gain

Up until now, most objective functions we've looked at have been some form of cost (or error) functions, with a basic goal that we could express in prose as "determine how 'wrong' the model is based on one set of criteria, and then try to minimize that wrongness." Decision trees, on the other hand, introduce a radically new family of objective functions that aim to maximize the information gain of sample space partitions.

Raschka defines the information gain of a partition like so:

$$ IG(D_{p}, f) := I(D_{p}) - \sum_{j=1}^{m} \frac{N_{j}}{N_{p}} I(D_{j}) $$

Where:

  • $D_{p}$ is the parent node (the initial state of a region of the dataset, up to and including the sample space itself, that is set to be partitioned)
  • $f$ is the partition under consideration that splits up the parent node
  • $D_{j}$ is the $j$th child node
  • $I$ is the impurity measure that we use to quantify how much information is contained in a partition
  • $N$ is the number of samples in either the parent or child nodes

In prose, we measure the information gain by finding the difference between the impurity of the parent and the cumulative impurity of the child nodes. If the child nodes are cumulatively less impure (i.e. more pure) than the parent node, we've gained information (and vice versa).

For simplicity, we'll often use binary partitions (trees) of the parent nodes:

$$ IG(D_{p}, f) = I(D_{p}) - \bigg [ \frac{N_{left}}{N_{p}} I(D_{left}) + \frac{N_{right}}{N_{p}} I(D_{right}) \bigg ] $$

This means that each fork has at most two paths, and is computationally convenient.
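
For a (hypothetical) worked example: suppose a parent node has impurity $I(D_{p}) = 0.5$ and a binary split sends 40 of its 80 samples to each child, with $I(D_{left}) = I(D_{right}) = 0.25$. Then

$$ IG(D_{p}, f) = 0.5 - \bigg [ \frac{40}{80}(0.25) + \frac{40}{80}(0.25) \bigg ] = 0.25 $$

so the split gains information.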

So how do we determine the impurity of any given partition? For that, we turn to the three algorithms that Raschka presents:

  • Entropy ($I_{H}$)
  • Gini impurity ($I_{G}$)
  • Classification error ($I_{E}$)

Entropy

We define entropy with the following formula. For all non-empty classes (classes that contain samples in the training data):

$$ I_{H}(t) = - \sum_{i=1}^{c} p(i \mid t) \; log_{2} p(i \mid t) $$

Where

  • $t$ is a given node
  • $i$ marks the $i$th class
  • $c$ is the total number of classes
  • $p(i \mid t)$ is the proportion of samples of class $i$ in node $t$ (equivalent to the probability that a sample is of the class $i$)

Notice that when all samples in a node belong to the same class, the sum over non-empty classes reduces to a single term with $p = 1$, and the impurity is minimized at $0$:

$$ I_{H}(t) \; = \; -1 \cdot log_{2}(1) \; = \; 0$$

And when the samples are split evenly between two classes (the binary case), the impurity is maximized at $1$:

$$ I_{H}(t) \; = \; -2 \; \big( \; 0.5 \cdot log_{2}(0.5) \; \big) \; = \; -2 \; (0.5 \cdot -1) \; = \; 1 $$

Gini impurity

Intuitively, the goal of Gini impurity is to minimize the probability of misclassification. As such, we define the impurity measure like so:

$$ I_{G}(t) \; = \; \sum_{i=1}^{c} p(i \mid t) \big(\; 1 - p(i \mid t) \;\big) \; = \; 1 - \sum_{i=1}^{c} p(i \mid t)^{2} $$

with the same variable definitions as in entropy.

Again, the impurity measure is minimized at 0 when the classes are perfectly separated, e.g.:

$$ 1 - (1^{2} + 0^{2} + 0^{2} + ... + 0^{2}) = 0 $$

and, in the binary case, the measure is maximized at 0.5 when the classes are uniformly distributed among samples:

$$ 1 - (0.5^{2} + 0.5^{2}) = 0.5 $$

Usually, we get a pretty similar result using Gini and entropy as impurity indices. Raschka advises against trying them both, saying it's not a good use of time.

Classification error

In contrast to entropy and Gini impurity, classification error only considers the class with the highest proportion of samples:

$$ I_{E}(t) = 1 - max\{p(i \mid t)\} $$

The relative sensitivity of the impurity measures can be expressed as a hierarchy, from most sensitive to least:

  1. Entropy
  2. Gini impurity
  3. Classification error

Raschka describes Gini impurity as a kind of "compromise" between entropy and classification error.
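
To see those sensitivities side by side, here's a hedged NumPy sketch for a binary node (where p is the proportion of class-1 samples), reusing the binary information-gain formula from earlier:

```python
import numpy as np

def entropy(p):
    return -p * np.log2(p) - (1 - p) * np.log2(1 - p)

def gini(p):
    return 1 - (p ** 2 + (1 - p) ** 2)

def classification_error(p):
    return 1 - max(p, 1 - p)

def information_gain(p_parent, p_left, p_right, n_left, n_right, impurity=gini):
    """IG(D_p, f) for a binary split, using the given impurity measure."""
    n = n_left + n_right
    return (impurity(p_parent)
            - (n_left / n) * impurity(p_left)
            - (n_right / n) * impurity(p_right))

# Parent node: 40/40 samples (p = 0.5); children: (30, 10) and (10, 30).
for impurity in (entropy, gini, classification_error):
    print(impurity.__name__, information_gain(0.5, 0.75, 0.25, 40, 40, impurity))
```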

Random forests

Random forests are a generalization of decision trees that have some nice properties:

  • High performing
  • Scalable
  • Easy to use
  • Good at handling noisy datasets

Their main downside is that they're slightly harder to interpret than a single decision tree.

Intuition

In short, random forest is an ensemble of decision trees: by building a set of weak learners on random subsets of the samples and features in the dataset, we can combine them into a strong learner that can classify new samples by "majority vote" among the weak learners.

The algorithm is straightforward:

  1. Draw $n$ "bootstrap" samples with replacement from the sample space
  2. Grow a decision tree! For each node:
    1. Draw a subset of $d$ features without replacement
    2. Split the node on the feature that maximizes the objective function (e.g. Information Gain)
  3. Repeat steps 1 and 2 $k$ times
  4. When classifying new data, assign class labels by majority vote among the ensemble

Here, you can see that the weak learners are each built on a random sample of samples and features from the original dataset – hence the "randomness" of the forest.
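
A minimal scikit-learn sketch of this (hedged, not Raschka's exact code): n_estimators plays the role of $k$, max_features the role of $d$, and each tree sees a bootstrap sample of the rows by default.

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=1)

forest = RandomForestClassifier(n_estimators=100,     # k: trees in the ensemble
                                max_features='sqrt',  # d: features drawn per split
                                bootstrap=True,       # each tree trains on a bootstrap sample
                                random_state=1)
forest.fit(X_train, y_train)
print(forest.score(X_test, y_test))  # predictions are a majority vote over the ensemble
```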

Hyperparameters

Random forests have very few hyperparameters. All we have to decide is:

  1. $n$, the number of bootstrap samples (influences bias/variance tradeoff)
  2. $d$, the number of features to sample (influences complexity of the algorithm)
  3. $k$, the number of iterations

Unlike in the case of a single decision tree, we don't have to do much "pruning" of a random forest – the random sampling takes care of that for us.

K-Nearest Neighbors

K-Nearest Neighbors (KNN) is the last algorithm Raschka presents in Chapter 3... phew!

KNN is a lazy learner, which means it doesn't learn a decision function – instead, it memorizes the training data and classifies new samples based on the distance to training samples.

Since KNN doesn't learn parameters for a decision function, we can also call it a non-parametric model. Here's how some of the models we've learned might be classified in terms of parametric and non-parametric:

Parametric models we've learned

  • Perceptron
  • Adaline
  • Logistic regression

Non-parametric models we've learned

  • Decision trees
  • Random forests
  • KNN (soon!)

The distinction between parametric/non-parametric models is a little bit nebulous. Kernel SVM, for example, does learn parameters, but we often don't care about them – we're only interested in the mapping function.

KNN Intuition

The KNN algorithm is very simple:

  1. Choose a number of relevant neighbors, $k$, and a distance metric $d$ to compare them
  2. For each new sample:
    1. Find the $k$ closest samples according to the metric $d$
    2. Assign a class label by majority vote among the samples
      • In case of even $k$ and a tie vote, decrease $k$ by 1
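
A hedged scikit-learn sketch of the steps above: n_neighbors is $k$, and the distance metric defaults to Minkowski with $p = 2$ (i.e. Euclidean).

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import StandardScaler

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=1)

# Distance-based methods benefit from scaled features.
scaler = StandardScaler().fit(X_train)
knn = KNeighborsClassifier(n_neighbors=5, metric='minkowski', p=2)
knn.fit(scaler.transform(X_train), y_train)
print(knn.score(scaler.transform(X_test), y_test))
```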

Distance functions

There's a whole world of distance functions we can use in KNN. Raschka uses Minkowski distance, which is a generalization of the Euclidean and Manhattan distances:

$$ d(x^{(i)}, x^{(j)}) = \sqrt[p]{\sum_{k} \mid x_{k}^{(i)} - x_{k}^{(j)} \mid^{p}} $$

When $p = 2$, the Minkowski distance is equivalent to Euclidean distance; when $p = 1$, it's equivalent to Manhattan distance.
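
A small NumPy sketch of the Minkowski distance (names are just illustrative):

```python
import numpy as np

def minkowski_distance(x_i, x_j, p=2):
    """Minkowski distance: p=2 gives Euclidean, p=1 gives Manhattan."""
    diff = np.abs(np.asarray(x_i, dtype=float) - np.asarray(x_j, dtype=float))
    return np.sum(diff ** p) ** (1.0 / p)

print(minkowski_distance([0, 0], [3, 4], p=2))  # 5.0 (Euclidean)
print(minkowski_distance([0, 0], [3, 4], p=1))  # 7.0 (Manhattan)
```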

The curse of dimensionality

As the number of dimensions grows, the feature space becomes increasingly sparse, making it hard to find "close" neighbors. Hence, feature selection and dimensionality reduction are important for KNN.

Luckily, we'll be learning all about that... in the next lesson!