Author: Dr. Robert Lyon
Contact: robert.lyon@manchester.ac.uk
Institution: University of Manchester
Affiliation: SKA Group, Time Domain Team (TDT)
Version: 1.0
Date: 11/04/2018
This notebook introduces the basic concepts underpinning machine learning classification. The content presented here was originally written to support a talk I delivered at the European Week of Astronomy and Space Science (EWASS) meeting in 2018. The talk is entitled:
Imbalanced Learning In Astronomy
Note that this notebook requires Python 3.6.
The code and the contents of this notebook are released under the GNU GENERAL PUBLIC LICENSE, Version 3, 29 June 2007. The images are exempt from this, as some are used in publications I've written in the past. If you'd like to use the images please let me know, and I'll sort something out.
I kindly request that if you make use of the notebook, please cite the work using the following BibTeX source.
We use instruments to measure the world around us. For example, we can use a thermostat to measure temperature, a questionnaire to measure opinions or record details, and ground-based telescopes to measure electromagnetic radiation arriving at the Earth's surface. In each case, measurements yield data.
When we take a measurement, we're making an observation. Observational data can be numerical, textual, categorical, or a combination thereof. In the case of a thermostat, it takes temperature measurements at a specific point in time, producing a single numerical value. If the temperature within a room drops below a pre-determined level (i.e. becomes too cold), the measurements from the thermostat can be used to automatically turn a heating system on to compensate. The performance of the system depends on the thermostat providing accurate measurements. Otherwise a room or building could become too hot, or too cold.
There are many factors that contribute to cooling or heating an arbitrary room. These include the current state of a potential heating system (on/off), the temperature outside, the time of day, the time of year, or even the body heat emitted by people within the room. There are many variables to consider, and these are but a few.
Usually we don't care about all the details - the room is either too hot, or too cold. But what about in situations where heat management is critically important? Say, in the heart of a computer server room or nuclear reactor. Here temperature changes can cause problems. In these scenarios it is generally desirable to keep equipment cool for optimal performance. If the temperature increases too much, it may become necessary to cool things down. Perhaps by increasing the airflow, using air conditioning/cooling liquids, or even turning some machines/the reactor off. Such actions have safety implications and a financial cost. Ideally we'd like to take pre-emptive action to stop any adverse conditions arising. Perhaps by making predictions.
Making such predictions is hard! There are many variables to consider. To proceed we usually make a set of assumptions regarding which variables matter most, and focus on those. So for example, let's not worry about the heat added to the server room by people. This is the best we can do with imperfect knowledge. Ideally we'd like to fully understand the processes at play in these scenarios. But what does "understand the process" actually mean?
We can quantify this mathematically via functions. A function is just an input/output box. It accepts input in the form of data, and produces data as output via a mathematical formula. We can use functions to model very complicated scenarios more easily.
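As a trivial sketch of this idea in Python, a function in code is exactly such an input/output box (the formula and variable names below are invented purely for illustration, not a real thermal model):

```python
# A hypothetical input/output box: given a few input variables, produce a
# single output value. The formula is invented purely for illustration.
def room_temperature(outside_temp, heating_on, people_in_room):
    heating_effect = 5.0 if heating_on else 0.0
    body_heat = 0.1 * people_in_room
    return outside_temp + heating_effect + body_heat

print(room_temperature(outside_temp=11.0, heating_on=True, people_in_room=4))
```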
Example
Consider the server room example. There are a large number of factors determining the room's temperature at any given moment (outside temperature, airflow, number of computers, computational load, time of day etc.). We can't keep track of all these variables - that would be incredibly impractical. Yet we can think of the room as an input/output box that accepts all these variables as inputs, and yields the temperature as output.
In nature there exists a mathematical formula for this function. Given all (potentially infinite) input variables, it would return the exact temperature that results, without error. This function can be represented via:
\begin{eqnarray} y = f(x) \textrm{,} \end{eqnarray}where $x$ represents the input. In many cases we'd like to obtain this function. For the server room it would allow us to predict the state of the room now, and anytime in the future with complete certainty - for any set of input variables. This predictive power could be used to great effect. The cost savings achievable in terms of power alone would be huge - decreased power consumption, increased equipment life-time, fewer hardware failures etc.
However to find $f(x)$ we require a dataset accurately describing the infinite variety of input variables and output states. This is clearly infeasible!
Instead we can observe a server room over a fixed period of time - perhaps we can use that information? Experts collect monitoring data over weeks, months and sometimes even years. The data provides a detailed picture of how input variables map to output states. As more data is collected, the picture becomes clearer.
Suppose a server room is monitored from January to March. The data collected describes a subset of the inputs and outputs of $f(x)$. It doesn't cover the summer months, so we don't have data associated with higher summer temperatures. The data we have is therefore incomplete compared to what's needed to produce $f(x)$.
Instead we have a data sample that describes a simplified version of the server room, whose input and output states are produced by the closely related function:
\begin{eqnarray} f(x)^{\prime} \textrm{.} \end{eqnarray}
This function represents an approximate view of $f(x)$, in this case from January to March. It is important to remember that it is obtained using technological observing instruments and tools. Such human-made devices are imperfect and inadvertently introduce errors/noise (e.g. system noise, human-induced bias etc.). Thus the approximation is imperfect in many ways.
Why is this relevant to Machine Learning?
Suppose we want to make predictions using the monitoring data collected in the server room. The data consists of input variables from a set $X$, and a set of resulting output states from a set $Y$. For example, if $x=$CPU temperature greater than 90 degrees, then $y=$ overheating.
The goal of machine learning classification (also known as pattern recognition) is to perform the mapping from $X$ to $Y$ on new, unseen data; i.e. data not described by $f(x)^{\prime}$. This is depicted in a summary diagram below.
In other words, can we use the data from January to March to predict for the rest of the year? Machine learning classifiers can achieve this. Principally, by computing a mathematical function that automatically performs the mapping from $X$ to $Y$, as accurately as possible. The computed generic classification function can be denoted as:
\begin{eqnarray} f(x)^{\prime\prime} \textrm{.} \end{eqnarray}
Yet this function is imperfect too! In fact, it has more imperfections than $f(x)^{\prime}$. This is because it's made additional assumptions in order to perform the mapping from $X$ to $Y$. To illustrate this, suppose that when the server room was observed, the external temperature never rose above $11^{\circ}$C. Therefore the room's air conditioning was never observed under the strain of warm summer months. Since $f(x)^{\prime}$ doesn't currently capture this temporal pattern, neither can $f(x)^{\prime\prime}$. The classification function will thus likely produce unhelpful predictions during the summer months.
What if our experts take some new observations during the summer months - but forget to account for the time of day as an input variable? Here $f(x)^{\prime}$ is certainly more complete, yet it fails to capture the temperature fluctuations that occur during a normal day. It's usually cooler in the morning, and warmer in the afternoon when most people are working. This is important, since a server room will be under most load during the working day. As $f(x)^{\prime}$ still doesn't capture this, neither will $f(x)^{\prime\prime}$. Once again, this leads to unreliable predictions.
There are some important lessons here:

- We can never obtain the perfect underlying function $f(x)$; we only ever work with approximations of it.
- The approximation $f(x)^{\prime}$ is limited by how, when, and what data were collected.
- The learned function $f(x)^{\prime\prime}$ cannot capture patterns that are missing from $f(x)^{\prime}$.
Taking all these functions together we can paint a simple picture. There is a function describing the perfect underlying data. Then a function describing the data we actually collect - this is what we begin to do science with. Then there's the function describing the interpretations we make via ML algorithms and other methods - this includes errors collected when observing, plus errors we may make during analysis.
So far "computing a function" is an abstract concept. It raises questions - how is this achieved? Is a heuristic used? How is it useful?
It's perhaps best illustrated via a trivial example. Consider the mapping problem in the image below. Here $X$ contains a set of heating configurations that consume power levels defined in $Y$. On the left-hand side we can see that some examples in $X$ have been linked to levels in $Y$. For example an input level of 4 corresponds to an output level of 16 (units unimportant). These items in $X$ have been mapped to items in $Y$. In other words these have been labelled (the ground truth mapping is known).
The function that does the mapping is missing on the left-hand side. How might one be found? Well, you search for one...
Machine learning algorithms are fundamentally optimisation functions. These are functions that search for an optimal set of parameters that minimise or maximise some objective function, where the objective is to make as many correct predictions as possible (or the inverse). This can be easily quantified via a count of the errors made (though it is more complex in practice!). Such an objective function is given below:
\begin{eqnarray} \sum_{i=1}^{n} \delta[f(x_{i})^{\prime\prime}\neq y_{i}] \textrm{,} \end{eqnarray}where $\delta[\cdot]$ is an indicator function that returns 1 if the condition within it evaluates to true, else 0. So here if the prediction of $f(x_{i})^{\prime\prime}$ doesn't match the known label $y_{i}$, $\delta$ returns 1.
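As a minimal sketch, this objective can be written directly in Python (the function and variable names below are my own, chosen for illustration):

```python
# Zero-one objective: count the predictions of a candidate function that
# disagree with the known labels, i.e. the sum of delta[f(x_i)'' != y_i].
def zero_one_errors(predict, X, Y):
    return sum(1 for x, y in zip(X, Y) if predict(x) != y)

print(zero_one_errors(lambda x: x ** 2, [2, 3, 4], [4, 9, 16]))  # 0 errors
```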
Let's use that function to optimise a contrived classifier,
\begin{eqnarray} f(x_{i})^{\prime\prime} = x_{i}^{a} \textrm{,} \end{eqnarray}where $a$ is the parameter to be optimised.
On the right we see two attempts at finding a function mapping $X$ to $Y$. The first uses $a=1.5$, which produces 5 errors and 0 correct predictions. By setting $a=2$ we find an optimal solution, which achieves 0 errors and 5 correct predictions. Typically there would be steps in between during a parameter search, e.g. $a=1.6$, $a=1.7$, $a=1.8$, $\ldots$, found via gradient descent or some other method.
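A sketch of this search in Python, using hypothetical labelled pairs consistent with the example above (input 4 maps to output 16, i.e. $y=x^{2}$; the other values are assumed):

```python
# Hypothetical labelled pairs consistent with the mapping in the image
# (e.g. input 4 -> output 16); the exact values are assumed.
X = [2.0, 3.0, 4.0, 5.0, 6.0]
Y = [4.0, 9.0, 16.0, 25.0, 36.0]

def errors(a):
    """Count of predictions x**a that do not match the known labels."""
    return sum(1 for x, y in zip(X, Y) if x ** a != y)

# A crude grid search over candidate exponents; gradient descent or another
# optimiser would explore this space far more efficiently in practice.
for a in [1.5, 1.75, 2.0]:
    print(f"a = {a}: {errors(a)} errors")   # a = 2.0 yields 0 errors
```

Note the exact equality test only works here because the toy values are exact; real problems use smoother loss functions.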
Here the search is easy as the problem is so simple. Yet at a fundamental level this is how machine learning classifiers work. Except the functions have multiple terms to be optimised according to the type of classifier used. Different classifiers use different objective functions, and different methods for parameter search. There are thousands of potential classifiers.
Now we move on to a formal computer science description of machine learning.
ML is a branch of artificial intelligence (A.I.), concerned with replicating and improving upon the human ability to learn. ML comprises multiple sub-fields, each supported by extremely active research communities. Here we are concerned with applying the tools from a specific area of ML known as pattern recognition - also known as statistical classification. The goal of classification is to automatically categorise items, objects, and data points, as accurately as possible. Human beings are capable of undertaking sophisticated classification tasks with ease, given appropriate training. This is due to our innate ability to categorise via trial and error. ML algorithms learn in a similar way, albeit using statistical tools (see Bishop 2006, and Lyon 2016 for a review).
The aim of classification is to build functions that accurately map a set of input data points to a set of class labels. This means mapping each input example to its correct label. If $X=\lbrace x_{\rm 1}, \ldots , x_{\rm n} \rbrace $ represents an input dataset, then $x_{\rm i}$ is an individual example represented by variables known as features. Features describe the characteristics of an example such that $x_{\rm i} = \lbrace x_{\rm i}^{\rm 1},\ldots,x_{\rm i}^{\rm m} \rbrace$, where each feature $x_{\rm i}^{\rm j} \in \mathbb{R}$ for $j=1, \ldots , m$.
The class label $y$ is also associated with each example. This can take many possible values so long as $y \in Y$. The class label can be numerical or textual, though typically each $y$ is an integer value corresponding to some class category where $y \in Y=\lbrace y_{\rm 1}, \ldots, y_{\rm k} \rbrace$. It is common for only binary labels to be considered such that $y\in Y=\lbrace -1,1 \rbrace$, where $y_{\rm 1}=-1$ equates to the non-target class (synonymous with negative) and $y_{\rm 2}=1$ to the target class (synonymous with positive). In the pulsar domain, we consider the pulsar class to be the target (positive), and all others (noise, interference) to be non-target (negative).
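To make this notation concrete, here is a minimal sketch of how such a dataset might be represented (the feature values are invented):

```python
import numpy as np

# n = 3 examples, each described by m = 2 real-valued features.
X = np.array([[0.2, 1.7],    # x_1 = {x_1^1, x_1^2}
              [0.9, 3.1],    # x_2
              [0.4, 2.2]])   # x_3

# Binary class labels: -1 for the non-target class, +1 for the target class.
y = np.array([-1, 1, -1])
```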
The features used to describe examples must be chosen after careful consideration. Features must be useful for separating the classes under consideration. For the binary case, assuming only positive and negative classes, the feature distributions for examples with the label $y_{\rm 1}=-1$ must differ from the distributions for examples with the label $y_{\rm 2}=1$. These distributions must be separable, which can be quantified mathematically, i.e. via distance measures. Feature separability is depicted in the image below according to individual features. Here $\phi$ depicts an optimal separation between the classes.
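As an illustrative sketch, one simple distance measure for a single feature is a standardised distance between the class means (the feature values below are invented):

```python
import numpy as np

# Hypothetical values of one feature, grouped by class label.
negative = np.array([0.1, 0.3, 0.2, 0.4])   # examples labelled y = -1
positive = np.array([1.1, 0.9, 1.3, 1.0])   # examples labelled y = +1

# Standardised distance between class means; larger values indicate
# feature distributions that are easier to separate.
separation = abs(positive.mean() - negative.mean()) / np.sqrt(
    positive.var() + negative.var())
print(f"Separation: {separation:.2f}")
```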
An ML function 'learns' to separate examples from different classes, using the feature data from a set of labelled examples known as the training set $T_{train}$. It contains pairs such that $T_{train}=\lbrace (x_{\rm 1},y_{\rm 1}),\ldots,(x_{\rm n},y_{\rm n})\rbrace$.
Think of the training set as the information available to a student prior to an examination. This information must be descriptive (i.e. the features must contain useful information), and reflect the topic of the exam. A classifier induces a mapping function between examples and labels based on the data in $T_{train}$. It does this by attempting to minimise the mapping errors it makes on the training examples. This process is known as 'training'.
The trained function is used to label new unseen candidates in the 'test set' $T_{test}$. The test set should be a sample of independent examples used to evaluate the trained classifier, or the real-world data the classifier is asked to classify.
The machine learning process is summarised in the diagram below. The process begins with the collection of data. The data sample is initially studied by human experts, and labelled as accurately as possible (human labelling error is possible). Once the labelled data is available, it is re-sampled into distinct subsets. These include:

- the training set $T_{train}$, used to teach the classifier,
- the test set $T_{test}$, used to evaluate the trained classifier.
Next a classification model is chosen as a candidate solution, after studying available/relevant options. The classification model is initially 'taught' using the training data. This is known as the 'training' phase. Once training is complete the 'testing' phase begins. Here data from the test set are passed through the classification model, and the predictions collected. The predictions are evaluated against the ground truth labels we already have from the data collection phase. If the model performs well, then the system can be deployed. If not, we return to the training phase. Either with the same classifier, or a new one. The entire process is repeated until good performance is achieved. Whilst this is a simplified picture, it should suffice for this work.
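A minimal end-to-end sketch of this process, assuming scikit-learn is available (the data and the choice of classifier are purely illustrative):

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

# Invented two-feature data with a simple ground-truth labelling rule.
rng = np.random.RandomState(0)
X = rng.normal(size=(200, 2))
y = (X[:, 0] + X[:, 1] > 0).astype(int)

# Re-sample the labelled data into distinct training and test subsets.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3,
                                                    random_state=0)

model = DecisionTreeClassifier().fit(X_train, y_train)  # 'training' phase
predictions = model.predict(X_test)                     # 'testing' phase
print("Accuracy:", accuracy_score(y_test, predictions))
```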
ML classifiers are simply functions. The output of functions can be visualised, making them easier to understand. Earlier in the background section I introduced the contrived classifier,
\begin{eqnarray} f(x_{i})^{\prime\prime} = x_{i}^{a} \textrm{.} \end{eqnarray}We found that $a=2$ was the optimal parameter choice, and we now visualise the function output below.
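A minimal plotting sketch reproducing this visualisation (assuming matplotlib is available):

```python
import numpy as np
import matplotlib.pyplot as plt

# Plot the contrived classifier f(x)'' = x**2 over a small input range.
x = np.linspace(-5, 5, 200)
plt.plot(x, x ** 2)
plt.xlabel("Input $x$")
plt.ylabel("Output")
plt.title("Contrived classifier output with $a = 2$")
plt.show()
```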
We can see that the function produces a simple parabola. This isn't very useful for classification! Yet visualising a classification function's output in the presence of data allows us to evaluate how well it works. Suppose there's a classifier described via a simple linear function. It produces a single line through our data as shown in the image below.
This dashed line is more commonly known as a 'decision boundary'. Here the data are partitioned by the boundary. Data points falling on the left hand side are assigned the positive label, those on the right the negative label. Depending on the true class of each example, and where they fall with respect to the decision boundary, there are four possible outcomes in a binary classification problem. These are summarised in the table below, which is known as a confusion matrix.
| True Label | Predicted Negative | Predicted Positive |
|---|---|---|
| Negative | True Negative (TN) | False Positive (FP) |
| Positive | False Negative (FN) | True Positive (TP) |
From these outcomes we can quantify how well a classifier performs. We do this using some common metrics such as accuracy, precision, and recall. These are described in the Table below.
| Metric | Description | Definition |
|---|---|---|
| Accuracy | Fraction of all instances correctly classified. | $\frac{(TP+TN)}{(TP+FP+FN+TN)}$ |
| False positive rate (FPR) | Fraction of negative instances incorrectly labelled positive. | $\frac{FP}{(FP +TN)}$ |
| Precision | Fraction of retrieved instances that are positive. | $\frac{TP}{(TP+FP)}$ |
| Recall | Fraction of positive instances that are retrieved. | $\frac{TP}{(TP + FN)}$ |
| Specificity | Fraction of negatives correctly identified as such. | $ \frac{TN}{(FP + TN)}$ |
| F-Score | Measure of accuracy that considers both precision and recall. | $2\times{ \frac{precision\times{recall}} {precision + recall}} $ |
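These metrics are straightforward to compute from the four confusion matrix counts. A quick sketch, with invented counts purely for illustration:

```python
# Hypothetical confusion matrix counts, purely for illustration.
tp, fp, fn, tn = 40, 10, 5, 45

accuracy    = (tp + tn) / (tp + fp + fn + tn)
fpr         = fp / (fp + tn)            # false positive rate
precision   = tp / (tp + fp)
recall      = tp / (tp + fn)
specificity = tn / (fp + tn)
f_score     = 2 * precision * recall / (precision + recall)

print(f"Accuracy: {accuracy:.2f}, FPR: {fpr:.2f}, Precision: {precision:.2f}, "
      f"Recall: {recall:.2f}, Specificity: {specificity:.2f}, "
      f"F-Score: {f_score:.2f}")
```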
Ideally we'd like a classifier that produces high levels of recall and accuracy, whilst simultaneously minimising the false positive rate. This becomes especially important as our data becomes increasingly complex. Complex data usually require more sophisticated decision boundaries to attain a clear class separation. Even in the trivial example shown in the image below, we can see how a perfect boundary might look. This greatly improves on the simple linear separator, but is more difficult to compute.
That's it for now. More detailed information can be found elsewhere, but I hope it helps.