Goals:
Understand what is meant by the term "machine learning", and learn the rudiments of the language that goes with it.
See how some of the methods of basic machine learning, and the underlying philosophy, relate to the statistical inference we have been studying in the rest of the course.
Examples of unsupervised learning activities include: clustering, dimensionality reduction, and density estimation.
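As a minimal sketch of one such activity, here is clustering with scikit-learn's KMeans on synthetic data (the two "blobs" and all parameter values are illustrative assumptions, not from the notes); note that no labels are used in the fit:

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
# Synthetic data: two well-separated blobs of points in 2D
X = np.vstack([rng.normal(0.0, 0.5, size=(50, 2)),
               rng.normal(5.0, 0.5, size=(50, 2))])

# Unsupervised: KMeans partitions the points without ever seeing labels
km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
print(km.labels_[:5])  # cluster assignments for the first few points
```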
Typically a supervised learning model is "trained" on a subset of the data, and then its ability to make predictions about new data "tested" on the remainder.
Training involves "fitting" the model to the data, optimizing its parameters to minimize some "loss function" (or equivalently, maximize some defined "score").
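The train/fit/test pattern can be sketched as follows, assuming a simple linear regression model and synthetic data with a known linear trend (both choices are illustrative, not prescribed by the notes):

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(42)
X = rng.uniform(0, 10, size=(200, 1))
y = 2.0 * X[:, 0] + 1.0 + rng.normal(0, 0.5, size=200)

# "Train" on a subset of the data, "test" on the remainder
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0)

# Fitting optimizes the parameters to minimize a (here, squared) loss
model = LinearRegression().fit(X_train, y_train)

# The test-set prediction error estimates performance on new data
mse = mean_squared_error(y_test, model.predict(X_test))
print(f"test MSE = {mse:.3f}")
```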
Examples of data-driven, non-parametric models for use in supervised learning include K-nearest neighbors, Support Vector Machines, Random Forest, Neural Networks, and many more.
Many can be used for either classification or regression.
All have a number of hyper-parameters that govern their overall behavior, that need to be determined for any given dataset.
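For concreteness, here are some of these models as instantiated in scikit-learn, each with a few of its hyper-parameters shown; the particular values below are illustrative defaults, not tuned recommendations:

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.neural_network import MLPClassifier
from sklearn.svm import SVC

# Each model's behavior is governed by hyper-parameters that must be
# determined (e.g. by cross-validation) for any given dataset.
models = {
    "KNN": KNeighborsClassifier(n_neighbors=5),            # number of neighbors
    "SVM": SVC(C=1.0, kernel="rbf", gamma="scale"),        # regularization, kernel
    "RandomForest": RandomForestClassifier(n_estimators=100, max_depth=None),
    "NeuralNetwork": MLPClassifier(hidden_layer_sizes=(50,), alpha=1e-4,
                                   max_iter=500),
}

for name, model in models.items():
    print(name, type(model).__name__)
```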
The scikit-learn algorithm cheat-sheet, as provided with the package documentation.
In supervised machine learning the goal is to make the most accurate predictions we can - which means neither over-fitting nor under-fitting the data.
The "mean squared error" between the model predictions and the truth is a useful metric: minimizing MSE corresponds to minimizing the "empirical risk" (the loss function averaged over the available data samples) when the loss function is quadratic
$\;\;\;\;\;{\rm MSE} = \mathcal{E} \left[ (\hat{y} - y^{\rm true})^2 \right] = \mathcal{E} \left[ (\hat{y} - \bar{y} + \bar{y} - y^{\rm true})^2 \right] = \mathcal{E} \left[ (\hat{y} - \bar{y})^2 \right] + (\bar{y} - y^{\rm true})^2$
$\;\;\;\;\;\;\;\;\;\;\;\;\; = {\rm var}(\hat{y}) + {\rm bias}^2(\hat{y})$
where $\bar{y} = \mathcal{E}[\hat{y}]$ is the mean prediction, so that the cross term $2\,\mathcal{E}\left[\hat{y} - \bar{y}\right](\bar{y} - y^{\rm true})$ vanishes.
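This decomposition can be checked numerically by simulating many noisy, biased estimates of a known true value (the bias and scatter below are arbitrary illustrative choices):

```python
import numpy as np

rng = np.random.default_rng(1)
y_true = 3.0

# Simulated predictions: biased by +0.5, with scatter 0.8
y_hat = y_true + 0.5 + rng.normal(0.0, 0.8, size=100_000)

mse = np.mean((y_hat - y_true) ** 2)   # empirical risk with quadratic loss
var = np.var(y_hat)                    # variance of the predictions
bias2 = (np.mean(y_hat) - y_true) ** 2 # squared bias of the predictions

print(mse, var + bias2)  # the two quantities agree
```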
In general, different models reach different balances between the variance and bias of their predictions
A particular choice of loss function leads to a corresponding minimized risk
With a single training/test split, one can characterize the prediction error using, for example, the MSE.
The model that minimizes the generalized prediction error can be found (approximately) with cross validation, in which we consider multiple training/test splits, and look at the mean prediction error across all of these "folds."
How we design the folds matters: we want each subset of the data to be a fair sample of the whole.
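A minimal cross-validation sketch, assuming the same kind of synthetic linear data as before; shuffling before splitting (via `KFold(shuffle=True)`) helps make each fold a fair sample of the whole:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_val_score

rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(100, 1))
y = 2.0 * X[:, 0] + rng.normal(0, 0.5, size=100)

# Five training/test splits ("folds"), shuffled so each is representative
folds = KFold(n_splits=5, shuffle=True, random_state=0)
scores = cross_val_score(LinearRegression(), X, y,
                         cv=folds, scoring="neg_mean_squared_error")

# The mean prediction error across the folds approximates the
# generalized prediction error
print("mean CV MSE:", -scores.mean())
```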
Another layer of cross-validation is still needed, since we must also guard against over-fitting the hyper-parameters to any particular training set: in principle we would like to try many different training/test splits.
Once we have the hyper-parameters that optimize the generalized prediction error, we can then fix them at their optimal values and train the model on the entire data set.
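This tune-then-refit workflow is what scikit-learn's `GridSearchCV` implements: it cross-validates each hyper-parameter setting and then (with the default `refit=True`) retrains the best model on the full dataset. The model choice and grid below are illustrative assumptions:

```python
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsRegressor

rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(200, 1))
y = np.sin(X[:, 0]) + rng.normal(0, 0.1, size=200)

# Cross-validate each candidate n_neighbors, then refit on all the data
search = GridSearchCV(KNeighborsRegressor(),
                      param_grid={"n_neighbors": [1, 3, 5, 10, 20]},
                      cv=5, scoring="neg_mean_squared_error")
search.fit(X, y)

print("best n_neighbors:", search.best_params_["n_neighbors"])
```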
Question: when we use our machine to predict the target variables (labels) in a new dataset, what key assumption are we making? How do we test it?
Talk to your neighbor for a few minutes about the things you have just heard about machine learning. Be prepared to discuss the following questions:
In this course so far have we been doing supervised or unsupervised learning problems?
Have we been talking about regression or classification problems?
Does machine learning sound Bayesian, Frequentist, or neither, to you?
What do you make of the emphasis on prediction accuracy in machine learning? How relevant is it to science?
In the Machine Learning tutorial you will use the scikit-learn Python package to try out some machine learning models, and see how generalized prediction accuracy is estimated during a cross-validation analysis of the model hyper-parameters.