Goals:
Understand what is meant by the term "machine learning", and learn the rudiments of the language that goes with it.
See how some of the methods of basic machine learning, and the underlying philosophy, relate to the statistical inference we have been studying in the rest of the course.
Examples of unsupervised learning activities include: clustering, dimensionality reduction, and density estimation.
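As a minimal sketch of one such activity, here is clustering with scikit-learn's KMeans on synthetic data (the two "blobs" and all parameter values are illustrative assumptions, not from the notes); note that no labels are used in the fit:

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
# Synthetic data: two well-separated blobs of points in 2D
X = np.vstack([rng.normal(0.0, 0.5, size=(50, 2)),
               rng.normal(5.0, 0.5, size=(50, 2))])

# Unsupervised: KMeans partitions the points without ever seeing labels
km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
print(km.labels_[:5])  # cluster assignments for the first few points
```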
Typically a supervised learning model is "trained" on a subset of the data, and then its ability to make predictions about new data "tested" on the remainder.
Training involves "fitting" the model to the data, optimizing its parameters to minimize some "loss function" (or equivalently, maximize some defined "score").
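The train/fit/test pattern can be sketched as follows, assuming a simple linear regression model and synthetic data with a known linear trend (both choices are illustrative, not prescribed by the notes):

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(42)
X = rng.uniform(0, 10, size=(200, 1))
y = 2.0 * X[:, 0] + 1.0 + rng.normal(0, 0.5, size=200)

# "Train" on a subset of the data, "test" on the remainder
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0)

# Fitting optimizes the parameters to minimize a (here, squared) loss
model = LinearRegression().fit(X_train, y_train)

# The test-set prediction error estimates performance on new data
mse = mean_squared_error(y_test, model.predict(X_test))
print(f"test MSE = {mse:.3f}")
```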
Examples of data-driven, non-parametric models for use in supervised learning include K-nearest neighbors, Support Vector Machines, Random Forest, Neural Networks, and many more.
Many can be used for either classification or regression.
All have a number of hyper-parameters that govern their overall behavior, that need to be determined for any given dataset.
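For concreteness, here are some of these models as instantiated in scikit-learn, each with a few of its hyper-parameters shown; the particular values below are illustrative defaults, not tuned recommendations:

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.neural_network import MLPClassifier
from sklearn.svm import SVC

# Each model's behavior is governed by hyper-parameters that must be
# determined (e.g. by cross-validation) for any given dataset.
models = {
    "KNN": KNeighborsClassifier(n_neighbors=5),            # number of neighbors
    "SVM": SVC(C=1.0, kernel="rbf", gamma="scale"),        # regularization, kernel
    "RandomForest": RandomForestClassifier(n_estimators=100, max_depth=None),
    "NeuralNetwork": MLPClassifier(hidden_layer_sizes=(50,), alpha=1e-4,
                                   max_iter=500),
}

for name, model in models.items():
    print(name, type(model).__name__)
```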
The scikit-learn algorithm cheat-sheet, as provided with the package documentation.
In supervised machine learning the goal is to make the most accurate predictions we can - which means neither over-fitting nor under-fitting the data.
The "mean squared error" between the model predictions and the truth is a useful metric: minimizing MSE corresponds to minimizing the "empirical risk" (the loss function averaged over the available data samples) when the loss function is quadratic
$\;\;\;\;\;{\rm MSE} = \mathcal{E} \left[ (\hat{y} - y^{\rm true})^2 \right] = \mathcal{E} \left[ (\hat{y} - \bar{y} + \bar{y} - y^{\rm true})^2 \right] = \mathcal{E} \left[ (\hat{y} - \bar{y})^2 \right] + (\bar{y} - y^{\rm true})^2$
$\;\;\;\;\;\;\;\;\;\;\;\;\; = {\rm var}(\hat{y}) + {\rm bias}^2(\hat{y})$
where $\bar{y} = \mathcal{E}[\hat{y}]$ is the mean prediction, so that the cross term $2\,\mathcal{E}\left[\hat{y} - \bar{y}\right](\bar{y} - y^{\rm true})$ vanishes.
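This decomposition can be checked numerically by simulating many noisy, biased estimates of a known true value (the bias and scatter below are arbitrary illustrative choices):

```python
import numpy as np

rng = np.random.default_rng(1)
y_true = 3.0

# Simulated predictions: biased by +0.5, with scatter 0.8
y_hat = y_true + 0.5 + rng.normal(0.0, 0.8, size=100_000)

mse = np.mean((y_hat - y_true) ** 2)   # empirical risk with quadratic loss
var = np.var(y_hat)                    # variance of the predictions
bias2 = (np.mean(y_hat) - y_true) ** 2 # squared bias of the predictions

print(mse, var + bias2)  # the two quantities agree
```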
In general, different models reach different balances between the variance and bias of their predictions
A particular choice of loss function leads to a corresponding minimized risk
With a single training/test split, one can characterize the prediction error using, for example, the MSE.
The model that minimizes the generalized prediction error can be found (approximately) with cross validation, in which we consider multiple training/test splits, and look at the mean prediction error across all of these "folds."
How we design the folds matters: we want each subset of the data to be a fair sample of the whole.
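A minimal cross-validation sketch, assuming the same kind of synthetic linear data as before; shuffling before splitting (via `KFold(shuffle=True)`) helps make each fold a fair sample of the whole:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_val_score

rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(100, 1))
y = 2.0 * X[:, 0] + rng.normal(0, 0.5, size=100)

# Five training/test splits ("folds"), shuffled so each is representative
folds = KFold(n_splits=5, shuffle=True, random_state=0)
scores = cross_val_score(LinearRegression(), X, y,
                         cv=folds, scoring="neg_mean_squared_error")

# The mean prediction error across the folds approximates the
# generalized prediction error
print("mean CV MSE:", -scores.mean())
```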
Another layer of cross-validation is still needed, since we must also guard against over-fitting the hyper-parameters to any particular training set: in principle we would like to try many different training/test splits.
Once we have the hyper-parameters that optimize the generalized prediction error, we can then fix them at their optimal values and train the model on the entire data set.
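This tune-then-refit workflow is what scikit-learn's `GridSearchCV` implements: it cross-validates each hyper-parameter setting and then (with the default `refit=True`) retrains the best model on the full dataset. The model choice and grid below are illustrative assumptions:

```python
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsRegressor

rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(200, 1))
y = np.sin(X[:, 0]) + rng.normal(0, 0.1, size=200)

# Cross-validate each candidate n_neighbors, then refit on all the data
search = GridSearchCV(KNeighborsRegressor(),
                      param_grid={"n_neighbors": [1, 3, 5, 10, 20]},
                      cv=5, scoring="neg_mean_squared_error")
search.fit(X, y)

print("best n_neighbors:", search.best_params_["n_neighbors"])
```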
Question: when we use our machine to predict the target variables (labels) in a new dataset, what key assumption are we making? How do we test it?
Talk to your neighbor for a few minutes about the things you have just heard about machine learning. Be prepared to discuss the following questions:
In this course so far have we been doing supervised or unsupervised learning problems?
Have we been talking about regression or classification problems?
Does machine learning sound Bayesian, Frequentist, or neither, to you?
What do you make of the emphasis on prediction accuracy in machine learning? How relevant is it to science?
In the Machine Learning tutorial you will use the scikit-learn Python package to try out some machine learning models, and see how generalized prediction accuracy is estimated during a cross-validation analysis of the model hyper-parameters.