Machine Learning cheat sheet

This is a short cheat sheet for doing machine learning experiments. It briefly discusses data exploration, the modelling pipeline, and model validation, with links to external resources for reference.

Data Exploration

When modelling with machine learning it's tempting to just present your data to a model as-is. While this has the odd chance of working, more likely it will give you a very weak model.

The first step of the machine learning process is therefore to look at your data, and at your chosen representation of it.

Visualization

Whether your data is sequential, real-valued continuous, or image based, it should be visualized to get a reference for its distribution. In particular, for representations of continuous or discrete variables, visualizing the distributions can be hugely helpful in determining the style of normalization to apply. In short: look at your data.
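As a minimal sketch, assuming your features sit in a pandas DataFrame of numeric columns (the name `df` is just a placeholder), per-column histograms are often enough for a first look:

```python
# A first look at feature distributions via per-column histograms.
# `df` is assumed to be a pandas DataFrame of numeric features.
import matplotlib.pyplot as plt
import pandas as pd


def plot_feature_distributions(df: pd.DataFrame, bins: int = 30) -> None:
    """Histogram every numeric column to inspect its distribution."""
    df.select_dtypes(include="number").hist(bins=bins, figsize=(12, 8))
    plt.tight_layout()
    plt.show()
```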

Normalizing

After inspecting your data distribution(s) you should consider which standardization techniques to apply. In particular, think about whether you need the covariance matrix of your data to be unchanged under the normalization. Mean centering doesn't change your covariance matrix; standardizing the variance does.
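A quick numpy check of this claim, on random data that is only there for illustration:

```python
# Mean centering leaves the covariance matrix unchanged,
# scaling each feature to unit variance does not.
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 3)) @ np.array([[1.0, 0.5, 0.0],
                                          [0.0, 1.0, 0.3],
                                          [0.0, 0.0, 1.0]])

centered = X - X.mean(axis=0)
standardized = centered / X.std(axis=0)

print(np.allclose(np.cov(X, rowvar=False), np.cov(centered, rowvar=False)))      # True
print(np.allclose(np.cov(X, rowvar=False), np.cov(standardized, rowvar=False)))  # False (in general)
```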

Also consider whether your features are on the same scale. Your model might be in trouble if one feature is on the order of $10^0$ and a second is on the order of $10^3$.
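A minimal sketch with scikit-learn's StandardScaler, on made-up data where one column is three orders of magnitude larger than the other:

```python
# Rescale features with very different magnitudes to mean 0, variance 1.
import numpy as np
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X = np.column_stack([rng.random(100),         # order 10^0
                     rng.random(100) * 1e3])  # order 10^3

X_scaled = StandardScaler().fit_transform(X)
print(X_scaled.mean(axis=0), X_scaled.std(axis=0))  # ~[0, 0], [1, 1]
```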

Prototyping

It's likely that the first version of your chosen model, or even the model type or representation itself, will prove unsuitable for the problem at hand. A fundamental part of the process is therefore exploring representations and model types that might be suitable. The Keras API and scikit-learn's off-the-shelf algorithms are fantastic for this purpose.
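As an illustration of that kind of prototyping loop (the dataset and candidate models below are just examples), scikit-learn makes it cheap to compare a few off-the-shelf classifiers:

```python
# Try several off-the-shelf models on the same data and compare
# cross-validated scores before committing to one.
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)
candidates = {
    "logistic regression": LogisticRegression(max_iter=1000),
    "random forest": RandomForestClassifier(),
    "svm (rbf)": SVC(),
}
for name, model in candidates.items():
    scores = cross_val_score(model, X, y, cv=5)
    print(f"{name}: {scores.mean():.3f} +/- {scores.std():.3f}")
```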

Remember to keep your model assumptions in mind.

Blogs about this:

Model Validation

For a regression problem, supervised classification, or unsupervised classification with a set of ground-truth labels, it is absolutely necessary to split your data into disjoint training and test sets prior to any processing. The processing parameters should be estimated from the training set and then applied to both the training and test sets. This lets you estimate whether your normalization is sane for "unseen" data.
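A minimal sketch of that split-then-normalize workflow, on an example scikit-learn dataset:

```python
# Split first, then estimate normalization parameters on the training set
# only, and apply the same transform to both sets.
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)  # example dataset
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)

scaler = StandardScaler().fit(X_train)  # parameters from training data only
X_train = scaler.transform(X_train)
X_test = scaler.transform(X_test)       # same transform, no refitting on test
```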

Hyperparameter tuning

For almost all interesting problems, running a grid search to find the optimal hyperparameter configuration is not feasible: not only is it computationally expensive, but the loss surface might be glassy except for one very narrow spike that your grid doesn't hit. There are many sophisticated tools for the job, but for most cases simple random search has proven to be the tool for the job: run N experiments with random hyperparameter configurations and pick the one that performs best on the validation set.
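A sketch of random search with scikit-learn's RandomizedSearchCV; the model and the sampling ranges here are only illustrative:

```python
# Random search: sample n_iter random configurations and keep the one
# with the best cross-validated score.
from scipy.stats import loguniform
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import RandomizedSearchCV
from sklearn.svm import SVC

X, y = load_breast_cancer(return_X_y=True)
search = RandomizedSearchCV(
    SVC(),
    param_distributions={"C": loguniform(1e-3, 1e3),
                         "gamma": loguniform(1e-4, 1e1)},
    n_iter=50,        # N random configurations
    cv=5,
    random_state=0,
)
search.fit(X, y)
print(search.best_params_, search.best_score_)
```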

Performance estimation

The right tool for estimating your model's performance depends on your application. For regression, the $R^2$ coefficient (explained variance) is commonly used, and for linear regression models you usually want an estimate of coefficient significance (the statsmodels Python package is good for this). For estimating significance and impact you should have a clear picture of the degree of multicollinearity in your data (does your design matrix have full rank?). For non-linear models on 1D feature vectors (not sequence or image models), estimating feature importance (the impact of removing one feature on model performance) is useful. For image data and convolutional nets, visualizing the maximal activations of the filters is a good way to estimate what your model is doing.
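A sketch of two of these estimates for a non-linear regression model: the $R^2$ score on held-out data, and a permutation-based feature importance (which shuffles a feature rather than removing it, a common stand-in for the removal test). The dataset and model are only examples.

```python
# R^2 on a held-out set, plus permutation feature importance.
from sklearn.datasets import load_diabetes
from sklearn.ensemble import RandomForestRegressor
from sklearn.inspection import permutation_importance
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split

X, y = load_diabetes(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

model = RandomForestRegressor(random_state=0).fit(X_train, y_train)
print("R^2:", r2_score(y_test, model.predict(X_test)))

# Drop in performance when a feature is shuffled ~ its importance.
importances = permutation_importance(model, X_test, y_test,
                                     n_repeats=10, random_state=0)
print(importances.importances_mean)
```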

For classification there is usually a tradeoff between the number of true and false positives. The F1 score is a good single-number measure (depending on your problem you might favor precision over recall or vice versa, which calls for an adjusted F-score). Plotting the ROC (receiver operating characteristic) curve and estimating its area under the curve (analogous to accuracy) is also useful. These curves are plotted on a per-class basis, and you can then average over them to produce an aggregate performance measure.
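A sketch of these measures on a binary example dataset: the F1 score, the ROC curve points, and the area under the curve.

```python
# F1 score plus ROC curve and AUC for a binary classifier.
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score, roc_auc_score, roc_curve
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

clf = make_pipeline(StandardScaler(),
                    LogisticRegression(max_iter=1000)).fit(X_train, y_train)
y_pred = clf.predict(X_test)
y_score = clf.predict_proba(X_test)[:, 1]  # probability of the positive class

print("F1:", f1_score(y_test, y_pred))
fpr, tpr, _ = roc_curve(y_test, y_score)   # points of the ROC curve
print("AUC:", roc_auc_score(y_test, y_score))
```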

You should also use some statistical measure for model validation; k-fold cross-validation is recommended for most cases.
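A minimal 5-fold cross-validation sketch with scikit-learn (the model is just an example):

```python
# 5-fold cross-validation: the mean gives a performance estimate,
# the standard deviation its spread across folds.
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)
model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))

cv = KFold(n_splits=5, shuffle=True, random_state=0)
scores = cross_val_score(model, X, y, cv=cv)
print(f"{scores.mean():.3f} +/- {scores.std():.3f}")
```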

In essence, the goal of all this is to be able to estimate generalization performance. The training set is used to estimate the optimal parameters and hyperparameters, as well as the distribution of performance under these. After each hyperparameter search you use the test set to make a final estimate of your generalization performance; going back and adjusting parameters based on test performance strongly reduces your certainty about generalization and is strongly discouraged.

Scikit-learn has many off-the-shelf measures for model performance and validation, with examples.

Blogs about this: