This notebook was put together by [Jake Vanderplas](http://www.vanderplas.com) for UW's [Astro 599](http://www.astro.washington.edu/users/vanderplas/Astr599/) course. Source and license info is on [GitHub](https://github.com/jakevdp/2013_fall_ASTR599/).
This is an intermediate-level tutorial on Scikit-learn. It will assume that you're familiar with the following:
- fit() method for supervised and unsupervised learning
- predict() method for supervised learning (also clustering)
- transform() method in unsupervised learning
- score() method in supervised and unsupervised learning

If you need to brush up on any of this, you can check out the notebooks associated with my Astr 599: Scientific Computing course at the University of Washington.
Here we will very briefly review these points, before getting into the main part of this tutorial, which focuses on validation and model selection.
Machine Learning is about building programs with tunable parameters (typically an array of floating point values) that are adjusted automatically so as to improve their behavior by adapting to previously seen data.
Machine Learning can be considered a subfield of Artificial Intelligence, since these algorithms can be seen as building blocks that make computers learn to behave more intelligently by somehow generalizing, rather than just storing and retrieving data items like a database system would do.
Data in scikit-learn is represented as a feature matrix and a label vector
$$ {\rm feature~matrix:~~~} {\bf X}~=~\left[ \begin{matrix} x_{11} & x_{12} & \cdots & x_{1D}\\ x_{21} & x_{22} & \cdots & x_{2D}\\ x_{31} & x_{32} & \cdots & x_{3D}\\ \vdots & \vdots & \ddots & \vdots\\ x_{N1} & x_{N2} & \cdots & x_{ND}\\ \end{matrix} \right] $$

$$ {\rm label~vector:~~~} {\bf y}~=~ [y_1, y_2, y_3, \cdots y_N] $$

Here there are $N$ samples and $D$ features.
Several example datasets are available in the sklearn.datasets module:
In [1]:
from sklearn import datasets
iris = datasets.load_iris()
print iris.keys()
In [2]:
X = iris.data
y = iris.target
In [3]:
print X.shape
print y.shape
We see that the iris data has 150 samples and 4 features: that is, observations of 4 measurements on each of 150 different iris flowers.
In code, we usually denote these numbers as n_samples and n_features:
In [4]:
n_samples, n_features = X.shape
len(y) == n_samples
Out[4]:
In [5]:
print iris.feature_names
print len(iris.feature_names) == n_features
Features can be any uniformly measured numerical observation of the data. For example, in the digits data, the features are the brightness of each pixel:
In [6]:
digits = datasets.load_digits()
digits.data.shape
Out[6]:
In [7]:
%matplotlib inline
import numpy as np
import matplotlib.pyplot as plt
In [8]:
fig, ax = plt.subplots(8, 12, subplot_kw={'xticks':[], 'yticks':[]})
for i in range(ax.size):
    ax.flat[i].imshow(digits.data[i].reshape(8, 8),
                      cmap=plt.cm.binary)
Later, we'll see an example of using the digits data in a learning problem.
The algorithms of machine learning are generally split into two basic categories: Supervised and Unsupervised learning.
Supervised learning concerns labeled data, and the construction of models that can be used to predict labels for new, unlabeled data.
Example: given a set of labeled hand-written digits, create an algorithm which will predict the label of a new instance for which a label is not known.
Unsupervised learning concerns unlabeled data, and finding structure in the data such as clusters, important dimensions, etc.
Example: given a set of unlabeled digits, determine which digits are related.
In [9]:
from sklearn.naive_bayes import GaussianNB
X = digits.data
y = digits.target
# Instantiate the estimator
clf = GaussianNB()
# Fit the estimator to the data, leaving out the last five samples
clf.fit(X[:-5], y[:-5])
# Use the model to predict the last several labels
y_pred = clf.predict(X[-5:])
print y_pred
print y[-5:]
We see that this relatively simple model leads to a perfect classification of the last few digits!
Let's use the model to predict labels for the full dataset, and plot the confusion matrix, which is a convenient visual representation of how well the classifier performs:
In [10]:
from sklearn import metrics
clf = GaussianNB()
clf.fit(X, y)
y_pred = clf.predict(X)
def plot_confusion_matrix(y_true, y_pred):
    plt.imshow(metrics.confusion_matrix(y_true, y_pred),
               cmap=plt.cm.binary, interpolation='nearest')
    plt.colorbar()
    plt.xlabel('predicted value')
    plt.ylabel('true value')

print "classification accuracy:", metrics.accuracy_score(y, y_pred)
plot_confusion_matrix(y, y_pred)
Interestingly, there is confusion between some values. In particular, the number 2 is often mistaken for the number 8 by this model! But for the vast majority of digits, we can see that the classification looks correct.
Let's use the metrics submodule again to print the accuracy of the classification:
In [11]:
print metrics.accuracy_score(y, y_pred)
We have an 82% accuracy rate with this particular model.
But there's a problem: we are testing the model on the data we used to train the model. As we'll see later, this is generally not a good approach to model validation! Because of the nature of the Naive Bayes estimator, it's alright in this case, but we'll see later examples where this approach causes problems.
Now suppose you want to visualize the relationships between the digits samples. There are 1797 samples in 64 dimensions: if we could simply draw a 64-dimensional scatter plot, we'd be able to see the relationships directly. Unfortunately, most humans can't visualize more than 3 dimensions, so we have to project the data.
Let's try a random projection solution first:
In [12]:
np.random.seed(0)
proj = np.random.normal(size=(X.shape[1], 2))
X_rand = np.dot(X, proj)
fig, ax = plt.subplots()
im = ax.scatter(X_rand[:, 0], X_rand[:, 1], c=y)
fig.colorbar(im);
Let's now use an unsupervised learning approach to visualize this same data. We'll reduce the dimensionality of the data using Principal Component Analysis (PCA):
In [13]:
from sklearn.decomposition import PCA
# instantiate the model
model = PCA(n_components=2)
# fit the model: notice we don't pass the labels!
model.fit(X)
# transform the data to two dimensions
X_PCA = model.transform(X)
print "shape of result:", X_PCA.shape
# plot the results along with the labels
fig, ax = plt.subplots()
im = ax.scatter(X_PCA[:, 0], X_PCA[:, 1], c=y)
fig.colorbar(im);
We've reduced the dimensionality to two in a way that takes into account the structure of the data. By comparison, the random projection above does not preserve this structure.
An important piece of the learning task is the measurement of prediction performance, also known as model validation. We'll go into detail about this, but first motivate the approach with an example.
Above we looked at a confusion matrix, which can be computed based on the results of any model. Let's look at another classification scheme here, the K-Neighbors Classifier:
In [14]:
from sklearn.neighbors import KNeighborsClassifier
from sklearn import datasets
digits = datasets.load_digits()
X, y = digits.data, digits.target
clf = KNeighborsClassifier(n_neighbors=1)
clf.fit(X, y)
y_pred = clf.predict(X)
print "classification accuracy:", metrics.accuracy_score(y, y_pred)
plot_confusion_matrix(y, y_pred)
Our classifier gives perfect results! Have we settled on a perfect classification scheme?
No! The K-neighbors classifier is an example of an instance-based classifier, which memorizes the input data and compares any unknown sample to it. To accurately measure the performance, we need to use a separate validation set, which the model has not yet seen.
Scikit-learn contains utilities to split data into a training and validation set:
In [15]:
from sklearn.cross_validation import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3)
print X_train.shape, X_test.shape
In [16]:
clf = KNeighborsClassifier(n_neighbors=1)
clf.fit(X_train, y_train)
y_pred = clf.predict(X_test)
print "classification accuracy:", metrics.accuracy_score(y_test, y_pred)
plot_confusion_matrix(y_test, y_pred)
This gives us a more accurate indication of how well the model is performing.
For this reason you should always do a train/test split when validating a model.
Above, we used perhaps the simplest evaluation metric: the number of matches and mis-matches. But this is not always sufficient. For example, imagine you have a situation where you'd like to identify a rare class of event from within a large number of background sources (in my field, an example of this is finding variable stars against the background of non-varying stars). This "number of matches" metric can break down:
In [17]:
# Generate an un-balanced 2D dataset
np.random.seed(0)
X = np.vstack([np.random.normal(0, 1, (950, 2)),
               np.random.normal(-1.8, 0.8, (50, 2))])
y = np.hstack([np.zeros(950), np.ones(50)])
plt.scatter(X[:, 0], X[:, 1], c=y, edgecolors='none',
            cmap=plt.cm.Accent)
Out[17]:
In [ ]:
from sklearn.svm import SVC
# First instantiate the "Support Vector Classifier" (SVC) model
# Next split the data (X and y) into a training and test set
# fit the model to the training data
# compute y_pred, the predicted labels of the test data
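For reference, here is a minimal sketch of one possible way to fill in the cell above (the exact numbers will depend on how you split the data):

In [ ]:

# One possible solution sketch
# split the unbalanced data into a training and test set
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# fit the Support Vector Classifier to the training data
clf = SVC().fit(X_train, y_train)

# compute y_pred, the predicted labels of the test data
y_pred = clf.predict(X_test)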
In [ ]:
# Now that this is finished, we'll compute the classification rate
print "accuracy:", np.sum(y_test == y_pred) * 1. / len(y_test)
Did you do well? Would you say that this is a good classification scheme for the problem?
The problem here is that we might not care how well we can classify the background, but might instead be concerned with successfully pulling-out an uncontaminated set of foreground sources. We can get at this by computing statistics such as the precision, the recall, and the f1 score:
In [19]:
from sklearn import metrics
from sklearn.svm import SVC
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
clf = SVC().fit(X_train, y_train)
y_pred = clf.predict(X_test)
print "accuracy:", metrics.accuracy_score(y_test, y_pred)
print "precision:", metrics.precision_score(y_test, y_pred)
print "recall:", metrics.recall_score(y_test, y_pred)
print "f1 score:", metrics.f1_score(y_test, y_pred)
These are ways of taking into account not just the classification results, but the results relative to the true category.
$$ {\rm accuracy} \equiv \frac{\rm correct~labels}{\rm total~samples} $$
$$ {\rm precision} \equiv \frac{\rm true~positives}{\rm true~positives + false~positives} $$
$$ {\rm recall} \equiv \frac{\rm true~positives}{\rm true~positives + false~negatives} $$
$$ F_1 \equiv 2 \frac{\rm precision \cdot recall}{\rm precision + recall} $$
The accuracy, precision, recall, and f1-score all range from 0 to 1, with 1 being optimal. Here we've used the following definitions:

- true positives: samples labeled 1 which are actually 1
- false positives: samples labeled 1 which are actually 0
- true negatives: samples labeled 0 which are actually 0
- false negatives: samples labeled 0 which are actually 1
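As a sanity check, these quantities can also be computed directly from the raw label counts; a quick sketch using the y_test and y_pred arrays from above:

In [ ]:

# count true positives, false positives, and false negatives by hand
TP = np.sum((y_pred == 1) & (y_test == 1))
FP = np.sum((y_pred == 1) & (y_test == 0))
FN = np.sum((y_pred == 0) & (y_test == 1))

print "precision:", TP * 1. / (TP + FP)
print "recall:", TP * 1. / (TP + FN)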
We can quickly compute a summary of these statistics using scikit-learn's provided convenience function:
In [20]:
print metrics.classification_report(y_test, y_pred,
                                    target_names=['background', 'foreground'])
This tells us that, though the overall correct classification rate is 97%, we only correctly identify 67% of the desired samples, and those that we label as positives are only 83% correct! This is why you should make sure to carefully choose your metric when validating a model.
Compare the K Neighbors Classifier, the Gaussian Naive Bayes Classifier, and the Support Vector Machine Classifier on this toy dataset, using their default parameters. Which is most accurate? Which is most precise? Which has the best recall?
Imagine that you are a medical doctor: the features are biometric measurements, and the labels indicate whether or not each patient has a particular disease.
For your patients, which classifier might you choose? (That is, would you be more concerned with accuracy, precision, or recall?)
In [ ]:
In [ ]:
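One possible sketch of the classifier comparison, using the default parameters of each model and a single train/test split:

In [ ]:

from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC

# use the same train/test split for each classifier
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

for Model in [KNeighborsClassifier, GaussianNB, SVC]:
    y_pred = Model().fit(X_train, y_train).predict(X_test)
    print Model.__name__
    print "  accuracy: ", metrics.accuracy_score(y_test, y_pred)
    print "  precision:", metrics.precision_score(y_test, y_pred)
    print "  recall:   ", metrics.recall_score(y_test, y_pred)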
Using the simple train/test split as above can be useful, but there is a disadvantage: You're ignoring a portion of your dataset. One way to address this is to use cross-validation.
The simplest cross-validation scheme involves running two trials: split the data into two parts, train on the first part and evaluate on the second, then swap the roles of the two parts:
In [21]:
X1, X2, y1, y2 = train_test_split(X, y, test_size=0.5)
print X1.shape
print X2.shape
In [22]:
y2_pred = SVC().fit(X1, y1).predict(X2)
y1_pred = SVC().fit(X2, y2).predict(X1)
print np.mean([metrics.precision_score(y1, y1_pred),
               metrics.precision_score(y2, y2_pred)])
This is known as two-fold cross-validation, and is a special case of K-fold cross validation.
Because it's such a common routine, scikit-learn has a K-fold cross-validation scheme built-in:
In [23]:
from sklearn.cross_validation import cross_val_score
# Let's do a 2-fold cross-validation of the SVC estimator
print cross_val_score(SVC(), X, y, cv=2, scoring='precision')
It's also possible to use sklearn.cross_validation.KFold and sklearn.cross_validation.StratifiedKFold directly, as well as the other cross-validation schemes you can find in the cross_validation module.
In [ ]:
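For example, one could scan over the SVC regularization parameter C by hand, computing the cross-validated precision at each value (a minimal sketch; the range of C is an arbitrary choice):

In [ ]:

# scan the SVC regularization parameter C by hand,
# recording the mean cross-validated precision at each value
Crange = np.logspace(-2, 2, 40)
scores = [cross_val_score(SVC(C=C), X, y, cv=5, scoring='precision').mean()
          for C in Crange]

plt.semilogx(Crange, scores)
plt.xlabel('C')
plt.ylabel('mean precision')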
This exercise you just completed is an example of a grid search for model evaluation. Again, because this is such a common task, Scikit-learn has a grid search tool built in, which is used as follows. Note that GridSearchCV has a fit method: it is a meta-estimator, an estimator over estimators!
In [25]:
from sklearn.grid_search import GridSearchCV
clf = SVC()
Crange = np.logspace(-2, 2, 40)
grid = GridSearchCV(clf, param_grid={'C': Crange},
                    scoring='precision', cv=5)
grid.fit(X, y)
print "best parameter choice:", grid.best_params_
In [26]:
# each entry of grid_scores_ is (parameters, mean validation score, fold scores)
scores = [g[1] for g in grid.grid_scores_]
plt.semilogx(Crange, scores);
Grid search can come in very handy when you're tuning a model for a particular task.
In [ ]:
from sklearn.neighbors import KNeighborsClassifier
from sklearn.datasets import load_digits
digits = load_digits()
X, y = digits.data, digits.target
# construct the K Neighbors classifier
# Use GridSearchCV to find the best accuracy given choice of ``n_neighbors``
# Plot the accuracy as a function of the number of neighbors.
# Does this change significantly if you use more/fewer folds?
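A minimal sketch of one way to approach this exercise (the range of n_neighbors and the number of folds are arbitrary choices):

In [ ]:

# grid search over the number of neighbors, scoring by accuracy
neighbors_range = np.arange(1, 21)
grid = GridSearchCV(KNeighborsClassifier(),
                    param_grid={'n_neighbors': neighbors_range},
                    scoring='accuracy', cv=5)
grid.fit(X, y)
print "best parameter choice:", grid.best_params_

# plot the mean cross-validated accuracy as a function of n_neighbors
scores = [g[1] for g in grid.grid_scores_]
plt.plot(neighbors_range, scores)
plt.xlabel('n_neighbors')
plt.ylabel('accuracy')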
Now that we've gone over the basics of metrics, validation, and cross-validation, it's time to go into even more depth regarding model selection.
The issues associated with validation and cross-validation are some of the most important aspects of the practice of machine learning. Selecting the optimal model for your data is vital, and is a piece of the problem that is not often appreciated by machine learning practitioners.
Of core importance is the following question:
If our estimator is underperforming, how should we move forward?
The answer is often counter-intuitive. In particular, sometimes using a more complicated model will give worse results, and sometimes adding training data will not improve your results. The ability to determine which steps will improve your model is what separates successful machine learning practitioners from unsuccessful ones.
For this section, we'll work with a simple 1D regression problem. This will help us to
easily visualize the data and the model, and the results generalize easily to higher-dimensional
datasets. We'll explore a simple linear regression problem.
This can be accomplished within scikit-learn with the sklearn.linear_model module.

We'll create a simple nonlinear function that we'd like to fit:
In [27]:
def test_func(x, err=0.5):
    y = 10 - 1. / (x + 0.1)
    if err > 0:
        y = np.random.normal(y, err)
    return y
Now let's create a realization of this dataset:
In [28]:
def make_data(N=40, error=1.0, random_seed=1):
    # randomly sample the data
    np.random.seed(random_seed)
    X = np.random.random(N)[:, np.newaxis]
    y = test_func(X.ravel(), error)
    return X, y
In [29]:
X, y = make_data(40, error=1)
plt.scatter(X.ravel(), y)
Out[29]:
Now say we want to perform a regression on this data. Let's use the built-in linear regression function to compute a fit:
In [30]:
X_test = np.linspace(-0.1, 1.1, 500)[:, None]
from sklearn.linear_model import LinearRegression
model = LinearRegression()
model.fit(X, y)
y_test = model.predict(X_test)
plt.scatter(X.ravel(), y)
plt.plot(X_test.ravel(), y_test)
print "mean squared error:", metrics.mean_squared_error(model.predict(X), y)
We have fit a straight line to the data, but clearly this model is not a good choice. We say that this model is biased, or that it under-fits the data.
Let's try to improve this by creating a more complicated model. We can do this by adding degrees of freedom, and computing a polynomial regression over the inputs. Let's make this easier by creating a quick PolynomialRegression estimator:
In [31]:
class PolynomialRegression(LinearRegression):
    """Simple Polynomial Regression to 1D data"""
    def __init__(self, degree=1, **kwargs):
        self.degree = degree
        LinearRegression.__init__(self, **kwargs)

    def fit(self, X, y):
        if X.shape[1] != 1:
            raise ValueError("Only 1D data valid here")
        Xp = X ** (1 + np.arange(self.degree))
        return LinearRegression.fit(self, Xp, y)

    def predict(self, X):
        Xp = X ** (1 + np.arange(self.degree))
        return LinearRegression.predict(self, Xp)
Now we'll use this to fit a quadratic curve to the data.
In [32]:
model = PolynomialRegression(degree=2)
model.fit(X, y)
y_test = model.predict(X_test)
plt.scatter(X.ravel(), y)
plt.plot(X_test.ravel(), y_test)
print "mean squared error:", metrics.mean_squared_error(model.predict(X), y)
This reduces the mean squared error, and makes a much better fit. What happens if we use an even higher-degree polynomial?
In [33]:
model = PolynomialRegression(degree=30)
model.fit(X, y)
y_test = model.predict(X_test)
plt.scatter(X.ravel(), y)
plt.plot(X_test.ravel(), y_test)
plt.ylim(-4, 14)
print "mean squared error:", metrics.mean_squared_error(model.predict(X), y)
When we increase the degree to this extent, it's clear that the resulting fit is no longer reflecting the true underlying distribution, but is more sensitive to the noise in the training data. For this reason, we call it a high-variance model, and we say that it over-fits the data.
Clearly, computing the error on the training data is not enough (we saw this previously). But computing this training error can help us determine what's going on: in particular, comparing the training error and the validation error can give you an indication of how well your data is being fit.
Let's do this:
In [39]:
degrees = np.arange(1, 30)
X, y = make_data(100, error=1.0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3)
training_error = []
test_error = []
mse = metrics.mean_squared_error
for d in degrees:
    model = PolynomialRegression(d).fit(X_train, y_train)
    training_error.append(mse(model.predict(X_train), y_train))
    test_error.append(mse(model.predict(X_test), y_test))
# note that the test error can also be computed via cross-validation
plt.plot(degrees, training_error, label='training')
plt.plot(degrees, test_error, label='test')
plt.legend()
plt.xlabel('degree')
plt.ylabel('MSE')
Out[39]:
This is a typical bias/variance plot.
On the Left Side of the plot, we have a high-bias model, characterized by the training and test data showing equally bad performance. This shows that the model is under-fitting the data, because it does equally poorly on both known and unknown values.
On the Right Side of the plot, we have a high-variance model, characterized by a divergence of the training and test data. Here the model is over-fitting the data: in other words, the particular noise distribution of the input data has too much effect on the result.
The optimal model here will be around the point where the test error is minimized.
Let's apply this idea to a Decision Tree regression on this dataset. sklearn.tree.DecisionTreeRegressor takes a number of hyper-parameters, but the max_depth parameter in particular controls the tradeoff between bias (under-fitting) and variance (over-fitting).

Generate a dataset like the one above, but containing 500 points, and use a Decision Tree regression to explore the effect of the max_depth parameter on the bias/variance tradeoff. What is the optimal value?
In [ ]:
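A sketch of one possible approach, reusing the training/test error comparison from above (the range of depths is an arbitrary choice):

In [ ]:

from sklearn.tree import DecisionTreeRegressor

# a larger realization of the same dataset
X, y = make_data(500, error=1.0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3)

mse = metrics.mean_squared_error
depths = np.arange(1, 21)
training_error = []
test_error = []

for depth in depths:
    model = DecisionTreeRegressor(max_depth=depth).fit(X_train, y_train)
    training_error.append(mse(model.predict(X_train), y_train))
    test_error.append(mse(model.predict(X_test), y_test))

plt.plot(depths, training_error, label='training')
plt.plot(depths, test_error, label='test')
plt.legend()
plt.xlabel('max_depth')
plt.ylabel('MSE')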
One popular and effective way to address over-fitting is to use ensemble methods. sklearn.ensemble.RandomForestRegressor uses multiple randomized decision trees and averages their results; such an ensemble of estimators can often do better than any individual estimator on data that is over-fit. Repeat the above experiment using the Random Forest regressor, with 10 trees. What is the best max_depth for this model? Does the accuracy improve?
In [ ]:
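A corresponding sketch for the Random Forest version, where only the model construction changes:

In [ ]:

from sklearn.ensemble import RandomForestRegressor

training_error = []
test_error = []

for depth in depths:
    model = RandomForestRegressor(n_estimators=10,
                                  max_depth=depth).fit(X_train, y_train)
    training_error.append(mse(model.predict(X_train), y_train))
    test_error.append(mse(model.predict(X_test), y_test))

plt.plot(depths, training_error, label='training')
plt.plot(depths, test_error, label='test')
plt.legend()
plt.xlabel('max_depth')
plt.ylabel('MSE')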
The exact turning-point of the tradeoff between bias and variance is highly dependent on the number of training points used. Here we'll illustrate the use of learning curves, which display this property.
The idea is to plot the mean squared error for the training and test sets as a function of the number of training points:
In [70]:
X, y = make_data(200, error=1.0)
degree = 3
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3)
N_range = np.linspace(15, X_train.shape[0], 20).astype(int)
def plot_learning_curve(degree=3):
    training_error = []
    test_error = []
    mse = metrics.mean_squared_error

    for N in N_range:
        XN = X_train[:N]
        yN = y_train[:N]
        model = PolynomialRegression(degree).fit(XN, yN)
        training_error.append(mse(model.predict(XN), yN))
        test_error.append(mse(model.predict(X_test), y_test))

    plt.plot(N_range, training_error, label='training')
    plt.plot(N_range, test_error, label='test')
    plt.plot(N_range, np.ones_like(N_range), ':k')
    plt.legend()
    plt.title('degree = {0}'.format(degree))
    plt.xlabel('num. training points')
    plt.ylabel('MSE')
plot_learning_curve(3)
This shows a typical learning curve: for very few training points, there is a large separation between the training and test error, which indicates over-fitting. Given the same model, for a large number of training points, the training and testing errors converge, which indicates that adding more data will not help: any remaining error comes from the bias of the model (potential under-fitting).
It is easy to see in this plot that if you'd like to reduce the MSE down to the nominal value of 1.0 (which is the magnitude of the scatter we put in when constructing the data), then adding more samples will never get you there. For $d=3$, we've already converged. What about for $d=2$?
In [71]:
plot_learning_curve(2)
We see that it hasn't converged yet: you still might expect to decrease the test error by adding more points.
What about the other extreme?
In [72]:
plot_learning_curve(5)
For $d=5$, it's converged, but to a better value than for $d=3$.
Thus you can bring the two curves closer together by adding more points, but you can bring the convergence level down only by adding complexity to the model.
This can be very useful, because it gives you a hint about how to improve upon results which are sub-par. To make this more concrete, imagine an astronomy project in which the results are not robust enough. You must think about whether to spend your valuable telescope time observing more objects to get a larger training set, or more attributes of each object in order to improve the model. The answer to this question is important, and can be addressed using these metrics.
Let's return to our digits data. Using what you've learned here, explore several classifiers and find the optimal classifier for this dataset. What is the best accuracy you can attain on a 5-fold cross-validation over the data? Are there indications that more training data would be helpful in improving this accuracy?
In [ ]:
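As a starting point, one might compare a few of the classifiers seen above using 5-fold cross-validation; a rough sketch (with default parameters, which are certainly not optimal):

In [ ]:

from sklearn.datasets import load_digits
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC

digits = load_digits()
X, y = digits.data, digits.target

# 5-fold cross-validated accuracy for each classifier
for clf in [GaussianNB(), KNeighborsClassifier(), SVC()]:
    scores = cross_val_score(clf, X, y, cv=5)
    print clf.__class__.__name__, scores.mean()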