This notebook contains an excerpt from the Python Data Science Handbook by Jake VanderPlas; the content is available on GitHub.
The text is released under the CC-BY-NC-ND license, and code is released under the MIT license. If you find this content useful, please consider supporting the work by buying the book!
In the previous section, we saw the basic recipe for applying a supervised machine learning model:

1. Choose a class of model
2. Choose model hyperparameters
3. Fit the model to the training data
4. Use the model to predict labels for new data

The first two pieces of this, the choice of model and the choice of hyperparameters, are perhaps the most important part of using these tools and techniques effectively. To make an informed choice, we need a way to validate that our model and our hyperparameters are a good fit to the data. While this may sound simple, there are some pitfalls that you must avoid to do this effectively.

Consider first the naive approach: training a model and evaluating it on the same data. We begin by loading the Iris data:
In [1]:
from sklearn.datasets import load_iris
iris = load_iris()
X = iris.data
y = iris.target
Next we choose a model and its hyperparameters. Here we'll use a k-neighbors classifier with n_neighbors=1: that is, each new point is assigned the label of its closest point in the training set.
In [3]:
from sklearn.neighbors import KNeighborsClassifier
model = KNeighborsClassifier(n_neighbors=1)
Then we train the model, and use it to predict labels for data whose labels we already know:
In [4]:
model.fit(X, y)
y_model = model.predict(X)
Finally, we compute the fraction of correctly labeled points:
In [5]:
from sklearn.metrics import accuracy_score
accuracy_score(y, y_model)
Out[5]:
1.0
We see an accuracy score of 1.0, which indicates that 100% of points were correctly labeled by our model! But is this truly measuring the expected accuracy? This approach contains a fundamental flaw: it trains and evaluates the model on the same data. Moreover, the nearest-neighbor model is an instance-based estimator that simply stores the training data and predicts labels by comparing new data to these stored points; except in contrived cases, it will get 100% accuracy every time.

A better sense of a model's performance comes from using a holdout set: that is, we hold back some subset of the data from the training of the model, and then use this holdout set to check the model's performance. This splitting can be done using the train_test_split utility in Scikit-Learn:
In [9]:
from sklearn.model_selection import train_test_split
# split the data with 50% in each set
X1, X2, y1, y2 = train_test_split(X, y, random_state=0,
                                  train_size=0.5, test_size=0.5)
# fit the model on one set of data
model.fit(X1, y1)
# evaluate the model on the second set of data
y2_model = model.predict(X2)
accuracy_score(y2, y2_model)
Out[9]:
We see here a more reasonable result: the accuracy is no longer a perfect 100%, because the holdout set is similar to unknown data that the model has not "seen" before, and it gives a far better gauge of how the model will perform on new data.

One disadvantage of using a holdout set for model validation is that we have lost a portion of our data to the model training; in the case above, half the dataset does not contribute to the training of the model at all. One way to address this is to use cross-validation, which does a sequence of fits where each subset of the data is used both as a training set and as a validation set.
In [10]:
# Here we do two validation trials,
# alternately using each half of the data as a holdout set.
# Using the split data from before, we could implement it like this:
y2_model = model.fit(X1, y1).predict(X2)
y1_model = model.fit(X2, y2).predict(X1)
accuracy_score(y1, y1_model), accuracy_score(y2, y2_model)
Out[10]:
What comes out are two accuracy scores, which we could combine (by, say, taking the mean) to get a better measure of the global model performance. This particular form of cross-validation is a two-fold cross-validation, in which we have split the data into two sets and used each in turn as a validation set.
We could expand on this idea to use even more trials, and more folds in the data—for example, here is a visual depiction of five-fold cross-validation:
Here we split the data into five groups, and use each of them in turn to evaluate the model fit on the other 4/5 of the data.
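As a rough sketch of what such a scheme involves, we could carry out the five folds by hand with Scikit-Learn's KFold iterator; this is essentially what the convenience routine in the next cell automates. Note that for a classifier, cross_val_score defaults to a stratified (and unshuffled) split, so the exact scores from this sketch will differ slightly:

from sklearn.model_selection import KFold

# a minimal sketch of five-fold cross-validation "by hand";
# the cross_val_score call below automates this kind of loop
kf = KFold(n_splits=5, shuffle=True, random_state=0)
fold_scores = []
for train_idx, val_idx in kf.split(X):
    model.fit(X[train_idx], y[train_idx])              # fit on 4/5 of the data
    y_val = model.predict(X[val_idx])                  # predict on the held-out 1/5
    fold_scores.append(accuracy_score(y[val_idx], y_val))
fold_scores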
In [12]:
# We can use Scikit-Learn's cross_val_score convenience routine to do it succinctly:
from sklearn.model_selection import cross_val_score
cross_val_score(model, X, y, cv=5)
Out[12]:
Repeating the validation across different subsets of the data gives us an even better idea of the performance of the algorithm.
Scikit-Learn implements a number of cross-validation schemes that are useful in particular situations; these are available in the model_selection module. For example, we might wish to go to the extreme case in which the number of folds is equal to the number of data points: that is, we train on all points but one in each trial. This type of cross-validation is known as leave-one-out cross-validation, and can be used as follows:
In [16]:
from sklearn.model_selection import LeaveOneOut
scores = cross_val_score(model, X, y, cv=LeaveOneOut())
scores
Out[16]:
Because we have 150 samples, leave-one-out cross-validation yields scores for 150 trials, and each score indicates either a successful (1.0) or an unsuccessful (0.0) prediction. Taking the mean of these gives an estimate of the model's accuracy:
In [17]:
scores.mean()
Out[17]:
Other cross-validation schemes can be used similarly; for a description of what is available in Scikit-Learn, explore the sklearn.model_selection submodule, or refer to Scikit-Learn's online cross-validation documentation.

Now that we've seen the basics of validation and cross-validation, we can go into more depth regarding model selection and selection of hyperparameters. Of core importance is the following question: if our estimator is underperforming, how should we move forward? We might use a more complicated (more flexible) model, use a less complicated (less flexible) model, gather more training samples, or gather more data to add features to each sample. The answer to this question is often counter-intuitive. In particular, sometimes using a more complicated model will give worse results, and adding more training samples may not improve your results! The ability to determine what steps will improve your model is what separates the successful machine learning practitioners from the unsuccessful.
Fundamentally, the question of "the best model" is about finding a sweet spot in the trade-off between bias and variance. Consider a figure presenting two regression fits to the same dataset: a straight line on the left, and a high-order polynomial on the right. The model on the left attempts to find a straight-line fit through the data; because the data are intrinsically more complicated than a straight line, the straight-line model will never describe this dataset well. Such a model is said to under-fit the data: it does not have enough flexibility to account for all the features in the data; in other words, it has high bias. The model on the right attempts to fit a high-order polynomial through the data; the fit is flexible enough to nearly perfectly account for the fine features in the data, but its precise form seems more reflective of the noise in the data than of the process that generated it. Such a model is said to over-fit the data: it has so much flexibility that it ends up accounting for random errors as well as the underlying data distribution; in other words, it has high variance.
To look at this in another light, consider what happens if we use these two models to predict the y-value for some new data. In the following diagrams, the red/lighter points indicate data that is omitted from the training set:
The score used in these diagrams is the $R^2$ score, or coefficient of determination, which measures how well a model performs relative to a simple mean of the target values: $R^2 = 1$ indicates a perfect match, $R^2 = 0$ indicates the model does no better than simply taking the mean of the data, and negative values indicate even worse models. From the scores associated with these two models, we can make an observation that holds more generally: for high-bias models, the performance of the model on the validation set is similar to its performance on the training set, while for high-variance models, the performance on the validation set is far worse than on the training set.
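To make that observation concrete, here is a small self-contained sketch on a hypothetical noisy sine dataset (not the data from the figures above); a degree-1 polynomial plays the role of the high-bias model and a degree-20 polynomial the high-variance one, and the degree choices are arbitrary illustrative values:

# sketch: compare training vs. validation R^2 for a rigid and a very flexible model
import numpy as np
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.model_selection import train_test_split

rng = np.random.RandomState(42)
x = rng.rand(50, 1)
y_demo = np.sin(6 * x).ravel() + 0.2 * rng.randn(50)   # hypothetical noisy data
x_train, x_val, y_train, y_val = train_test_split(x, y_demo, random_state=42)

for deg in [1, 20]:  # degree 1: high bias; degree 20: high variance
    poly_model = make_pipeline(PolynomialFeatures(deg), LinearRegression())
    poly_model.fit(x_train, y_train)
    print(deg,
          poly_model.score(x_train, y_train),   # training R^2
          poly_model.score(x_val, y_val))       # validation R^2

Exact numbers will vary, but the high-bias model typically gives similar (and mediocre) scores on both sets, while the high-variance model scores well on the training set and much worse on the validation set.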
If we imagine that we have some ability to tune the model complexity, we would expect the training score and validation score to behave as illustrated in the following figure:
The diagram shown here is often called a validation curve, and it has a few essential features: the training score is everywhere higher than the validation score; for very low model complexity (a high-bias model) the training data is under-fit, so the model predicts poorly both for the training data and for previously unseen data; for very high model complexity (a high-variance model) the training data is over-fit, so the model predicts the training data very well but fails on previously unseen data; and for some intermediate value, the validation curve has a maximum, indicating a suitable trade-off between bias and variance.

Let's look at an example of using cross-validation to compute the validation curve for a class of models. Here we will use a polynomial regression model: a generalized linear model in which the degree of the polynomial is a tunable parameter.
For example, a degree-1 polynomial fits a straight line to the data; for model parameters $a$ and $b$:

$$ y = ax + b $$

A degree-3 polynomial fits a cubic curve to the data; for model parameters $a, b, c, d$:

$$ y = ax^3 + bx^2 + cx + d $$

In Scikit-Learn, we can implement this with a simple linear regression combined with the polynomial preprocessor.
We will use a pipeline to string these operations together (we will discuss polynomial features and pipelines more fully in Feature Engineering):
In [29]:
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
def PolynomialRegression(degree=2, **kwargs):
    return make_pipeline(PolynomialFeatures(degree),
                         LinearRegression(**kwargs))
Now let's create some data to which we will fit our model:
In [30]:
import numpy as np
def make_data(N, err=1.0, rseed=1):
    # randomly sample the data
    rng = np.random.RandomState(rseed)
    X = rng.rand(N, 1) ** 2
    y = 10 - 1. / (X.ravel() + 0.1)
    if err > 0:
        y += err * rng.randn(N)
    return X, y
X, y = make_data(40)
We can now visualize our data, along with polynomial fits of several degrees:
In [31]:
%matplotlib inline
import matplotlib.pyplot as plt
import seaborn; seaborn.set() # plot formatting
X_test = np.linspace(-0.1, 1.1, 500)[:, None]
plt.scatter(X.ravel(), y, color='black')
axis = plt.axis()
for degree in [1, 3, 5]:
    y_test = PolynomialRegression(degree).fit(X, y).predict(X_test)
    plt.plot(X_test.ravel(), y_test, label='degree={0}'.format(degree))
plt.xlim(-0.1, 1.0)
plt.ylim(-2, 12)
plt.legend(loc='best');
The knob controlling model complexity in this case is the degree of the polynomial, which can be any non-negative integer. A useful question to answer is this: what degree of polynomial provides a suitable trade-off between bias (under-fitting) and variance (over-fitting)?

We can make progress here by visualizing the validation curve for this particular data and model; this can be done straightforwardly using the validation_curve convenience routine provided by Scikit-Learn. Given a model, data, a parameter name, and a range to explore, this function automatically computes both the training score and the validation score across the range:
In [32]:
from sklearn.model_selection import validation_curve

degree = np.arange(0, 21)
train_score, val_score = validation_curve(PolynomialRegression(), X, y,
                                          param_name='polynomialfeatures__degree',
                                          param_range=degree, cv=7)
plt.plot(degree, np.median(train_score, 1), color='blue', label='training score')
plt.plot(degree, np.median(val_score, 1), color='red', label='validation score')
plt.legend(loc='best')
plt.ylim(0, 1)
plt.xlabel('degree')
plt.ylabel('score');
This shows precisely the qualitative behavior we expect: the training score is everywhere higher than the validation score, the training score improves monotonically with increased model complexity, and the validation score reaches a maximum before dropping off as the model becomes over-fit.

From the validation curve, we can read off that the optimal trade-off between bias and variance is found for a third-order polynomial; we can compute and display this fit over the original data as follows:
In [33]:
plt.scatter(X.ravel(), y)
lim = plt.axis()
y_test = PolynomialRegression(3).fit(X, y).predict(X_test)
plt.plot(X_test.ravel(), y_test);
plt.axis(lim);
Notice that finding this optimal model did not actually require us to compute the training score, but examining the relationship between the training score and the validation score gives us useful insight into the performance of the model.

One important aspect of model complexity is that the optimal model will generally depend on the size of your training data. For example, let's generate a new dataset with a factor of five more points:
In [34]:
X2, y2 = make_data(200)
plt.scatter(X2.ravel(), y2);
We will duplicate the preceding code to plot the validation curve for this larger dataset; for reference, we over-plot the previous results as well:
In [15]:
degree = np.arange(21)
train_score2, val_score2 = validation_curve(PolynomialRegression(), X2, y2,
                                            param_name='polynomialfeatures__degree',
                                            param_range=degree, cv=7)
plt.plot(degree, np.median(train_score2, 1), color='blue', label='training score')
plt.plot(degree, np.median(val_score2, 1), color='red', label='validation score')
plt.plot(degree, np.median(train_score, 1), color='blue', alpha=0.3, linestyle='dashed')
plt.plot(degree, np.median(val_score, 1), color='red', alpha=0.3, linestyle='dashed')
plt.legend(loc='lower center')
plt.ylim(0, 1)
plt.xlabel('degree')
plt.ylabel('score');
The solid lines show the new results, while the fainter dashed lines show the results of the previous smaller dataset.
Thus we see that the behavior of the validation curve has not one but two important inputs: the model complexity and the number of training points. It is often useful to explore the behavior of the model as a function of the number of training points, which we can do by using increasingly larger subsets of the data to fit the model.
A plot of the training/validation score with respect to the size of the training set is known as a learning curve.
The general behavior we would expect from a learning curve is this: a model of a given complexity will over-fit a small dataset, meaning the training score will be relatively high while the validation score will be relatively low; a model of a given complexity will under-fit a large dataset, meaning that the training score will decrease while the validation score will increase; and a model will never, except by chance, give a better score to the validation set than to the training set, meaning the two curves should keep getting closer together but never cross.
With these features in mind, we would expect a learning curve to look qualitatively like that shown in the following figure:
The notable feature of the learning curve is its convergence to a particular score as the number of training samples grows. In particular, once you have enough points that a given model has converged, adding more training data will not help you; the only way to increase model performance in that regime is to use another (often more complex) model. Scikit-Learn offers a convenient utility for computing such learning curves; here we compute learning curves for our original dataset with a second-order polynomial model and with a ninth-order polynomial:
In [41]:
from sklearn.model_selection import learning_curve
import warnings
warnings.filterwarnings("ignore")
fig, ax = plt.subplots(1, 2, figsize=(16, 6))
fig.subplots_adjust(left=0.0625, right=0.95, wspace=0.1)

for i, degree in enumerate([2, 9]):
    N, train_lc, val_lc = learning_curve(PolynomialRegression(degree),
                                         X, y, cv=7,
                                         train_sizes=np.linspace(0.3, 1, 25))

    ax[i].plot(N, np.mean(train_lc, 1), color='blue', label='training score')
    ax[i].plot(N, np.mean(val_lc, 1), color='red', label='validation score')
    ax[i].hlines(np.mean([train_lc[-1], val_lc[-1]]), N[0], N[-1],
                 color='gray', linestyle='dashed')

    ax[i].set_ylim(0, 1)
    ax[i].set_xlim(N[0], N[-1])
    ax[i].set_xlabel('training size', fontsize=30)
    ax[i].set_ylabel('score', fontsize=30)
    ax[i].set_title('degree = {0}'.format(degree), size=24)
    ax[i].legend(loc='best', fontsize=30)

#fig.savefig('figures/05.03-learning-curve2.png')
<img src="./img/figures/05.03-learning-curve2.png" width="800px">
This is a valuable diagnostic, because it gives a visual depiction of how the model responds to increasing amounts of training data. In particular, when your learning curve has already converged (that is, when the training and validation curves are already close to each other), adding more training data will not significantly improve the fit! This situation is seen in the left panel, with the learning curve for the degree-2 model.

The only way to increase the converged score is to use a different (usually more complicated) model. We see this in the right panel: by moving to a much more complicated model, we increase the score of convergence (indicated by the dashed line), but at the expense of higher model variance (indicated by the gap between the training and validation scores). If we were to add even more data points, the learning curve for the more complicated model would eventually converge.
Plotting a learning curve for your particular choice of model and dataset can help you to make this type of decision about how to move forward in improving your analysis.
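As a quick sketch to check that last claim, we can reuse the larger X2, y2 dataset generated above and compute the learning curve for the degree-9 model with more data available; by the largest training sizes, the training and validation curves should be noticeably closer together than in the right panel above:

# sketch: learning curve for the degree-9 model on the larger dataset X2, y2
N2, train_lc2, val_lc2 = learning_curve(PolynomialRegression(9),
                                        X2, y2, cv=7,
                                        train_sizes=np.linspace(0.3, 1, 25))

plt.plot(N2, np.mean(train_lc2, 1), color='blue', label='training score')
plt.plot(N2, np.mean(val_lc2, 1), color='red', label='validation score')
plt.xlabel('training size')
plt.ylabel('score')
plt.ylim(0, 1)
plt.legend(loc='best');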
The preceding discussion is meant to give some intuition into the trade-off between bias and variance, and its dependence on model complexity and training set size. In practice, models generally have more than one knob to turn, and thus plots of validation and learning curves change from lines to multi-dimensional surfaces. In these cases, such visualizations are difficult, and we would rather simply find the particular model that maximizes the validation score. Scikit-Learn provides automated tools to do this in the grid search module.

Here is an example of using grid search to find the optimal polynomial model. We will explore a three-dimensional grid of model features: the polynomial degree, the flag telling us whether to fit the intercept, and the flag telling us whether to normalize the problem. This can be set up using Scikit-Learn's GridSearchCV meta-estimator:
In [42]:
from sklearn.model_selection import GridSearchCV

# note: the 'normalize' option was removed from LinearRegression in recent
# Scikit-Learn releases; drop that entry if your version rejects it
param_grid = {'polynomialfeatures__degree': np.arange(21),
              'linearregression__fit_intercept': [True, False],
              'linearregression__normalize': [True, False]}
grid = GridSearchCV(PolynomialRegression(), param_grid, cv=7)
Notice that like a normal estimator, this has not yet been applied to any data.
Calling the fit()
method will fit the model at each grid point, keeping track of the scores along the way:
In [43]:
grid.fit(X, y);
Now that this is fit, we can ask for the best parameters as follows:
In [44]:
grid.best_params_
Out[44]:
Finally, if we wish, we can use the best model and show the fit to our data using code from before:
In [45]:
model = grid.best_estimator_
plt.scatter(X.ravel(), y)
lim = plt.axis()
y_test = model.fit(X, y).predict(X_test)
plt.plot(X_test.ravel(), y_test);
plt.axis(lim);
The grid search provides many more options, including the ability to specify a custom scoring function, to parallelize the computations, to do randomized searches, and more. For more information, see the examples in In-Depth: Kernel Density Estimation and Feature Engineering: Working with Images, or refer to Scikit-Learn's grid search documentation.
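For instance, here is a minimal sketch (not from the text) of two of those options: RandomizedSearchCV samples a fixed number of parameter settings rather than trying every combination, and n_jobs=-1 runs the fits in parallel. The n_iter and scoring values here are arbitrary illustrative choices:

from sklearn.model_selection import RandomizedSearchCV

# sketch: a randomized, parallelized search over a similar parameter grid
rand_grid = RandomizedSearchCV(PolynomialRegression(),
                               {'polynomialfeatures__degree': np.arange(21),
                                'linearregression__fit_intercept': [True, False]},
                               n_iter=10, cv=7,
                               scoring='neg_mean_squared_error',
                               n_jobs=-1, random_state=0)
rand_grid.fit(X, y)
rand_grid.best_params_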
In this section, we have begun to explore the concept of model validation and hyperparameter optimization, focusing on intuitive aspects of the bias–variance trade-off and how it comes into play when fitting models to data.
In particular, we found that the use of a validation set or cross-validation approach is vital when tuning parameters in order to avoid over-fitting for more complex/flexible models.