In [1]:
import matplotlib.pyplot as plt
import numpy as np
%matplotlib inline
In [2]:
from sklearn import datasets
diabetes = datasets.load_diabetes()
# Description at http://www4.stat.ncsu.edu/~boos/var.select/diabetes.html
X = diabetes.data
y = diabetes.target
print(X.shape, y.shape)
from sklearn import linear_model
clf = linear_model.LinearRegression()
clf.fit(X, y)
plt.plot(y, clf.predict(X), 'k.')
plt.show()
All the different models in scikit-learn follow a consistent structure: a class is created for the model, any parameters are set when the class is instantiated, a fit method trains the model on the data, and a predict method generates predictions for new samples.
These are the basic components, with additional methods added when needed. For example, many classifiers also provide a predict_proba method that returns the estimated probability of each class.
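A rough sketch of this shared interface, using the LinearRegression model and diabetes data from the cell above (any other estimator could be swapped in):
from sklearn import datasets, linear_model

diabetes = datasets.load_diabetes()
X, y = diabetes.data, diabetes.target

model = linear_model.LinearRegression()  # 1. create the model, setting any parameters
model.fit(X, y)                          # 2. train on the data
predictions = model.predict(X)           # 3. generate predictions
print(model.score(X, y))                 # 4. built-in score (R^2 for regressors)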
Before we consider whether we have a good model, or which model to choose, we must first decide how we will evaluate our models.
As part of our evaluation, having a single number with which to compare models can be very useful. Choosing a metric that represents our goal as closely as possible allows many models to be compared automatically, which becomes important when choosing model parameters or comparing different types of algorithm.
Even if we have a metric we feel is reasonable, it can be worthwhile examining in detail the predictions made by any model.
The metric we choose will depend on the type of problem we have (regression or classification) and on which aspects of the prediction matter most to us. In a regression task, for example, we may need to decide whether occasional large errors should be penalized much more heavily than consistent small ones: mean_squared_error does exactly that, while mean_absolute_error weights all errors equally.
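As a toy illustration of the difference (the numbers here are made up purely to show the effect):
from sklearn import metrics

y_true = [10, 10, 10, 10]
one_big_miss = [10, 10, 10, 2]     # a single large error
all_small_misses = [8, 8, 8, 8]    # consistent small errors
# Both prediction sets have the same mean absolute error,
# but the single large error dominates the mean squared error.
print(metrics.mean_absolute_error(y_true, one_big_miss),
      metrics.mean_squared_error(y_true, one_big_miss))
print(metrics.mean_absolute_error(y_true, all_small_misses),
      metrics.mean_squared_error(y_true, all_small_misses))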
The scikit-learn documentation provides full lists of the available regression metrics and classification metrics.
We can apply the mean_squared_error metric to the linear regression model on the diabetes dataset:
In [3]:
diabetes = datasets.load_diabetes()
X = diabetes.data
y = diabetes.target
clf = linear_model.LinearRegression()
clf.fit(X, y)
plt.plot(y, clf.predict(X), 'k.')
plt.show()
from sklearn import metrics
metrics.mean_squared_error(y, clf.predict(X))
Out[3]:
Although this single number might seem unimpressive on its own, metrics are a key component of model evaluation. As a simple example, we can perform a permutation test to determine whether we might see this performance by chance.
In [4]:
diabetes = datasets.load_diabetes()
X = diabetes.data
y = diabetes.target
clf = linear_model.LinearRegression()
clf.fit(X, y)
error = metrics.mean_squared_error(y, clf.predict(X))
rounds = 1000
np.random.seed(0)
errors = []
for i in range(rounds):
    y_shuffle = y.copy()
    np.random.shuffle(y_shuffle)
    clf_shuffle = linear_model.LinearRegression()
    clf_shuffle.fit(X, y_shuffle)
    errors.append(metrics.mean_squared_error(y_shuffle, clf_shuffle.predict(X)))
better_models_by_chance = len([i for i in errors if i <= error])
if better_models_by_chance > 0:
    print('Probability of observing a mean_squared_error of {0} by chance is {1}'.format(
        error, better_models_by_chance / rounds))
else:
    print('Probability of observing a mean_squared_error of {0} by chance is <{1}'.format(
        error, 1 / rounds))
In [5]:
from sklearn import tree
diabetes = datasets.load_diabetes()
X = diabetes.data
y = diabetes.target
clf = tree.DecisionTreeRegressor()
clf.fit(X, y)
plt.plot(y, clf.predict(X), 'k.')
plt.show()
metrics.mean_squared_error(y, clf.predict(X))
Out[5]:
In [6]:
from sklearn import neighbors
diabetes = datasets.load_diabetes()
X = diabetes.data
y = diabetes.target
clf = neighbors.KNeighborsRegressor(n_neighbors=1)
clf.fit(X, y)
plt.plot(y, clf.predict(X), 'k.')
plt.show()
metrics.mean_squared_error(y, clf.predict(X))
Out[6]:
Both of these models appear to give perfect predictions, but only because we are evaluating them on the samples they were trained on: each sample is simply mapped back to the matching training sample and the associated value returned.
To understand how a model truly performs we need to evaluate it on previously unseen samples. The general approach is to divide a dataset into training, validation and test datasets. Each model is trained on the training dataset. Multiple models can then be compared by evaluating them against the validation dataset. There is still the potential for choosing a model that performs well on the validation dataset by chance, so a final check is made against the test dataset.
This unfortunately means that part of our, often expensively gathered, data can't be used to train the model. Although it is important to leave out a test dataset, an alternative approach can be used in place of a single validation dataset. Rather than building just one model we can build multiple models, each time leaving out a different validation dataset. Our validation score is then the average across these models. This is known as cross-validation.
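A rough sketch of k-fold cross-validation on the linear regression model (this assumes scikit-learn 0.18 or later, where KFold lives in sklearn.model_selection; older releases provided it in sklearn.cross_validation):
import numpy as np
from sklearn import datasets, linear_model, metrics
from sklearn.model_selection import KFold

diabetes = datasets.load_diabetes()
X, y = diabetes.data, diabetes.target

kf = KFold(n_splits=5, shuffle=True, random_state=0)
errors = []
for train_idx, val_idx in kf.split(X):
    model = linear_model.LinearRegression()
    model.fit(X[train_idx], y[train_idx])
    errors.append(metrics.mean_squared_error(y[val_idx], model.predict(X[val_idx])))
print(errors, np.mean(errors))  # the validation score is the average across the folds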
Scikit-learn provides classes to support cross-validation, as sketched above, but a simple solution can also be implemented directly. Below we separate out a test dataset to evaluate the nearest neighbor model.
In [7]:
from sklearn import neighbors
diabetes = datasets.load_diabetes()
X = diabetes.data
y = diabetes.target
np.random.seed(0)
split = np.random.random(y.shape) > 0.3
X_train = X[split]
y_train = y[split]
X_test = X[np.logical_not(split)]
y_test = y[np.logical_not(split)]
print(X_train.shape, X_test.shape)
clf = neighbors.KNeighborsRegressor(1)
clf.fit(X_train, y_train)
plt.plot(y_test, clf.predict(X_test), 'k.')
plt.show()
metrics.mean_squared_error(y_test, clf.predict(X_test))
Out[7]:
In [8]:
diabetes = datasets.load_diabetes()
X = diabetes.data
y = diabetes.target
np.random.seed(0)
split = np.random.random(y.shape) > 0.3
X_train = X[split]
y_train = y[split]
X_test = X[np.logical_not(split)]
y_test = y[np.logical_not(split)]
print(X_train.shape, X_test.shape)
clf = linear_model.LinearRegression()
clf.fit(X_train, y_train)
plt.plot(y_test, clf.predict(X_test), 'k.')
plt.show()
metrics.mean_squared_error(y_test, clf.predict(X_test))
Out[8]:
Scikit-learn includes a wide variety of models. The most commonly used probably include linear models, support vector machines, nearest neighbor methods, decision trees and ensembles (including boosting), each described briefly below.
We have already seen several examples of regression. The basic form is:
$$f(X) = \beta_{0} + \sum_{j=1}^p X_j\beta_j$$
Each feature is multiplied by a coefficient and the results are summed. For classification this value is then passed through a transformation, such as the logistic function, to limit it to the range 0 to 1.
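As a small check of this formula, the predictions of the fitted LinearRegression model can be recomputed directly from its intercept_ and coef_ attributes (refitting here so the snippet stands alone):
import numpy as np
from sklearn import datasets, linear_model

diabetes = datasets.load_diabetes()
X, y = diabetes.data, diabetes.target
clf = linear_model.LinearRegression().fit(X, y)

# beta_0 + sum_j X_j * beta_j, computed by hand for every sample
manual = clf.intercept_ + X @ clf.coef_
print(np.allclose(manual, clf.predict(X)))  # True: predict() applies the same formula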
Support vector machines attempt to project samples into a higher dimensional space such that they can be divided by a hyperplane. A good explanation can be found in this article.
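As a quick sketch using svm.SVR with an RBF kernel on the same data (the kernel and C value here are arbitrary choices for illustration, not tuned):
import numpy as np
from sklearn import datasets, svm, metrics

diabetes = datasets.load_diabetes()
X, y = diabetes.data, diabetes.target
np.random.seed(0)
split = np.random.random(y.shape) > 0.3

# The RBF kernel implicitly performs the higher-dimensional projection
clf = svm.SVR(kernel='rbf', C=100)
clf.fit(X[split], y[split])
print(metrics.mean_squared_error(y[~split], clf.predict(X[~split])))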
Nearest neighbor methods identify a number of samples from the training set that are close to the new sample and then return the average or most common value depending on the task.
Decision trees attempt to predict the value of a new sample by learning simple rules from the training samples.
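To see the kind of rules a tree learns, we can print the structure of a deliberately shallow tree (this assumes scikit-learn 0.21 or later, which provides tree.export_text):
from sklearn import datasets, tree

diabetes = datasets.load_diabetes()
X, y = diabetes.data, diabetes.target

# Limit the depth so the learned rules stay readable
clf = tree.DecisionTreeRegressor(max_depth=2)
clf.fit(X, y)
print(tree.export_text(clf, feature_names=list(diabetes.feature_names)))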
Ensembles are combinations of other models. Combining different models can improve performance by improving generalization; the average or most common value from the constituent models is returned.
Boosting builds one model and then attempts to reduce the errors with the next model. At each stage the bias in the model is reduced. In this way many weak predictors can be combined into one much more powerful predictor.
I often begin with an ensemble or boosting approach as they typically give very good performance without needing to be carefully optimized. Many of the other algorithms are sensitive to their parameters.
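A rough comparison of the two approaches on a held-out portion of the diabetes data (default parameters apart from n_estimators; exact numbers will vary with the split and library version):
import numpy as np
from sklearn import datasets, ensemble, metrics

diabetes = datasets.load_diabetes()
X, y = diabetes.data, diabetes.target
np.random.seed(0)
split = np.random.random(y.shape) > 0.3

for model in (ensemble.RandomForestRegressor(n_estimators=100),
              ensemble.GradientBoostingRegressor(n_estimators=100)):
    model.fit(X[split], y[split])
    error = metrics.mean_squared_error(y[~split], model.predict(X[~split]))
    print(type(model).__name__, error)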
Many of the models require several different parameters to be specified. Their performance is typically heavily influenced by these parameters and choosing the best values is vital in developing the best model.
Some models have alternative implementations that handle parameter selection in an efficient way.
In [9]:
from sklearn import datasets
diabetes = datasets.load_diabetes()
# Description at http://www4.stat.ncsu.edu/~boos/var.select/diabetes.html
X = diabetes.data
y = diabetes.target
print(X.shape, y.shape)
from sklearn import linear_model
clf = linear_model.LassoCV(cv=20)
clf.fit(X, y)
print('Alpha chosen was ', clf.alpha_)
plt.plot(y, clf.predict(X), 'k.')
Out[9]:
There is an expanded example in the documentation.
There are also general classes to handle parameter selection for situations where dedicated classes are not available. As we will often have parameters in preprocessing steps as well, these general classes are used much more often (a sketch combining preprocessing and model parameters in one search follows the grid search example below).
In [10]:
from sklearn import grid_search
from sklearn import neighbors
diabetes = datasets.load_diabetes()
X = diabetes.data
y = diabetes.target
np.random.seed(0)
split = np.random.random(y.shape) > 0.3
X_train = X[split]
y_train = y[split]
X_test = X[np.logical_not(split)]
y_test = y[np.logical_not(split)]
print(X_train.shape, X_test.shape)
knn = neighbors.KNeighborsRegressor()
parameters = {'n_neighbors':[1,2,3,4,5,6,7,8,9,10]}
clf = grid_search.GridSearchCV(knn, parameters)
clf.fit(X_train, y_train)
plt.plot(y_test, clf.predict(X_test), 'k.')
plt.show()
print(metrics.mean_squared_error(y_test, clf.predict(X_test)))
clf.best_params_  # the parameter setting selected by the grid search
Out[10]:
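To illustrate the point above about parameters in preprocessing steps, here is a sketch that combines a scaler and the nearest neighbor model in a single pipeline and searches over parameters of both at once (this assumes scikit-learn 0.18 or later, where GridSearchCV lives in sklearn.model_selection rather than sklearn.grid_search):
import numpy as np
from sklearn import datasets, metrics
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import KNeighborsRegressor
from sklearn.model_selection import GridSearchCV

diabetes = datasets.load_diabetes()
X, y = diabetes.data, diabetes.target
np.random.seed(0)
split = np.random.random(y.shape) > 0.3

pipe = Pipeline([('scale', StandardScaler()), ('knn', KNeighborsRegressor())])
parameters = {'scale__with_mean': [True, False],
              'knn__n_neighbors': list(range(1, 11))}
clf = GridSearchCV(pipe, parameters)
clf.fit(X[split], y[split])
print(clf.best_params_)
print(metrics.mean_squared_error(y[~split], clf.predict(X[~split])))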
In [15]:
# 1. Load the handwritten digits dataset and choose an appropriate metric
# 2. Divide the data into a training and test dataset
from sklearn import datasets, metrics, ensemble
digits = datasets.load_digits()
X = digits.data
y = digits.target
print(X.shape, y.shape)
np.random.seed(0)
split = np.random.random(y.shape) > 0.3
X_train = X[split]
y_train = y[split]
X_test = X[np.logical_not(split)]
y_test = y[np.logical_not(split)]
In [16]:
# 3. Build a RandomForestClassifier on the training dataset,
# using cross-validation to evaluate performance
scores = []
cv = 10
for i in range(cv):
    split = np.random.random(y_train.shape) > 1/cv
    X_train_train = X_train[split]
    y_train_train = y_train[split]
    X_val = X_train[np.logical_not(split)]
    y_val = y_train[np.logical_not(split)]
    clf = ensemble.RandomForestClassifier(n_estimators=100)
    clf.fit(X_train_train, y_train_train)
    scores.append(metrics.accuracy_score(y_val, clf.predict(X_val)))
print(scores, np.array(scores).mean())
In [19]:
from sklearn import cross_validation
clf = ensemble.RandomForestClassifier(n_estimators=100)
cross_validation.cross_val_score(estimator=clf, X=X_train, y=y_train, cv=10, scoring='accuracy')
Out[19]:
In [21]:
# 6. Comparing the true values with the predictions from the best model identify the
# numbers that are most commonly confused.
clf = ensemble.RandomForestClassifier(n_estimators=100)
clf.fit(X_train, y_train)
metrics.confusion_matrix(y_test, clf.predict(X_test))
Out[21]:
In [23]:
# 4. Choose another classification algorithm and apply it to the digits dataset.
# 5. Use grid search to find the optimal parameters for the chosen algorithm.
from sklearn import svm, grid_search
clf = svm.SVC()
parameters = {'C': [10, 1, 0.1, 0.001, 0.0001, 0.00001],
              'kernel': ['linear', 'poly', 'rbf', 'sigmoid']}
clf = grid_search.GridSearchCV(clf, parameters)
clf.fit(X_train, y_train)
clf.best_params_  # the parameter combination selected by the grid search
Out[23]:
In [24]:
metrics.accuracy_score(y_test, clf.predict(X_test))
Out[24]: